Web Crawling with Python

For the last two weeks we have learned Python and XPath. We are now going to combine the two lessons to write a web crawler. The code below was originally written by Umer Javed; you can crawl to his link in the reference below ;)

import time

from lxml import html
import requests

class AppCrawler:

    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth        # how many levels of links to follow
        self.current_depth = 0
        self.depth_links = []     # one list of discovered links per level
        self.apps = []            # every App object collected so far

    def crawl(self):
        # Fetch the starting page and seed the first level of links
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.links)

        # Follow each level's links until the requested depth is reached
        while self.current_depth < self.depth:
            current_links = []
            for link in self.depth_links[self.current_depth]:
                current_app = self.get_app_from_link(link)
                current_links.extend(current_app.links)
                self.apps.append(current_app)
                time.sleep(5)  # pause between requests so we do not hammer the site
            self.current_depth += 1
            self.depth_links.append(current_links)

    def get_app_from_link(self, link):
        # Download the page and parse it into an element tree
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)

        # Pull the post title and any links out of the matching anchor
        title = tree.xpath('//a[contains(text(),"Week 2 - What Are Buffer Overflows")]/text()')[0]
        links = tree.xpath('//a[contains(text(),"Week 2 - What Are Buffer Overflows")]/@href')
        app = App(title, links)
        return app

class App:

    def __init__(self, title, links):
        self.title = title
        self.links = links

    def __str__(self):
        return ("Title: " + self.title +
                "\r\nLinks: " + ", ".join(self.links) + "\r\n")

# A depth of 0 crawls only the starting page
crawler = AppCrawler('https://bufferoverflowattack.blogspot.com', 0)
crawler.crawl()

for app in crawler.apps:
    print(app)

We will go over the above app from a high-level perspective; a more detailed explanation can be found in the reference below. The AppCrawler class is constructed with a starting_url and a depth. The crawl function fetches the starting page, then follows the links returned by get_app_from_link for each page, incrementing a counter (current_depth) by 1 on each pass. This lets the user specify how many levels of a site should be crawled.
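To picture what crawl is building, here is a hypothetical illustration of depth_links after a run with a depth of 2. Every URL below is invented purely for demonstration; these are not real pages on either blog.

# Hypothetical run; the site and its links are made up for illustration only
crawler = AppCrawler('https://example.com', 2)
crawler.crawl()

# crawler.depth_links would then hold one list of links per level, e.g.:
# [
#     ['https://example.com/a', 'https://example.com/b'],  # level 0: links on the start page
#     ['https://example.com/c'],                           # level 1: links found on pages a and b
#     [],                                                  # level 2: links found on page c
# ]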

This code was modified to crawl my own blog. You will notice a time.sleep function call. The point of the sleep is to make sure a site does not blacklist you for performing too many GET requests at once, which would look like a denial of service (DoS). Please be aware, and do not attempt to DoS my blog, as I will find you!
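If you want to be a bit less predictable than a fixed five-second pause, a randomized delay is a common refinement. The helper below is a minimal sketch of that idea; the function name and the 3-7 second window are my own choices, not part of the tutorial.

import random
import time

import requests

def polite_get(url, min_delay=3, max_delay=7):
    # Pause for a random interval before each request so the traffic
    # pattern is spaced out rather than a mechanical burst
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url)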

Unfortunately, I can only go one level deep on my blog. If I pass in a parameter of 1 for depth, I get the following error in the stack trace:
line 32, in get_app_from_link
    title = tree.xpath('//a[contains(text(),"Week 2 - What Are Buffer Overflow")]/text()')[0]
IndexError: list index out of range

Due to limited time, I have not yet found a solution to this. The issue is that the title, which is a link on the first page, becomes a header on the next page, and I am not sure how to grab the title in both cases using the same XPath.
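One idea worth trying (untested against the live blog markup) is an XPath union: match the anchor text and the header text in a single expression, and guard against an empty result so the IndexError cannot occur. The h3 tag below is an assumption about how Blogger renders the post title.

def get_title(tree):
    # Try both the anchor (listing page) and the header (post page);
    # h3 is a guess at Blogger's post-title markup
    matches = tree.xpath(
        '//a[contains(text(),"Week 2 - What Are Buffer Overflows")]/text()'
        ' | //h3[contains(text(),"Week 2 - What Are Buffer Overflows")]/text()'
    )
    return matches[0].strip() if matches else None  # None instead of IndexError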

This may be a good exercise for us to work through. If I find the answer, I will reply to this post, and you may do the same. If you think there is a better way to write this program, let me know in the comments. Unfortunately, Python is going to be shelved for now so that we may explore Kali Linux and its tools.

Javed, U. (2016, January 4). 6 - Crawling to Other Pages - Web Crawling with Python [Video]. Retrieved from https://www.youtube.com/watch?v=VqV4mxzqIbA&list=PLp55XiVzZ1_iUYan9C3NsWhdkw4Gbnovx&index=6
