Introduction

Let’s say our website has gotten kind of big! The amount of links has increased substantially, and as we all know dead links are bad! It is now becoming increasingly tedious to check links manually. Our task ahead is to build a tool which searches our site, following links and ensuring they are still functioning. To do this in an efficient manner we will use the python threading module.

Getting started

Let’s create our project and install our dependencies. I will be using the Requests module for making http requests and Pythons inbuilt html parser for finding links on the page. We could use a more robust html parser such as Beautiful Soup, however for this project it seemed a little overkill.

mkdir link_checker && cd link_checker
virtualenv -p python3 .env
pip install requests
touch main.py

Our next task is to edit main.py and import our dependencies. Open main.py and create the basic structure for our program.

from html.parser import HTMLParser
import queue
import sys
import threading
import requests

class LinkChecker:
   def run(self):
       print('app running')

if __name__ == '__main__':
   app = LinkChecker()
   app.run()

If we run the program in its current state ‘app running’ should appear in the terminal.
We are importing 3 standard modules and one third party module (requests) which we installed earlier. What is it these modules do exactly?

Html parser

We will use the html.parser library to find all <a href=”some/url”> tags in our pages

Threading

Every time we create an HTTP request, our program stops and waits for the server to respond. This is a perfect opportunity to increase efficiency by using Pythons Threading module. A thread allows the CPU to work on different tasks seemingly at the same time. When waiting for a response from the server, our program will continue executing on a different thread.

Queue

We will use the python queue module to manage our threads. We will populate the queue with urls for our scraper to hit and threaded workers will begin by grabbing the url from the queue.

Sys

We will use the sys module to get arguments from the command line.

Lets flesh out our program a little:
Create an __init__() method on our class to check if the user has supplied a url from the command line and if not prompt them to enter one. We will also initialise our Queue.

class LinkChecker:
   def __init__(self, **kwargs):
       self.queue = queue.Queue()
       self.max_workers = 5
       self.root_url = self.get_url_arg()
       self.queue.put(self.root_url)

   def run(self):
       print('#### Link Checker ####')
       print('checking: {}'.format(self.root_url))

   def get_url_arg(self):
       # Get command line argument or prompt user for url
       url = None
       if len(sys.argv) > 1:
           url = sys.argv[1]
       else:
           url = input("please enter a url:")
       return url + '/'

if __name__ == '__main__':
   app = LinkChecker()
   app.run()

Now we can run the program and test that the entered url is printed to the screen.

Now let’s get on with creating some threaded workers and giving them some tasks to do. To do this we will create a thread for the each of the workers we want to activate. Each thread will be passed a callable which we will name worker. This callable will contain instructions for the thread to perform. Once a worker has finished all its tasks we will call queue.task_done(). This signals that the worker should fetch a new task from the queue. Next we call queue.join(). This function waits for all tasks in the queue to complete, after which the program will continue execution as per usual.

def run(self):
   '''Main run method of the program'''
   print('\n#### Link Checker ####')
   print('checking: {}'.format(self.root_url))

   # create a threads
   for _ in range(self.max_workers):
       t = threading.Thread(target=self.worker)
       t.daemon = True
       t.start()

   #wait for all tasks to complete
   self.queue.join()

   print('DONE')

def worker(self):
   while True:
       url = self.queue.get()
       try:
           res = requests.get(url)
           print('{}:  {}'.format(url, res.status_code))
           if self.root_url in url:
               self.get_a_tags(res)
       except Exception as e:
              print(e)
              pass
       self.queue.task_done()

To make sure we are only checking links on a specified site (and not sending our workers off on a wild goose chase all over the internet) we check that the domain is the same before crawling it for links.

At the moment our workers are just reporting the http status code of the supplied url. Let’s give them something a little more interesting to do. We will create a class inheriting from HTMLParser which searches out all <a> tags and returns the href value.

class HrefFinder(HTMLParser):
   hrefs = []
   def handle_starttag(self, tag, attrs):
       if tag == 'a':
           for a in attrs:
               if a[0] == 'href':
                   self.hrefs.append(a[1])

   def get_hrefs(self):
       return self.hrefs

 

In our main class we create an instance of HrefFinder and feed it the response text from our http request:

def get_a_tags(self, res):
   parser = HrefFinder()
   parser.feed(res.text)
   self.construct_urls(parser.get_hrefs())

Finally we will pass the resulting hrefs to our url constructor function. This function will check if the url is relative or absolute, if it is relative it will construct an absolute url using the root url argument passed to the program. We also use a Set() adding each url to make sure we don’t visit it more than once. The variable results will store our results

    def __init__(self):
       #--- Other stuff ---# 
       # use to check if a url has been seen or not
       self.seen = set()
       self.results = []

    def construct_urls(self, hrefs):
       # Construct valid absolute urls from href links
       for h in hrefs:
           url = None

           if 'http' in h or 'https' in h:
               url = '' + h
           else:
               url = self.root_url + h

           # Make sure we have not already checked the url and add it to our
           if url and url not in self.seen:
               self.seen.add(url)
               self.queue.put(url)

Let’s also change our worker method slightly in order to save our results instead of just printing them:

def worker(self):
   while True:
       url = self.queue.get()
       try:
           res = requests.get(url)
           self.results.append((url, res.status_code, requests.status_codes._codes[res.status_code][0],))
              
           # check that the site we are hitting is the domain specified through sys.argv[]
           if self.root_url in url:
               self.get_a_tags(res)
          
       except Exception as e:
           print('{}:  {}'.format(url, e))
           pass
       self.queue.task_done()

At this point our program can successfully do the following

  • Search our site for all links
  • Report whether the link is reachable or returns an error

This is pretty cool but imagine we now want to go ahead and fix the link. Our output isn’t very helpful to us as it does not tell us on which page the broken link exists.

Let’s change the code and instead of simply adding a url to our queue we will also add a referrer url. In other words the page on which the url was found. This makes our output much more useful.

At this point I also decided to refactor my code.

To ensure my class is reusable I move the get_url() method out to a stand alone function. Then pass the results to my classes __init__() function directly.

I create a try catch statement in my run method to catch any errors that the requests library might encounter. I also move some functionality from the construct_urls() method up to the run() method.

The final script can be seen below:

from html.parser import HTMLParser
import queue
import sys
import threading
import requests

class HrefFinder(HTMLParser):
   '''parse html and find href value of all  tags'''
   hrefs = []
   def handle_starttag(self, tag, attrs):
       if tag == 'a':
           for a in attrs:
               if a[0] == 'href':
                   self.hrefs.append(a[1])

   def get_hrefs(self):
       return self.hrefs

class LinkChecker:
   '''
   Searches a site for all links checking to see if they work
   Thread starts a worker which takes a url from the Queue,
   then creates a request and constructs a list of links from the returned page.
   '''

   def __init__(self, url, *args, **kwargs):
       self.max_workers = 5
       self.queue = queue.Queue()
       self.root_url = url
       self.queue.put(('', self.root_url,))

       # use to check if a url has been seen or not
       self.seen = set()

       # somewhere to store our results
       self.results = []

   def run(self):
       '''
       Main run method of the program
       creates threads and waits for them to complete
       '''
       print('\n#### Link Checker ####')
       print('checking: {}'.format(self.root_url))

       # create some threaded workers
       for _ in range(self.max_workers):
           t = threading.Thread(target=self._worker)
           t.daemon = True
           t.start()

       # wait for tasks in queue to finish
       self.queue.join()

       print('DONE')
       for r in self.results:
           print(r)

   def _worker(self):
       while True:
           url = self.queue.get()
           try:
               res = requests.get(url[1])
               self.results.append((url[0], url[1], res.status_code, requests.status_codes._codes[res.status_code][0],))
               # check that the site we are hitting is the domain specified through sys.argv
               if self.root_url in url[1]:
                   all_links = self._get_a_tags(res)
                   for l in all_links:
                       # Make sure we have not already checked the url and add it to our
                       if l not in self.seen:
                           self.seen.add(l)
                           self.queue.put((url[1], l,))
                       else:
                           continue
          
           except Exception as e:
               # in case the http request fails due to bad url or protocol
               self.results.append((url[0], url[1], e))

           # our task is complete!
           self.queue.task_done()

   def _get_a_tags(self, res):
       # feed response data to instance of HferFinder class
       parser = HrefFinder()
       parser.feed(res.text)
       return self._construct_urls(parser.get_hrefs())

   def _construct_urls(self, hrefs):
       # Construct valid absolute urls from href links on the page
       links = []
       for h in hrefs:
           url = None

           if 'http' in h or 'https' in h:
               url = '' + h
           else:
               url = self.root_url + h

           if url and 'mailto:' not in url:
               links.append(url)
           else:
               pass

       return links

def get_url_arg():
   # Get command line argument or prompt user for url
   url = None
   if len(sys.argv) > 1:
       url = sys.argv[1]
   else:
       url = input("please enter a url:")
   return url

if __name__ == '__main__':
   url = get_url_arg()
   app = LinkChecker(url)
   app.run()

 

Final thoughts

There are a number of ways we may decide to extend this application, some ideas include:

  • Output to a CSV report
  • Create a web application where the results are returned as html
  • Create a desktop application with a library such as Tkinter or Kivy