Python threads: Building a link validation tool

By Ethan Shearer May 22, 2019

Introduction

Let’s say our website has gotten kind of big! The number of links has increased substantially, and as we all know, dead links are bad. It is becoming increasingly tedious to check links manually. Our task is to build a tool which crawls our site, following links and making sure they still work. To do this efficiently we will use Python’s threading module.

Getting started

Let’s create our project and install our dependencies. I will be using the Requests module for making HTTP requests and Python’s built-in html.parser for finding links on the page. We could use a more robust HTML parser such as Beautiful Soup, but for this project that seemed like overkill.

mkdir link_checker && cd link_checker
virtualenv -p python3 .env
source .env/bin/activate  # activate the environment so pip installs into it
pip install requests
touch main.py

Our next task is to edit main.py and import our dependencies. Open main.py and create the basic structure for our program.

from html.parser import HTMLParser
import queue
import sys
import threading
import requests

class LinkChecker:
    def run(self):
        print('app running')

if __name__ == '__main__':
    app = LinkChecker()
    app.run()

If we run the program in its current state, ‘app running’ should appear in the terminal. We are importing four standard-library modules and one third-party module (requests), which we installed earlier. What exactly do these modules do?

HTML parser

We will use the html.parser library to find all <a href="some/url"> tags in our pages.

Threading

Every time we make an HTTP request, our program stops and waits for the server to respond. This is a perfect opportunity to increase efficiency by using Python’s threading module. Threads allow the CPU to work on different tasks seemingly at the same time: while one thread waits for a response from the server, our program can continue executing on a different thread.
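
Threads shine here because the work is I/O bound. A throwaway sketch (separate from our tool, with made-up names and time.sleep() standing in for a slow HTTP request) shows three "requests" completing in roughly one second instead of three:

import threading
import time

def fake_request(name):
    # Simulate a slow network call; while one thread sleeps,
    # the others keep running.
    time.sleep(1)
    print('{} finished'.format(name))

start = time.time()
threads = [threading.Thread(target=fake_request, args=('request-{}'.format(i),)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('took roughly {:.1f}s'.format(time.time() - start))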

Queue

We will use Python’s queue module to manage the work. We will populate the queue with URLs for our scraper to hit, and each threaded worker will start by grabbing the next URL from the queue.
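
To make the mechanics concrete, here is a tiny, self-contained illustration (not part of the link checker; the item names are placeholders) of how put(), get(), task_done() and join() fit together:

import queue
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        print('processing {}'.format(item))
        # tell the queue this item has been handled so q.join() can return
        q.task_done()

t = threading.Thread(target=worker)
t.daemon = True  # the thread dies when the main program exits
t.start()

for url in ('page-1', 'page-2', 'page-3'):
    q.put(url)

q.join()  # blocks until every queued item has been marked done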

Sys

We will use the sys module to get arguments from the command line.
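
For reference, sys.argv is just a list of the command-line tokens, with the script name first. A quick illustration (the file name and URL are placeholders):

import sys

# running: python main.py https://example.com
# gives:   sys.argv == ['main.py', 'https://example.com']
if len(sys.argv) > 1:
    print('first argument: {}'.format(sys.argv[1]))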

Let’s flesh out our program a little. Create an __init__() method on our class which checks whether the user has supplied a URL on the command line and, if not, prompts them to enter one. We will also initialise our queue.

class LinkChecker:
    def __init__(self, **kwargs):
        self.queue = queue.Queue()
        self.max_workers = 5
        self.root_url = self.get_url_arg()
        self.queue.put(self.root_url)

    def run(self):
        print('#### Link Checker ####')
        print('checking: {}'.format(self.root_url))

    def get_url_arg(self):
        # Get command line argument or prompt user for url
        url = None
        if len(sys.argv) > 1:
            url = sys.argv[1]
        else:
            url = input("please enter a url:")
        return url + '/'

if __name__ == '__main__':
    app = LinkChecker()
    app.run()

Now we can run the program and check that the entered URL is printed to the screen.
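
Assuming the file is saved as main.py and using a placeholder URL, a quick test run should look something like this:

python main.py https://example.com

#### Link Checker ####
checking: https://example.com/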

Now let’s get on with creating some threaded workers and giving them some tasks to do. To do this we will create a thread for each of the workers we want to activate. Each thread is passed a callable, which we will name worker, containing the instructions for the thread to carry out. Every time a worker finishes a task it calls queue.task_done(), which tells the queue that the item has been processed; the worker then loops back and fetches the next task with queue.get(). Finally, the main thread calls queue.join(), which blocks until every task in the queue has been marked done, after which the program continues execution as per usual.

def run(self):
    '''Main run method of the program'''
    print('\n#### Link Checker ####')
    print('checking: {}'.format(self.root_url))

    # create the worker threads
    for _ in range(self.max_workers):
        t = threading.Thread(target=self.worker)
        t.daemon = True
        t.start()

    # wait for all tasks to complete
    self.queue.join()

    print('DONE')

def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            print('{}:  {}'.format(url, res.status_code))
            if self.root_url in url:
                self.get_a_tags(res)
        except Exception as e:
            print(e)
        self.queue.task_done()

To make sure we are only checking links on the specified site (and not sending our workers off on a wild goose chase all over the internet), we check that the URL belongs to our domain before crawling it for links.
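
As an aside, the substring check used in the worker (self.root_url in url) is simple but can give false positives, for example when an external URL happens to contain our root URL in its query string. If that ever becomes a problem, a stricter comparison with the standard library's urllib.parse is a small change; this is just a sketch and is not used in the rest of the article:

from urllib.parse import urlparse

def same_domain(root_url, url):
    # compare only the host portion of the two URLs
    return urlparse(url).netloc == urlparse(root_url).netloc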

At the moment our workers just report the HTTP status code of the supplied URL. Let’s give them something a little more interesting to do. We will create a class inheriting from HTMLParser which searches out all <a> tags and collects their href values.

class HrefFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        # an instance attribute, so each page gets its own list of links
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs

In our main class we create an instance of HrefFinder and feed it the response text from our HTTP request:

def get_a_tags(self, res):
    parser = HrefFinder()
    parser.feed(res.text)
    self.construct_urls(parser.get_hrefs())

Finally we pass the resulting hrefs to our URL constructor method. This method checks whether each URL is relative or absolute; if it is relative, it constructs an absolute URL using the root URL passed to the program. We also add each URL to a set so that we don’t visit it more than once, and a results list will store our findings.

def __init__(self):
    # --- Other stuff --- #
    # used to check if a url has been seen or not
    self.seen = set()
    self.results = []

def construct_urls(self, hrefs):
    # Construct valid absolute urls from href links
    for h in hrefs:
        # absolute links are used as-is, relative links are joined to the root url
        if h.startswith('http'):
            url = h
        else:
            url = self.root_url + h

        # make sure we have not already checked the url before adding it to the queue
        if url and url not in self.seen:
            self.seen.add(url)
            self.queue.put(url)
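
One caveat worth flagging: plain string concatenation only copes with the simplest relative links. If a site uses links like '../contact' or protocol-relative '//cdn.example.com/app.js', the standard library's urllib.parse.urljoin handles them correctly. A quick illustration with a placeholder base URL (not used in the code above):

from urllib.parse import urljoin

base = 'https://example.com/blog/'
print(urljoin(base, 'post-1'))             # https://example.com/blog/post-1
print(urljoin(base, '/about'))             # https://example.com/about
print(urljoin(base, '../contact'))         # https://example.com/contact
print(urljoin(base, 'https://other.com'))  # https://other.com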

Let’s also change our worker method slightly in order to save our results instead of just printing them:

def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            # look up a human-readable name for the status code, e.g. 404 -> 'not_found'
            self.results.append((url, res.status_code, requests.status_codes._codes[res.status_code][0],))

            # check that the site we are hitting is the domain specified through sys.argv
            if self.root_url in url:
                self.get_a_tags(res)

        except Exception as e:
            print('{}:  {}'.format(url, e))
        self.queue.task_done()

At this point our program can successfully do the following:

  • Search our site for all links
  • Report whether the link is reachable or returns an error

This is pretty cool, but imagine we now want to go ahead and fix a broken link. Our output isn’t very helpful, as it does not tell us on which page the broken link exists.

Let’s change the code so that, instead of simply adding a URL to our queue, we also add a referrer URL: in other words, the page on which the URL was found. This makes our output much more useful.
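
Concretely, each queue item becomes a (referrer, url) tuple instead of a bare string. A sketch of the change (variable names here are illustrative; the full version is in the final script below):

# seed the queue: the root page has no referrer
self.queue.put(('', self.root_url,))

# inside the worker, queue each newly found link together with
# the page it was found on
self.queue.put((current_page_url, link,))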

At this point I also decided to refactor my code.

To make the class reusable, I move the get_url_arg() method out into a standalone function and pass its result directly to the class’s __init__() method.

The try/except statement in the worker method catches any errors the requests library might raise and now records them in the results list. I also move the “seen” bookkeeping from the construct_urls() method up into the worker.

The final script can be seen below:

from html.parser import HTMLParser
import queue
import sys
import threading
import requests

class HrefFinder(HTMLParser):
    '''Parse html and find the href value of all <a> tags'''

    def __init__(self):
        super().__init__()
        # an instance attribute, so each page gets its own list of links
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs

class LinkChecker:
    '''
    Searches a site for all links, checking to see if they work.
    Each thread runs a worker which takes a url from the queue,
    makes a request and constructs a list of links from the returned page.
    '''

    def __init__(self, url, *args, **kwargs):
        self.max_workers = 5
        self.queue = queue.Queue()
        self.root_url = url
        # each queue item is a (referrer, url) tuple; the root page has no referrer
        self.queue.put(('', self.root_url,))

        # used to check if a url has been seen or not
        self.seen = set()

        # somewhere to store our results
        self.results = []

    def run(self):
        '''
        Main run method of the program:
        creates threads and waits for them to complete
        '''
        print('\n#### Link Checker ####')
        print('checking: {}'.format(self.root_url))

        # create some threaded workers
        for _ in range(self.max_workers):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()

        # wait for tasks in queue to finish
        self.queue.join()

        print('DONE')
        for r in self.results:
            print(r)

    def _worker(self):
        while True:
            url = self.queue.get()
            try:
                res = requests.get(url[1])
                # record (referrer, url, status code, status name)
                self.results.append((url[0], url[1], res.status_code, requests.status_codes._codes[res.status_code][0],))
                # check that the site we are hitting is the domain specified through sys.argv
                if self.root_url in url[1]:
                    all_links = self._get_a_tags(res)
                    for l in all_links:
                        # make sure we have not already checked the url before adding it to the queue
                        if l not in self.seen:
                            self.seen.add(l)
                            self.queue.put((url[1], l,))

            except Exception as e:
                # in case the http request fails due to a bad url or protocol
                self.results.append((url[0], url[1], e))

            # our task is complete!
            self.queue.task_done()

    def _get_a_tags(self, res):
        # feed the response data to an instance of the HrefFinder class
        parser = HrefFinder()
        parser.feed(res.text)
        return self._construct_urls(parser.get_hrefs())

    def _construct_urls(self, hrefs):
        # Construct valid absolute urls from href links on the page
        links = []
        for h in hrefs:
            # absolute links are used as-is, relative links are joined to the root url
            if h.startswith('http'):
                url = h
            else:
                url = self.root_url + h

            # skip email links
            if url and 'mailto:' not in url:
                links.append(url)

        return links

def get_url_arg():
    # Get command line argument or prompt user for url
    url = None
    if len(sys.argv) > 1:
        url = sys.argv[1]
    else:
        url = input("please enter a url:")
    # make sure the root url ends with a slash so relative links join correctly
    return url if url.endswith('/') else url + '/'

if __name__ == '__main__':
    url = get_url_arg()
    app = LinkChecker(url)
    app.run()

Final thoughts

There are a number of ways we might extend this application; some ideas include:

  • Output the results to a CSV report (a minimal sketch appears after this list)
  • Create a web application where the results are returned as HTML
  • Create a desktop application with a library such as Tkinter or Kivy
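
For example, the first idea is a small step from what we already have: the results list of (referrer, url, status, reason) tuples maps directly onto Python's csv module. A minimal sketch, assuming the tuple shape used in the final script:

import csv

def write_report(results, path='report.csv'):
    # each result is a (referrer, url, status, reason) tuple
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['referrer', 'url', 'status', 'reason'])
        writer.writerows(results)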