By Ethan Shearer May 22, 2019

Introduction
Let’s say our website has gotten kind of big! The number of links has increased substantially, and as we all know, dead links are bad! It is now becoming increasingly tedious to check links manually. Our task ahead is to build a tool which searches our site, following links and ensuring they are still functioning. To do this in an efficient manner we will use Python’s threading module.
Getting started
Let’s create our project and install our dependencies. I will be using the Requests module for making HTTP requests and Python’s built-in HTML parser for finding links on the page. We could use a more robust HTML parser such as Beautiful Soup, however for this project that seemed like overkill.
mkdir link_checker && cd link_checker
virtualenv -p python3 .env
source .env/bin/activate
pip install requests
touch main.py
Our next task is to open main.py, import our dependencies, and create the basic structure for our program.
from html.parser import HTMLParser
import queue
import sys
import threading

import requests

class LinkChecker:
    def run(self):
        print('app running')

if __name__ == '__main__':
    app = LinkChecker()
    app.run()
If we run the program in its current state, 'app running' should appear in the terminal. We are importing four standard modules and one third party module (requests), which we installed earlier. What is it these modules do exactly?
HTML parser
We will use the html.parser module to find all <a href="some/url"> tags in our pages.
Threading
Every time we create an HTTP request, our program stops and waits for the server to respond. This is a perfect opportunity to increase efficiency by using Python’s threading module. A thread allows the CPU to work on different tasks seemingly at the same time: while one thread is waiting for a response from the server, our program can continue executing on a different thread.
Queue
We will use the Python queue module to manage the work for our threads. We will populate the queue with URLs for our scraper to hit, and the threaded workers will grab URLs from the queue.
Sys
We will use the sys module to get arguments from the command line.
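To see how these two modules fit together before we wire them into the link checker, here is a minimal, standalone sketch of the worker pattern we will use (the URLs are just placeholders):
import queue
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()    # blocks until an item is available
        print('processing', item)
        q.task_done()     # tell the queue this item has been handled

# daemon threads exit automatically when the main program finishes
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

for url in ['https://example.com/a', 'https://example.com/b']:
    q.put(url)

q.join()    # block until every queued item has been marked done
Each worker loops forever, pulling items off the shared queue, while q.join() lets the main thread wait until everything queued has been processed.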
Let’s flesh out our program a little:
Create an __init__() method on our class to check if the user has supplied a URL from the command line and, if not, prompt them to enter one. We will also initialise our queue.
class LinkChecker:
    def __init__(self, **kwargs):
        self.queue = queue.Queue()
        self.max_workers = 5
        self.root_url = self.get_url_arg()
        self.queue.put(self.root_url)

    def run(self):
        print('#### Link Checker ####')
        print('checking: {}'.format(self.root_url))

    def get_url_arg(self):
        # Get command line argument or prompt user for url
        url = None
        if len(sys.argv) > 1:
            url = sys.argv[1]
        else:
            url = input("please enter a url:")
        return url + '/'

if __name__ == '__main__':
    app = LinkChecker()
    app.run()
Now we can run the program and test that the entered url is printed to the screen.
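If the script is saved as main.py, as above, we can pass the URL as a command line argument (example.com is just a placeholder):
python main.py https://example.com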
Now let’s get on with creating some threaded workers and giving them some tasks to do. To do this we will create a thread for each of the workers we want to activate. Each thread will be passed a callable, which we will name worker; this callable contains the instructions for the thread to perform. Each time a worker finishes a task it calls queue.task_done(), which tells the queue that the fetched item has been processed, and then loops around to grab a new task. Finally we call queue.join(), which blocks until every task in the queue has been completed, after which the program continues execution as usual.
def run(self):
    '''Main run method of the program'''
    print('\n#### Link Checker ####')
    print('checking: {}'.format(self.root_url))

    # create the worker threads
    for _ in range(self.max_workers):
        t = threading.Thread(target=self.worker)
        t.daemon = True
        t.start()

    # wait for all tasks to complete
    self.queue.join()

    print('DONE')

def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            print('{}: {}'.format(url, res.status_code))
            if self.root_url in url:
                self.get_a_tags(res)
        except Exception as e:
            print(e)
        self.queue.task_done()
To make sure we are only checking links on a specified site (and not sending our workers off on a wild goose chase all over the internet) we check that the domain is the same before crawling it for links.
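The check in the worker above is a simple substring comparison against self.root_url, which is good enough for this tutorial. A stricter alternative (not used in this article’s code) would be to compare hostnames with the standard library’s urllib.parse, for example:
from urllib.parse import urlparse

def same_domain(url, root_url):
    # compare the network location (host) portion of the two urls
    return urlparse(url).netloc == urlparse(root_url).netloc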
At the moment our workers are just reporting the HTTP status code of the supplied URL. Let’s give them something a little more interesting to do. We will create a class inheriting from HTMLParser which finds all <a> tags and returns their href values.
class HrefFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        # use an instance attribute so each parser starts with an empty list
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs
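We can sanity check the parser by feeding it a small HTML snippet (the snippet here is just for illustration):
finder = HrefFinder()
finder.feed('<p><a href="/about">About</a> <a href="https://example.com/">Home</a></p>')
print(finder.get_hrefs())  # ['/about', 'https://example.com/']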
In our main class we create an instance of HrefFinder and feed it the response text from our http request:
def get_a_tags(self, res):
    parser = HrefFinder()
    parser.feed(res.text)
    self.construct_urls(parser.get_hrefs())
Finally we will pass the resulting hrefs to our URL constructor function. This function checks whether each URL is relative or absolute; if it is relative, it constructs an absolute URL using the root URL argument passed to the program. We also add each URL to a set to make sure we don’t visit it more than once, and the results list will store our results:
def __init__(self):
    #--- Other stuff ---#
    # use to check if a url has been seen or not
    self.seen = set()
    self.results = []

def construct_urls(self, hrefs):
    # Construct valid absolute urls from href links
    for h in hrefs:
        url = None

        if h.startswith('http'):
            # already an absolute url
            url = h
        else:
            # relative link - prepend the root url
            url = self.root_url + h

        # Make sure we have not already checked the url before adding it to our queue
        if url and url not in self.seen:
            self.seen.add(url)
            self.queue.put(url)
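Concatenating self.root_url and a relative href works for the simple case, but it is worth knowing that urllib.parse.urljoin (not used in this article) handles trickier hrefs such as ones starting with '/' or '../':
from urllib.parse import urljoin

urljoin('https://example.com/blog/', 'post-1')    # 'https://example.com/blog/post-1'
urljoin('https://example.com/blog/', '/contact')  # 'https://example.com/contact'
urljoin('https://example.com/blog/', '../about')  # 'https://example.com/about'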
Let’s also change our worker method slightly in order to save our results instead of just printing them:
def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            # store the url, status code and human readable status name
            self.results.append((url, res.status_code, requests.status_codes._codes[res.status_code][0],))

            # check that the site we are hitting is the domain specified through sys.argv[1]
            if self.root_url in url:
                self.get_a_tags(res)

        except Exception as e:
            print('{}: {}'.format(url, e))
        self.queue.task_done()
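With results stored, we could print a short summary of just the broken links once the queue has been drained. A minimal sketch, assuming the (url, status code, status name) tuples appended above and treating any status of 400 or higher as broken; it would sit after self.queue.join() in run():
for url, status, reason in self.results:
    if status >= 400:
        print('BROKEN: {} ({} {})'.format(url, status, reason))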
At this point our program can successfully do the following:
- Search our site for all links
- Report whether the link is reachable or returns an error
This is pretty cool, but imagine we now want to go ahead and fix a broken link. Our output isn’t very helpful to us, as it does not tell us on which page the broken link exists.
Let’s change the code so that instead of simply adding a URL to our queue we also add a referrer URL: in other words, the page on which the URL was found. This makes our output much more useful.
At this point I also decided to refactor my code.
To ensure my class is reusable I move the get_url_arg() method out to a standalone function and pass the result to my class’s __init__() method directly.
In the worker method I record any errors that the requests library might encounter in the results list instead of just printing them. I also move the duplicate-checking and queueing logic from the construct_urls() method up into the worker method.
The final script can be seen below:
from html.parser import HTMLParser
import queue
import sys
import threading

import requests


class HrefFinder(HTMLParser):
    '''parse html and find the href value of all <a> tags'''

    def __init__(self):
        super().__init__()
        # use an instance attribute so each parser starts with an empty list
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs


class LinkChecker:
    '''
    Searches a site for all links, checking to see if they work.
    Each thread starts a worker which takes a url from the Queue,
    then creates a request and constructs a list of links from the returned page.
    '''

    def __init__(self, url, *args, **kwargs):
        self.max_workers = 5
        self.queue = queue.Queue()
        self.root_url = url
        self.queue.put(('', self.root_url,))

        # use to check if a url has been seen or not
        self.seen = set()

        # somewhere to store our results
        self.results = []

    def run(self):
        '''
        Main run method of the program
        creates threads and waits for them to complete
        '''
        print('\n#### Link Checker ####')
        print('checking: {}'.format(self.root_url))

        # create some threaded workers
        for _ in range(self.max_workers):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()

        # wait for tasks in queue to finish
        self.queue.join()

        print('DONE')
        for r in self.results:
            print(r)

    def _worker(self):
        while True:
            url = self.queue.get()
            try:
                res = requests.get(url[1])
                # store (referrer, url, status code, status name)
                self.results.append((url[0], url[1], res.status_code, requests.status_codes._codes[res.status_code][0],))
                # check that the site we are hitting is the domain specified through sys.argv[1]
                if self.root_url in url[1]:
                    all_links = self._get_a_tags(res)
                    for l in all_links:
                        # Make sure we have not already checked the url before adding it to our queue
                        if l not in self.seen:
                            self.seen.add(l)
                            self.queue.put((url[1], l,))

            except Exception as e:
                # in case the http request fails due to a bad url or protocol
                self.results.append((url[0], url[1], e))

            # our task is complete!
            self.queue.task_done()

    def _get_a_tags(self, res):
        # feed response data to an instance of the HrefFinder class
        parser = HrefFinder()
        parser.feed(res.text)
        return self._construct_urls(parser.get_hrefs())

    def _construct_urls(self, hrefs):
        # Construct valid absolute urls from href links on the page
        links = []
        for h in hrefs:
            if h.startswith('http'):
                # already an absolute url
                url = h
            else:
                # relative link - prepend the root url
                url = self.root_url + h

            # skip email links
            if url and 'mailto:' not in url:
                links.append(url)

        return links


def get_url_arg():
    # Get command line argument or prompt user for url
    url = None
    if len(sys.argv) > 1:
        url = sys.argv[1]
    else:
        url = input("please enter a url:")
    return url


if __name__ == '__main__':
    url = get_url_arg()
    app = LinkChecker(url)
    app.run()
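Because the refactored LinkChecker now receives its URL through __init__() rather than reading sys.argv itself, it can also be used from other code, not just the command line. A small usage sketch, assuming the script above is saved as main.py:
from main import LinkChecker

checker = LinkChecker('https://example.com/')
checker.run()

# the collected (referrer, url, status, reason) tuples remain available afterwards
print(len(checker.results), 'links checked')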
Final thoughts
There are a number of ways we might decide to extend this application; some ideas include:
- Output the results to a CSV report (a minimal sketch follows this list)
- Create a web application where the results are returned as html
- Create a desktop application with a library such as Tkinter or Kivy
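As a starting point for the first idea, exporting the results with Python’s csv module is straightforward. A minimal sketch, assuming the result tuples produced by the final script and an output filename of report.csv (our choice here):
import csv

def write_csv_report(results, path='report.csv'):
    # each result is (referrer, url, status or exception[, status name])
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['referrer', 'url', 'status', 'reason'])
        for row in results:
            # failed requests store only three values, so pad the row
            writer.writerow(list(row) + [''] * (4 - len(row)))
It could be called with write_csv_report(app.results) after app.run() completes.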