Introduction
Let’s say our website has gotten fairly big! The number of links has increased substantially, and as we all know, dead links are bad! Checking links by hand is becoming increasingly tedious. Our task ahead is to build a tool which crawls our site, following links and making sure they still work. To do this efficiently we will use the Python threading module.
Getting started
Let’s create our project and install our dependencies. I will be using the Requests module for making HTTP requests and Python’s built-in HTML parser for finding links on the page. We could use a more robust HTML parser such as Beautiful Soup, but for this project that seemed a little overkill.
mkdir link_checker && cd link_checker
virtualenv -p python3 .env
source .env/bin/activate
pip install requests
touch main.py
Next, open main.py, import our dependencies and create the basic structure for our program.
from html.parser import HTMLParser
import queue
import sys
import threading

import requests


class LinkChecker:

    def run(self):
        print('app running')


if __name__ == '__main__':
    app = LinkChecker()
    app.run()
If we run the program in its current state, 'app running' should appear in the terminal.
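A quick check from the terminal (assuming the file has been saved as main.py):

python main.py
app running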
We are importing four standard library modules and one third-party module (requests), which we installed earlier. What exactly do these modules do?
html.parser
We will use the html.parser library to find all <a href="some/url"> tags in our pages.
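As a quick taste of how html.parser works (a toy example, separate from the program we are building), we subclass HTMLParser and override handle_starttag(), which is called for every opening tag along with its attributes:

from html.parser import HTMLParser

class TagPrinter(HTMLParser):
    # handle_starttag is called once for every opening tag the parser meets
    def handle_starttag(self, tag, attrs):
        print(tag, attrs)

TagPrinter().feed('<a href="/about">About</a>')
# prints: a [('href', '/about')]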
Threading
Every time we make an HTTP request, our program stops and waits for the server to respond. This is a perfect opportunity to increase efficiency by using Python's threading module. Threads allow the CPU to work on different tasks seemingly at the same time: while one thread is waiting for a response from the server, our program can keep executing on another thread.
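As a tiny illustration of the idea (a toy example using time.sleep() to stand in for a slow HTTP request), two threads can wait at the same time, so both tasks finish after roughly one second rather than two:

import threading
import time

def slow_task(name):
    # stands in for a blocking network call
    time.sleep(1)
    print('{} finished'.format(name))

threads = [threading.Thread(target=slow_task, args=(name,)) for name in ('a', 'b')]
for t in threads:
    t.start()
for t in threads:
    t.join()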
Queue
We will use the Python queue module to manage the work for our threads. We will populate the queue with URLs for our scraper to hit, and each threaded worker will begin by grabbing a URL from the queue.
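The pattern looks roughly like this (a stripped-down sketch of what we will build, not the final code): workers loop forever pulling items with get(), mark each item finished with task_done(), and the main thread blocks on join() until everything queued has been processed.

import queue
import threading

q = queue.Queue()

def worker():
    while True:
        item = q.get()    # blocks until an item is available
        print('processing', item)
        q.task_done()     # tell the queue this item is finished

for _ in range(2):
    t = threading.Thread(target=worker)
    t.daemon = True       # threads die when the main program exits
    t.start()

for url in ('one', 'two', 'three'):
    q.put(url)

q.join()                  # wait until every queued item is marked done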
Sys
We will use the sys module to get arguments from the command line.
Let's flesh out our program a little:
Create an __init__() method on our class to check whether the user has supplied a URL on the command line and, if not, prompt them to enter one. We will also initialise our queue.
class LinkChecker:

    def __init__(self, **kwargs):
        self.queue = queue.Queue()
        self.max_workers = 5
        self.root_url = self.get_url_arg()
        self.queue.put(self.root_url)

    def run(self):
        print('#### Link Checker ####')
        print('checking: {}'.format(self.root_url))

    def get_url_arg(self):
        # Get command line argument or prompt user for url
        url = None
        if len(sys.argv) > 1:
            url = sys.argv[1]
        else:
            url = input("please enter a url:")
        return url + '/'


if __name__ == '__main__':
    app = LinkChecker()
    app.run()
Now we can run the program and check that the entered URL is printed to the screen.
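For example (using example.com as a stand-in for your own site):

python main.py https://example.com
#### Link Checker ####
checking: https://example.com/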
Now let’s get on with creating some threaded workers and giving them some tasks to do. We will create a thread for each of the workers we want to activate. Each thread is passed a callable, which we will name worker, containing the instructions for the thread to perform. Each time a worker finishes a task it calls queue.task_done(), which tells the queue that the item it fetched has been fully processed, and then loops back to grab the next URL. Finally we call queue.join(), which blocks until every task in the queue has been completed, after which the program continues execution as usual.
def run(self):
    '''Main run method of the program'''
    print('\n#### Link Checker ####')
    print('checking: {}'.format(self.root_url))
    # create the threaded workers
    for _ in range(self.max_workers):
        t = threading.Thread(target=self.worker)
        t.daemon = True
        t.start()
    # wait for all tasks to complete
    self.queue.join()
    print('DONE')

def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            print('{}: {}'.format(url, res.status_code))
            if self.root_url in url:
                self.get_a_tags(res)
        except Exception as e:
            print(e)
        self.queue.task_done()
To make sure we are only checking links on a specified site (and not sending our workers off on a wild goose chase all over the internet) we check that the domain is the same before crawling it for links.
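The check above is a simple substring match against the root URL, which is good enough for our purposes; if you wanted something stricter, a small sketch using the standard library's urllib.parse (not part of the tutorial code) could compare hostnames instead:

from urllib.parse import urlparse

def same_domain(root_url, url):
    # compare hostnames, so an external link that merely contains
    # our domain somewhere in its path is not treated as internal
    return urlparse(url).netloc == urlparse(root_url).netloc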
At the moment our workers just report the HTTP status code of the supplied URL. Let's give them something a little more interesting to do. We will create a class inheriting from HTMLParser which finds all <a> tags and collects their href values.
class HrefFinder(HTMLParser):

    def __init__(self):
        super().__init__()
        # instance attribute, so each page parsed gets a fresh list of hrefs
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs
In our main class we create an instance of HrefFinder and feed it the response text from our HTTP request:
def get_a_tags(self, res):
    parser = HrefFinder()
    parser.feed(res.text)
    self.construct_urls(parser.get_hrefs())
Finally, we pass the resulting hrefs to our URL constructor method. This method checks whether each URL is relative or absolute; if it is relative, it constructs an absolute URL using the root URL passed to the program. We also add each URL to a set as we go, to make sure we don't visit any page more than once, and a results list will store our findings:
def __init__(self):
    #--- Other stuff ---#
    # used to check whether a url has been seen or not
    self.seen = set()
    self.results = []

def construct_urls(self, hrefs):
    # Construct valid absolute urls from href links
    for h in hrefs:
        if h.startswith('http://') or h.startswith('https://'):
            # already an absolute url
            url = h
        else:
            url = self.root_url + h
        # Make sure we have not already checked the url before adding it to our queue
        if url and url not in self.seen:
            self.seen.add(url)
            self.queue.put(url)
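A quick note on the relative-to-absolute step: simple string concatenation with the root URL works for flat sites, but the standard library's urllib.parse.urljoin handles trickier hrefs such as leading slashes or ../ segments. An optional alternative, if your site needs it:

from urllib.parse import urljoin

# urljoin resolves a href against the page it was found on
urljoin('https://example.com/blog/', 'post-1')            # https://example.com/blog/post-1
urljoin('https://example.com/blog/', '/about')            # https://example.com/about
urljoin('https://example.com/blog/', 'https://other.io')  # https://other.io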
Let’s also change our worker method slightly in order to save our results instead of just printing them:
def worker(self):
    while True:
        url = self.queue.get()
        try:
            res = requests.get(url)
            self.results.append((url, res.status_code, requests.status_codes._codes[res.status_code][0]))
            # check that the site we are hitting is on the domain specified through sys.argv
            if self.root_url in url:
                self.get_a_tags(res)
        except Exception as e:
            print('{}: {}'.format(url, e))
        self.queue.task_done()
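A small aside: requests.status_codes._codes is an internal lookup table inside the requests library, so while it works it is not part of the public API. If you would rather avoid internals, the public res.reason attribute gives a similar human-readable label (the capitalisation differs, e.g. 'OK' rather than 'ok'):

import requests

res = requests.get('https://example.com')
# res.reason holds the server's reason phrase, e.g. 'OK' or 'Not Found'
print(res.status_code, res.reason)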
At this point our program can successfully do the following:
- Search our site for all links
- Report whether the link is reachable or returns an error
This is pretty cool, but imagine we now want to go ahead and fix a broken link. Our current output isn't very helpful, as it does not tell us on which page the broken link lives.
Let’s change the code so that instead of simply adding a URL to our queue, we also add a referrer URL, in other words the page on which the URL was found. This makes our output much more useful.
At this point I also decided to refactor my code.
To ensure my class is reusable, I move the get_url_arg() method out into a standalone function and pass its result directly to the class's __init__() method.
In the _worker() method, the try/except around the request now records any error the requests library raises in the results list instead of just printing it, and I move the seen-URL bookkeeping from _construct_urls() up into _worker().
The final script can be seen below:
from html.parser import HTMLParser
import queue
import sys
import threading

import requests


class HrefFinder(HTMLParser):
    '''Parse html and collect the href value of all <a> tags'''

    def __init__(self):
        super().__init__()
        # instance attribute, so each page parsed gets a fresh list of hrefs
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for a in attrs:
                if a[0] == 'href':
                    self.hrefs.append(a[1])

    def get_hrefs(self):
        return self.hrefs


class LinkChecker:
    '''
    Searches a site for all links, checking to see if they work.
    Each thread starts a worker which takes a url from the Queue,
    makes a request and constructs a list of links from the returned page.
    '''

    def __init__(self, url, *args, **kwargs):
        self.max_workers = 5
        self.queue = queue.Queue()
        self.root_url = url
        self.queue.put(('', self.root_url))
        # used to check whether a url has been seen or not
        self.seen = set()
        # somewhere to store our results
        self.results = []

    def run(self):
        '''
        Main run method of the program:
        creates threads and waits for them to complete
        '''
        print('\n#### Link Checker ####')
        print('checking: {}'.format(self.root_url))
        # create some threaded workers
        for _ in range(self.max_workers):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        # wait for tasks in the queue to finish
        self.queue.join()
        print('DONE')
        for r in self.results:
            print(r)

    def _worker(self):
        while True:
            # each queue item is a (referrer, url) tuple
            referrer, url = self.queue.get()
            try:
                res = requests.get(url)
                self.results.append((referrer, url, res.status_code,
                                     requests.status_codes._codes[res.status_code][0]))
                # only crawl pages on the domain specified through sys.argv
                if self.root_url in url:
                    for link in self._get_a_tags(res):
                        # make sure we have not already checked the url before queueing it
                        if link not in self.seen:
                            self.seen.add(link)
                            self.queue.put((url, link))
            except Exception as e:
                # in case the http request fails due to a bad url or protocol
                self.results.append((referrer, url, e))
            # our task is complete!
            self.queue.task_done()

    def _get_a_tags(self, res):
        # feed the response data to an instance of the HrefFinder class
        parser = HrefFinder()
        parser.feed(res.text)
        return self._construct_urls(parser.get_hrefs())

    def _construct_urls(self, hrefs):
        # Construct valid absolute urls from the href links on the page
        links = []
        for h in hrefs:
            if h.startswith('http://') or h.startswith('https://'):
                # already an absolute url
                url = h
            else:
                url = self.root_url + h
            if url and 'mailto:' not in url:
                links.append(url)
        return links


def get_url_arg():
    # Get command line argument or prompt user for url
    if len(sys.argv) > 1:
        url = sys.argv[1]
    else:
        url = input("please enter a url:")
    return url


if __name__ == '__main__':
    url = get_url_arg()
    app = LinkChecker(url)
    app.run()
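To try it out (with example.com standing in for your own site):

python main.py https://example.com

Each successful check is recorded as a tuple of (referrer page, url, status code, status name); failed requests store the exception instead, so either way you can see exactly which page a broken link lives on.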
Final thoughts
There are a number of ways we might extend this application; some ideas include:
- Output to a CSV report (a rough sketch of this one follows below)
- Create a web application where the results are returned as html
- Create a desktop application with a library such as Tkinter or Kivy
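For the first idea, here is a rough sketch of what a CSV export could look like, assuming a results list shaped like the tuples LinkChecker collects (the function name write_report and the file name report.csv are just placeholders):

import csv

def write_report(results, path='report.csv'):
    # one row per checked link: referrer page, url, status code, status text
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['referrer', 'url', 'status_code', 'status'])
        writer.writerows(results)

You could then call write_report(app.results) after app.run() in the __main__ block.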