A simple web crawler built using Scrapy.
crawlit is limited to one domain. Given a starting URL, say http://www.example.com, it will visit all pages within that domain, but it will not follow links to external sites such as Google or Twitter.
crawlit works in two stages:
- crawl
  - crawlit crawls the given domain, finding all internal and external links and static content, and records what it finds in a JSON Lines file (a sample record is shown below)
- display
  - crawlit parses the JSON Lines file and renders it into an HTML file which can be viewed in the web browser of your choice
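Each line of the output file is a self-contained JSON record describing one page. A record might look roughly like this (the field names here are illustrative, not the exact crawlit schema):

```json
{"url": "http://www.example.com/about", "internal_links": ["http://www.example.com/"], "external_links": ["https://twitter.com/example"], "static_content": ["http://www.example.com/img/logo.png"]}
```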
crawlit requires Python 3 and pip to be installed.
Other required packages are downloaded during installation.
To install crawlit:
$ git clone https://github.com/callumski/crawlit.git
$ cd crawlit
$ make setup
This will:
- download the git repository
- cd into the folder
- run a make command that creates a virtualenv and installs all the required dependencies
For convenience there is also:
$ make all
This will set up crawlit and run the tests.
N.B. Running make all also runs make clean which will remove the virtualenv, any Python bytecode files and the ./output folder.
The tests are written using pytest. To run them:
$ make test
N.B. crawlit has been tested with Python 3 on macOS High Sierra.
To crawl the domain www.example.com, render the result and display it in your default web browser:
$ make run url=http://www.example.com
N.B. The whole URL including scheme is necessary.
To crawl the domain www.example.com:
$ make crawl url=http://www.example.com
N.B. The whole URL including scheme is necessary.
This will write a JSON Lines file to the ./output folder. The file will be named crawlit.NNNNNNNNNN.json, where NNNNNNNNNN is the system time since the Epoch in milliseconds.
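The millisecond timestamp is presumably taken from the system clock, along the lines of the sketch below (an assumption about the implementation, not a quote from it):

```python
import time

# e.g. crawlit.1712345678901.json (milliseconds since the Epoch)
filename = f"crawlit.{int(time.time() * 1000)}.json"
```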
To render and display a given crawlit JSON file:
$ CRAWLIT_JSON_FILE=path/to/crawlit.NNNNNNNNNN.json make display
Scrapy was chosen because it offers a fully featured framework for web scraping. This avoided reinventing the wheel and provided several important features out of the box, including auto-throttling, respect for robots.txt, parallel downloads, XPath navigation and a variety of output formats.
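The spider below is a minimal sketch of this approach, not crawlit's actual spider: the class name, yielded fields and settings are assumptions, but it illustrates the Scrapy features mentioned above (allowed_domains to stay on one site, ROBOTSTXT_OBEY and AUTOTHROTTLE_ENABLED for polite crawling).

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(scrapy.Spider):
    name = "example"
    # Restrict the crawl to a single domain
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com"]

    # Out-of-the-box Scrapy features, enabled via settings
    custom_settings = {
        "ROBOTSTXT_OBEY": True,        # respect robots.txt
        "AUTOTHROTTLE_ENABLED": True,  # auto-throttling
    }

    def parse(self, response):
        # Collect links that stay within the allowed domain
        links = LinkExtractor(allow_domains=self.allowed_domains).extract_links(response)
        yield {
            "url": response.url,
            "internal_links": [link.url for link in links],
        }
        # Follow each internal link and parse it the same way
        for link in links:
            yield response.follow(link, callback=self.parse)
```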
JSON Lines was chosen as the output format because each page is written as a self-contained JSON record on its own line, so the output remains valid even if the crawler is stopped before it has finished crawling the website. It also means that the output can be parsed in a memory-efficient way if necessary.
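Because each record sits on its own line, the file can be streamed one page at a time instead of being loaded whole. A minimal sketch (the field names are assumed, not the exact schema):

```python
import json


def iter_pages(path):
    """Yield one parsed page record per line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


for page in iter_pages("output/crawlit.NNNNNNNNNN.json"):
    print(page["url"], len(page.get("internal_links", [])))
```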
Each page is represented as a single CrawlitItem object. This allows simple, atomic processing of each page, and the per-page list of internal links would allow a graph of the pages to be easily generated from the list of items.
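As a rough sketch (the real CrawlitItem may define different fields), a per-page item and the graph construction it enables could look like this:

```python
import scrapy


class CrawlitItem(scrapy.Item):
    url = scrapy.Field()
    internal_links = scrapy.Field()
    external_links = scrapy.Field()
    static_content = scrapy.Field()


def build_site_graph(items):
    """Map each page URL to the internal pages it links to."""
    return {item["url"]: list(item["internal_links"]) for item in items}
```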
Rendering the output to an HTML file with Jinja2 was chosen for its simplicity of implementation. For a very large site the HTML file might grow too large; pagination could help with this, and the rendering step could be avoided altogether by loading the output JSON with JavaScript, but both were beyond the scope of this project. The styling of the HTML is fairly barebones, again chosen for ease of implementation.
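A minimal sketch of what the render-and-display step could look like with Jinja2 is shown below; the template name, context variable and output path are assumptions rather than crawlit's actual code:

```python
import json
import webbrowser

from jinja2 import Environment, FileSystemLoader


def render(json_path, template_dir="templates", out_path="output/sitemap.html"):
    # Load one page record per line from the crawl output
    with open(json_path, encoding="utf-8") as f:
        pages = [json.loads(line) for line in f if line.strip()]

    # Render the records into a single HTML page
    env = Environment(loader=FileSystemLoader(template_dir), autoescape=True)
    html = env.get_template("sitemap.html").render(pages=pages)

    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
    webbrowser.open(out_path)
```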
The following extensions are possible:
- Better handling of non-HTTP schemes. mailto: and tel: links are not treated specially
- Tests for the HTML rendering
- Error checking for the absence of a scheme in the input URL
- Add the ability to recommence a crawl that was stopped. This could be done by passing in the output from a previous crawl to populate the list of parsed pages
- In the HTML display, we could allow navigation between internal pages in the sitemap. Currently the links point to the actual URLs, but it might be nicer to be able to navigate the sitemap itself
- Validation of functionality on other operating systems
- Better data handling for very large websites. It is possible the current display functionality would be too resource intensive for very large domains
- Giving it an end-to-end HTML UI to allow it to be deployed as a web application
- Storing of output in a database