A simple Python crawler with two frontier strategies: BFS and a PageRank-based priority queue. The project contains the following files:
py_crawler.py: the crawler script, which can be run as
>> python py_crawler.py
Note that the program requires a Bing API key; you can obtain one at azure.com.
It takes the following three inputs (see the example invocation below):
search_term: the query for the focused crawler
method: bfs or pagerank
num_pages: the number of pages to crawl
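For example, a PageRank run over 1000 pages might be invoked as below (a minimal sketch; the prompt labels and input order are assumptions, since the script may read its inputs differently):

>> python py_crawler.py
search_term: ebbets field
method: pagerank
num_pages: 1000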
The Python script was used to generate the following four logs:
- ebbets field_bfs.log
- knuckle sandwich_bfs.log
- ebbets field_pagerank.log
- knuckle sandwich_pagerank.log
Each log file contains 1000 crawled URLs along with other relevant information.
Description: The crawler has two settings:
1. BFS: uses a simple FIFO queue and crawls pages in breadth-first order.
2. PageRank: maintains a priority queue, re-running PageRank on the crawl graph after every 30 crawled URLs (see the sketch after this list).
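A minimal sketch of the PageRank setting, assuming the crawl graph is kept in networkx and the frontier in a heapq-based priority queue (the names `rerank_frontier` and `next_url` are illustrative, not the script's actual functions):

```python
import heapq
import networkx as nx

REBUILD_INTERVAL = 30  # re-run PageRank after every 30 crawled URLs

def rerank_frontier(graph, frontier_urls):
    """Recompute PageRank on the crawl graph and rebuild the priority queue.

    heapq is a min-heap, so scores are negated to pop the
    highest-PageRank URL first.
    """
    scores = nx.pagerank(graph)  # dict: url -> PageRank score
    heap = [(-scores.get(url, 0.0), url) for url in frontier_urls]
    heapq.heapify(heap)
    return heap

def next_url(heap):
    """Pop the frontier URL with the highest current PageRank score."""
    _, url = heapq.heappop(heap)
    return url
```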
Major Functions (hedged sketches of a few of these follow the table):
| Function Name | Description | Library Used |
|---|---|---|
| get_seed | Gets the first 10 links from Bing | PyBing |
| can_fetch_url | Checks robots.txt for crawl permission | Python RobotExclusion |
| save_file | Saves the HTML contents of crawled URLs | Python urllib |
| save_file | Catches various HTTP error codes | Python urllib.HTTPError |
| normalize | Normalizes the URL and adds a scheme ('http') | Python urlnorm |
| get_links | Parses the HTML file for links | BeautifulSoup |
| validate_links | Ensures only HTML files are crawled | None |
| max_per_domain | Rate control: caps pages crawled per domain | tldextract |
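Hedged sketches of a few of these helpers, substituting the standard-library urllib.robotparser for the RobotExclusion parser; the signatures and the `MAX_PER_DOMAIN` cap are illustrative assumptions, not the script's actual code:

```python
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import tldextract
from bs4 import BeautifulSoup

MAX_PER_DOMAIN = 50  # illustrative cap; the real limit is set in py_crawler.py

def can_fetch_url(url, user_agent="*"):
    """Check the site's robots.txt before crawling (here via urllib.robotparser)."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # treat an unreachable robots.txt as disallowed
    return rp.can_fetch(user_agent, url)

def get_links(html, base_url):
    """Parse an HTML page and return the absolute URLs of all anchors."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def max_per_domain(url, domain_counts):
    """Rate control: skip a URL once its registered domain hits the cap."""
    domain = tldextract.extract(url).registered_domain
    if domain_counts.get(domain, 0) >= MAX_PER_DOMAIN:
        return False
    domain_counts[domain] = domain_counts.get(domain, 0) + 1
    return True
```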
Non-Working Features:
Does not yet handle cases where two hostnames point to the same server, e.g. cis.poly.edu and csserv2.poly.edu are treated as distinct hosts.