
Web Crawler Script

A simple web crawler script that fetches and processes data from a specified website and all of its subpages linked from the main page, the subpages of those subpages, and so on. Results are either printed to the console or saved to a file in CSV or JSON format.

The script implements a depth-first search (DFS) algorithm, storing crawled pages as nodes in a tree.
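As a rough illustration (not the repository's actual code), the crawled pages could be stored as nodes like this, where each node holds a page URL and the subpages found on it:

from dataclasses import dataclass, field

@dataclass
class Node:
    """One crawled page: its URL and the subpages linked from it."""
    url: str
    children: list["Node"] = field(default_factory=list)

def dfs(node, visit):
    # Depth-first traversal: visit the page, then descend into each subpage.
    visit(node)
    for child in node.children:
        dfs(child, visit)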

Table of contents

  • Examples & How to run the script
  • Technologies Used
  • Setup
  • Contact

Examples & How to run the script

1) Results are saved in CSV or JSON format, where each row (or JSON object) represents one page, with the following columns/keys:

- link
- title
- number of internal links
- number of external links
- number of times the URL was referenced by other pages*

*If a page contains multiple references to the same page, they are counted as one. An example CSV file is provided in the repository; an illustrative sample follows.
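For illustration only, a CSV output might look like this (column names and values are hypothetical; see the example file in the repository for the actual layout):

link,title,internal links,external links,referenced by
https://example.com,Example Domain,5,2,0
https://example.com/about,About,1,0,3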

Example:

  • To run this mode, use the crawl command
$ python app.py crawl --page <FullURL> --format <csv/json> --output <path_to_file>
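For instance, a hypothetical run (URL and output path are illustrative):
$ python app.py crawl --page https://example.com --format csv --output results.csv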

2) The script prints the structure of the page as a tree in the following format:

Main page (5)
  subpage1 (2)
    subpage1_1 (0)
    subpage1_2 (0)
  subpage2 (1)
    subpage2_1 (0)

The subpage entries represent the actual URLs of pages, and the numbers in parentheses represent the number of internal pages nested beneath each one (in the example above, the main page contains five internal pages in total); a minimal sketch of such a printer follows the example below.
Example:

  • To run this mode, use the print-tree command
$ python app.py print-tree --page <FullURL>
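As a rough sketch (not the repository's implementation), such a tree could be printed recursively from the Node structure shown earlier, with a helper that counts all nested internal pages:

def count_internal(node):
    # Total number of internal pages nested beneath this node.
    return sum(1 + count_internal(child) for child in node.children)

def print_tree(node, depth=0):
    # Indent two spaces per level; show the URL and its nested-page count.
    print("  " * depth + f"{node.url} ({count_internal(node)})")
    for child in node.children:
        print_tree(child, depth + 1)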

Technologies Used

  • Python 3.10
  • aiohttp 3.8.3
  • Typer 0.6.1
  • NumPy 1.23.3
  • unittest (standard library)
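For orientation, here is a minimal sketch (illustrative only, not the repository's code) of how the two commands shown earlier might be declared with Typer:

import typer

app = typer.Typer()

@app.command()
def crawl(
    page: str = typer.Option(..., help="Full URL of the page to crawl"),
    format: str = typer.Option("csv", help="csv or json"),
    output: str = typer.Option(..., help="Path to the output file"),
):
    ...  # fetch pages with aiohttp, build the node tree, write CSV/JSON

@app.command(name="print-tree")
def print_tree(page: str = typer.Option(..., help="Full URL of the page to crawl")):
    ...  # fetch pages and print the tree structure

if __name__ == "__main__":
    app()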

(back to top)

Setup

  • To run this project, install Python, then create and activate a virtual environment
$ python3 -m venv env
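  • Activate the environment (Linux/macOS shown; on Windows use env\Scripts\activate)
$ source env/bin/activate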
  • Clone the repo and install the packages from requirements.txt
$ git clone https://github.com/fortyfortyy/CLI-web-crawler-script.git
$ cd CLI-web-crawler-script
$ pip install -r requirements.txt
  • Go to the web_clawler_script directory
$ cd web_clawler_script
  • Then go back to Examples & How to run the script to run the commands

(back to examples)

Contact

Email: [email protected]

(back to top)
