
Web Crawler Script

A simple web crawler script that fetches and processes data from a specified website and all of its subpages linked from the main page, the subpages of those subpages, and so on. Results are either printed to the console or saved to a file in CSV or JSON format.

The script implements a depth-first search (DFS) algorithm, storing crawled pages as nodes in a tree.
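As a rough illustration (not the repository's actual code), the crawled pages could be stored as nodes like this, where each node holds a page URL and the subpages found on it:

from dataclasses import dataclass, field

@dataclass
class Node:
    """One crawled page: its URL and the subpages linked from it."""
    url: str
    children: list["Node"] = field(default_factory=list)

def dfs(node, visit):
    # Depth-first traversal: visit the page, then descend into each subpage.
    visit(node)
    for child in node.children:
        dfs(child, visit)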

Table of contents

  • Examples & How to run the script
  • Technologies Used
  • Setup
  • Contact

Examples & How to run the script

1) Results are saved in CSV or JSON format, where each row (or JSON object) represents one page, with the following columns/keys:

- link
- title
- number of internal links
- number of external links
- number of times the URL was referenced by other pages*

*If a page contains multiple references to the same page, they are counted as one. An example CSV file is provided in the repository; an illustrative sample follows.
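For illustration only, a CSV output might look like this (column names and values are hypothetical; see the example file in the repository for the actual layout):

link,title,internal links,external links,referenced by
https://example.com,Example Domain,5,2,0
https://example.com/about,About,1,0,3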

Example:

  • To run this mode, use the crawl command
$ python app.py crawl --page <FullURL> --format <csv/json> --output <path_to_file>
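For instance, a hypothetical run (URL and output path are illustrative):
$ python app.py crawl --page https://example.com --format csv --output results.csv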

2) The script prints the structure of the page as a tree in the following format:

Main page (5)
  subpage1 (2)
    subpage1_1 (0)
    subpage1_2 (0)
  subpage2 (1)
    subpage2_1 (0)

The subpage entries represent the actual URLs of pages, and the numbers in parentheses represent the number of internal pages nested beneath each one (in the example above, the main page contains five internal pages in total); a minimal sketch of such a printer follows the example below.
Example:

  • To run this mode, use the print-tree command
$ python app.py print-tree --page <FullURL>
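As a rough sketch (not the repository's implementation), such a tree could be printed recursively from the Node structure shown earlier, with a helper that counts all nested internal pages:

def count_internal(node):
    # Total number of internal pages nested beneath this node.
    return sum(1 + count_internal(child) for child in node.children)

def print_tree(node, depth=0):
    # Indent two spaces per level; show the URL and its nested-page count.
    print("  " * depth + f"{node.url} ({count_internal(node)})")
    for child in node.children:
        print_tree(child, depth + 1)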

Technologies Used

  • Python 3.10
  • aiohttp 3.8.3
  • Typer 0.6.1
  • NumPy 1.23.3
  • unittest (standard library)
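For orientation, here is a minimal sketch (illustrative only, not the repository's code) of how the two commands shown earlier might be declared with Typer:

import typer

app = typer.Typer()

@app.command()
def crawl(
    page: str = typer.Option(..., help="Full URL of the page to crawl"),
    format: str = typer.Option("csv", help="csv or json"),
    output: str = typer.Option(..., help="Path to the output file"),
):
    ...  # fetch pages with aiohttp, build the node tree, write CSV/JSON

@app.command(name="print-tree")
def print_tree(page: str = typer.Option(..., help="Full URL of the page to crawl")):
    ...  # fetch pages and print the tree structure

if __name__ == "__main__":
    app()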

(back to top)

Setup

  • To run this project, install Python, then create and activate a virtual environment
$ python3 -m venv env
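  • Activate the environment (Linux/macOS shown; on Windows use env\Scripts\activate)
$ source env/bin/activate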
  • Clone the repo and install the packages from requirements.txt
$ git clone https://github.com/fortyfortyy/CLI-web-crawler-script.git
$ cd CLI-web-crawler-script
$ pip install -r requirements.txt
  • Go to the web_clawler_script directory
$ cd web_clawler_script
  • Then go back to Examples & How to run the script to run the commands

(back to examples)

Contact

Email: [email protected]

(back to top)
