Site Crawler

Let site-crawl crawl, so you don't have to!

A CLI tool to recursively crawl websites and download content.

Installation

npm install -g site-crawl
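
Installing globally puts the site-crawl command on your PATH (this assumes Node.js and npm are already installed). You can confirm the install by printing the built-in help:

site-crawl --help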

Usage

-u, --url Start URL

-p, --parallel Parallel downloads (default: "5")

-d, --depth Max crawl depth (number or "infinite") (default: "3")

-f, --filetypes Comma-separated extensions (default: ".pdf")

-H, --header <header...> Custom headers (Key:Value)

-P, --password PDF decryption password

-o, --output Output directory (default: ".")

-t, --timeout Request timeout (default: "10000")

-r, --max-retries Maximum retries per request (default: "3")

-e, --errors Error log file (default: "errors.txt")

--dynamic Use Puppeteer to render JS-based pages

-h, --help display help for command

Example usage

site-crawl -u "https://example.com/start" -f .pdf,.docx -d infinite -p 10 --dynamic
  • with custom headers
site-crawl -u "https://example.com/private" -H "Authorization:Bearer YOUR_TOKEN" -H "Accept-Language:en-US"
  • For Javascript based dynamically served sites
site-crawl -u "https://nitgoa.vercel.app/" -d 3 -p 5 -f .pdf --dynamic
