This is a web crawler built using Node.js and Puppeteer, designed for flexible and powerful web scraping and data extraction. It supports various features like link following, pattern matching, cookie handling, local storage interaction, and more. It uses Redis for link caching and management.
- Headless Browser Support: Leverages Puppeteer for interacting with websites as a real browser.
- Redis Integration: Uses Redis for efficient link caching and management, preventing duplicate crawls.
- Configurable Crawling: Offers a wide range of options to customize the crawling process.
- Pattern Matching: Allows defining patterns to filter URLs to be crawled.
- Cookie Handling: Supports saving and using cookies during crawling.
- Local Storage Interaction: Can read and use data from local storage files.
- Action Execution: Can execute predefined actions on web pages.
- Cloning: Supports cloning websites by downloading all matching resources.
- Exclusion/Inclusion: Allows defining lists of URLs to exclude or include during crawling.
- Curl Integration: Can use `curl` for downloading resources.
- Seed URL Support: Can start crawling from multiple seed URLs.
- Wait Times: Configurable wait times for page loading and between crawls.
- View Only: Can be used to just view the page without downloading.
- Clone the repository:

  ```sh
  git clone https://github.com/e-tang/tyo-crawler.git
  cd tyo-crawler
  ```

- Install dependencies:

  ```sh
  npm install
  ```

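  This README does not state a minimum Node.js version; since the crawler drives Puppeteer, a reasonably recent Node.js LTS is a safe assumption. If the install fails, check your toolchain first:

  ```sh
  # Verify the local toolchain (a recent Node.js LTS is assumed; no version is pinned in this README)
  node --version
  npm --version
  ```
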
- Redis: Ensure you have Redis installed and running.

  ```sh
  # Install Redis (example for Ubuntu)
  sudo apt-get update
  sudo apt-get install redis-server

  # Start Redis
  sudo systemctl start redis-server
  ```

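  A quick way to confirm that Redis is reachable on the default host and port (which the crawler uses unless configured otherwise) is:

  ```sh
  # Should print PONG if Redis is listening on localhost:6379
  redis-cli ping
  ```
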
  Alternatively, you can use the Docker / Docker Compose setup provided in the repository to run Redis:

  ```sh
  cd tyo-crawler
  cd docker
  docker-compose up -d
  ```

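  If you take the Docker route, you can check that the container came up; the ping below assumes the provided compose file publishes Redis on the default port 6379:

  ```sh
  # List the services started by the compose file
  docker-compose ps

  # Optional: ping Redis through the published port (assumes 6379 is mapped to the host)
  redis-cli -h localhost -p 6379 ping
  ```
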
- Type: string (CSS selector) or null
- Purpose: Specifies a CSS selector. If provided, the crawler will attempt to click on the element matching this selector on each page it visits. Useful for interacting with dynamic content (e.g., loading more items, navigating through pagination).
- Default: `null` (no clicking).
- Type: boolean
- Purpose: Determines whether the browser window should be visible during the crawl.
  - `true`: The browser is visible.
  - `false` (default): The browser runs in headless mode.
- Default: `false` (headless).
- Type: boolean
- Purpose: Controls whether to use a full browser (Puppeteer) for crawling.
  - `true` (default): Uses Puppeteer.
  - `false`: Uses `curl` or `wget`, which is faster but doesn’t execute JavaScript.
- Default: `true` (use browser).
- Type: number (seconds)
- Purpose: Sets the wait time (in seconds) between crawling each subsequent page to avoid overloading the server.
- Default: `31` seconds.
- Type: number
- Purpose: Defines the maximum depth of links to follow.
  - `-1`: Unlimited depth.
  - `0`: Only crawl the initial URL(s).
  - `1`: Crawl the initial URL(s) and links found on those pages.
  - `n`: Crawl up to `n` levels deep.
- Default: `-1` (unlimited).
- Type: string (regular expression) or null
- Purpose: A regular expression to filter URLs. Only matching URLs will be crawled.
- Default: `null` (crawl all links).
- Type: boolean
- Purpose: If `true`, the hostname of the first URL is used as the Redis namespace.
- Default: `false` (default namespace: `tmp`).
- Type: array
- Purpose: Placeholder for additional custom options.
- Default: `[]` (empty array).
- Type: string (file path) or null
- Purpose: Specifies where to save the content of a single crawled URL.
- Default: `null` (no output file).
- Type: string
- Purpose: The hostname or IP address of the Redis server.
- Default: `"localhost"`.
- Type: number
- Purpose: The port number of the Redis server.
- Default: `6379`.

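If your Redis instance is not on the default `localhost:6379`, point the two options above at it. A quick connectivity check from the machine that will run the crawler (the hostname and port below are placeholders) looks like:

```sh
# Placeholder host/port: substitute the values you intend to give the crawler
redis-cli -h redis.example.com -p 6380 ping
```
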
- Type: boolean
- Purpose: Enables or disables cookie handling.
- Default: `false` (no cookie handling).
- Type: string (directory path)
- Purpose: The root directory where downloaded files are saved.
- Default: `"./www"`.
- Type: boolean
- Purpose: Enables website cloning mode, downloading all resources (images, CSS, JavaScript, etc.).
- Default: `false`.
- Type: string (URL path) or null
- Purpose: Specifies the URL path to match for cloning.
- Default: `null` (no specific path).
- Type: array of strings (regular expressions)
- Purpose: URLs matching any of these patterns will be excluded.
- Default: `[]` (no exclusions).
- Type: array of strings (regular expressions)
- Purpose: Only URLs matching any of these patterns will be included.
- Default: `[]` (no specific inclusions).
- Type: boolean
- Purpose: Uses curl to download resources instead of the browser.
- Default: `false`.
- Type: boolean
- Purpose: If `true`, the crawler only views pages without downloading them.
- Default: `false`.
- Type: boolean
- Purpose: Enables seed mode, allowing multiple starting URLs.
- Default: `false`.
- Type: string (file path) or null
- Purpose: Path to a JSON file containing local storage data to inject into the browser.
- Default: `null`.

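The exact schema this file must follow is defined by the crawler and is not documented here. As a rough, purely hypothetical illustration: browser local storage is a set of string key/value pairs, so a minimal file could be created like this (keys and values are invented placeholders):

```sh
# Hypothetical example only -- check the crawler's source or examples for the exact schema it expects
cat > localstorage.json <<'EOF'
{
  "auth_token": "example-token-value",
  "preferred_locale": "en-AU"
}
EOF
```
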
- Type: string (file path) or null
- Purpose: Path to a JSON file containing actions to perform on pages.
- Default: `null`.
- Type: number (milliseconds)
- Purpose: Default wait time (in milliseconds) before the next crawl.
- Default: `1200` ms (1.2 seconds).
- Type: number (milliseconds)
- Purpose: Default time (in milliseconds) the browser waits after a page loads.
- Default: `0` milliseconds.
The crawler is executed with Node.js via the `index.js` file, using the parameters explained above.
For example, to crawl a webpage with specific options, you can run:
```sh
node index.js --show-window true example.com/awesome-page
```

But some pages may require authentication or other actions. In such cases, you can create a JSON file with the desired options and run the crawler with that file. Please see `actions.example.json` for an example configuration.
```sh
node crawler.js --actions-file actions.json --show-window true example.com/login
```

- There is a lot to do
- Cookie handling (`--with-cookies`) is not working