English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
x-crawl is a flexible Node.js crawler library. It can crawl pages, control page operations, perform batch network requests, and batch-download file resources, and it supports crawling data in asynchronous or synchronous mode. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.
If x-crawl works well for you, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it; your Star will be the motivation for my updates.
## Features
- Support asynchronous/synchronous crawling of data.
- Flexible writing: the request configuration can be written, and crawl results obtained, in multiple ways (see the sketch after this list).
- Flexible crawl interval: it is up to you to use or to avoid high-concurrency crawling.
- With simple configuration, operations such as crawling pages, batch network requests, and batch downloads of file resources can be performed.
- Built-in polling function for crawling data at regular intervals.
- Built-in puppeteer crawls the page, and the jsdom library parses the page content; parsing it yourself is also supported.
- Written in TypeScript, with type hints and generics.
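
As a hedged illustration of the flexible request configuration (the URLs are placeholders and the exact option shapes are assumptions for this sketch, not a complete list of supported forms):

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// requestConfig as a single URL string...
myXCrawl.crawlData({ requestConfig: 'https://xxx.com/api' })

// ...as an array of URL strings...
myXCrawl.crawlData({ requestConfig: ['https://xxx.com/api1', 'https://xxx.com/api2'] })

// ...or as objects carrying per-request options.
myXCrawl.crawlData({ requestConfig: [{ url: 'https://xxx.com/api', method: 'GET' }] })
```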
## Relationship with puppeteer
The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
With the return value of the crawlPage API, we can do the following (see the sketch after this list):
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
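
A minimal sketch of the screenshot/PDF use case. The `res.data.page` shape is an assumption for illustration and may differ by version; `page.screenshot()` and `page.pdf()` are standard puppeteer Page methods:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
  // Assumption: the crawl result exposes the puppeteer page as res.data.page
  const { page } = res.data

  await page.screenshot({ path: 'page.png' }) // screenshot of the rendered page
  await page.pdf({ path: 'page.pdf' })        // PDF of the rendered page

  await page.close()
})
```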
* [Crawl interface](#Crawl-interface)
* [Crawl files](#Crawl-files)
* [Start polling](#Start-polling)
* [Crawl interval](#Crawl-interval)
* [Multiple ways of writing requestConfig options](#Multiple-ways-of-writing-requestConfig-options)
* [Multiple ways to get results](#Multiple-ways-to-get-results)
- [API](#API)
```js
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // timeout
  intervalTime: { max: 3000, min: 2000 } // control crawl frequency
})
```
Start a polling crawl with [startPolling()](#startPolling).
```js
import xCrawl from 'x-crawl'
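
// A hedged sketch of a polling setup: the instance options and the
// { d: 1 } timing (run once per day) are illustrative assumptions.
const myXCrawl = xCrawl({
  timeout: 10000,
  intervalTime: { max: 3000, min: 2000 }
})

myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
  // count records the current number of polling operations
  // stopPolling terminates subsequent polling operations when called
})
```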
Callback function parameters:
- The count attribute records the current number of polling operations.
- stopPolling is a callback function; calling it terminates subsequent polling operations.
### Crawl interval
Setting the crawl interval can prevent excessive concurrency and avoid putting too much pressure on the server.
It can be set when creating a crawler instance, or set separately for an individual API call. The crawl interval is controlled internally by each instance method, not globally by the instance.
```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  // Crawl interval; only takes effect for batch crawling
  intervalTime: { max: 3000, min: 2000 }
})
```