
Commit 602b3ce: reconstitution

1 parent: a4cdc1d

13 files changed: +397 −215 lines changed

README.md (+102 −55)
@@ -2,9 +2,9 @@

 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)

-x-crawl is a Nodejs multifunctional crawler library.
+x-crawl is a flexible nodejs crawler library.

-If it helps you, please give the [repository](https://github.com/coder-hxl/x-crawl) a Star to support it.
+If it helps you, please give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it.

 ## Features

@@ -34,11 +34,12 @@ The following can be done:
   * [Create application](#Create-application)
     + [An example of a crawler application](#An-example-of-a-crawler-application)
     + [Choose crawling mode](#Choose-crawling-mode)
-    + [Set interval time](#Set-interval-time)
     + [Multiple crawler application instances](#Multiple-crawler-application-instances)
   * [Crawl page](#Crawl-page)
   * [Crawl interface](#Crawl-interface)
   * [Crawl files](#Crawl-files)
+  * [Request interval time](#Request-interval-time)
+  * [Multiple ways of writing requestConfig options](#Multiple-ways-of-writing-requestConfig-options)
 - [API](#API)
   * [x-crawl](#x-crawl-2)
     + [Type](#Type-1)
@@ -61,8 +62,9 @@ The following can be done:
 - [Types](#Types)
   * [AnyObject](#AnyObject)
   * [Method](#Method)
-  * [RequestBaseConfig](#RequestBaseConfig)
+  * [RequestConfigObject](#RequestConfigObject)
   * [RequestConfig](#RequestConfig)
+  * [MergeRequestConfigObject](#MergeRequestConfigObject)
   * [IntervalTime](#IntervalTime)
   * [XCrawlBaseConfig](#XCrawlBaseConfig)
   * [CrawlBaseConfigV1](#CrawlBaseConfigV1)
@@ -90,6 +92,7 @@ Regular crawling: Get the recommended pictures of the youtube homepage every oth

 ```js
 // 1.Import module ES/CJS
+import path from 'node:path'
 import xCrawl from 'x-crawl'

 // 2.Create a crawler instance
@@ -114,15 +117,17 @@ myXCrawl.startPolling({ d: 1 }, () => {
     const requestConfig = []
     imgEls.forEach((item) => {
       if (item.src) {
-        requestConfig.push({ url: item.src })
+        requestConfig.push(item.src)
       }
     })

     // Call the crawlFile API to crawl pictures
-    myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
+    myXCrawl.crawlFile({
+      requestConfig,
+      fileConfig: { storeDir: path.resolve(__dirname, './upload') }
+    })
   })
 })
-
 ```

 running result:
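
This hunk captures the commit's central API change: requestConfig entries no longer have to be objects, so the collected image URLs can be pushed as bare strings. A minimal sketch of the two forms that are interchangeable after this change (placeholder URLs):

```ts
// Both arrays describe the same requests after this commit:
const asObjects = [{ url: 'https://xxx.com/a.jpg' }, { url: 'https://xxx.com/b.jpg' }]
const asStrings = ['https://xxx.com/a.jpg', 'https://xxx.com/b.jpg']
```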
@@ -174,25 +179,6 @@ The mode option defaults to async .

 If there is an interval time set, it is necessary to wait for the interval time to end before sending the request.

-#### Set interval time
-
-Setting the interval time can prevent too much concurrency and avoid too much pressure on the server.
-
-```js
-import xCrawl from 'x-crawl'
-
-const myXCrawl = xCrawl({
-  intervalTime: { max: 3000, min: 1000 }
-})
-```
-
-The intervalTime option defaults to undefined . If there is a setting value, it will wait for a period of time before requesting, which can prevent too much concurrency and avoid too much pressure on the server.
-
-- number: The time that must wait before each request is fixed
-- Object: Randomly select a value from max and min, which is more anthropomorphic
-
-The first request is not to trigger the interval.
-
 #### Multiple crawler application instances

 ```js
@@ -223,9 +209,9 @@ Crawl interface data through [crawlData()](#crawlData)

 ```js
 const requestConfig = [
-  { url: 'https://xxx.com/xxxx' },
-  { url: 'https://xxx.com/xxxx' },
-  { url: 'https://xxx.com/xxxx' }
+  { url: 'https://xxx.com/xxxx' },
+  { url: 'https://xxx.com/xxxx', method: 'POST', data: { name: 'coderhxl' } },
+  { url: 'https://xxx.com/xxxx' }
 ]

 myXCrawl.crawlData({ requestConfig }).then(res => {
@@ -240,11 +226,7 @@ Crawl file data via [crawlFile()](#crawlFile)
 ```js
 import path from 'node:path'

-const requestConfig = [
-  { url: 'https://xxx.com/xxxx' },
-  { url: 'https://xxx.com/xxxx' },
-  { url: 'https://xxx.com/xxxx' }
-]
+const requestConfig = [ 'https://xxx.com/xxxx', 'https://xxx.com/xxxx' ]

 myXCrawl. crawlFile({
   requestConfig,
@@ -256,6 +238,66 @@ myXCrawl. crawlFile({
 })
 ```

+### Request interval time
+
+Setting the requests interval time can prevent too much concurrency and avoid too much pressure on the server.
+
+```js
+import xCrawl from 'x-crawl'
+
+const myXCrawl = xCrawl({
+  intervalTime: { max: 3000, min: 1000 }
+})
+```
+
+The intervalTime option defaults to undefined . If there is a setting value, it will wait for a period of time before requesting, which can prevent too much concurrency and avoid too much pressure on the server.
+
+- number: The time that must wait before each request is fixed
+- Object: Randomly select a value from max and min, which is more anthropomorphic
+
+The first request is not to trigger the interval.
+
+### Multiple ways of writing requestConfig options
+
+The writing method of requestConfig is very flexible, there are 5 types in total, which can be:
+
+- string
+- array of strings
+- object
+- array of objects
+- string plus object array
+
+```js
+// requestConfig writing method 1:
+const requestConfig1 = 'https://xxx.com/xxxx'
+
+// requestConfig writing method 2:
+const requestConfig2 = [ 'https://xxx.com/xxxx', 'https://xxx.com/xxxx', 'https://xxx.com/xxxx' ]
+
+// requestConfig writing method 3:
+const requestConfig3 = {
+  url: 'https://xxx.com/xxxx',
+  method: 'POST',
+  data: { name: 'coderhxl' }
+}
+
+// requestConfig writing method 4:
+const requestConfig4 = [
+  { url: 'https://xxx.com/xxxx' },
+  { url: 'https://xxx.com/xxxx', method: 'POST', data: { name: 'coderhxl' } },
+  { url: 'https://xxx.com/xxxx' }
+]
+
+// requestConfig writing method 5:
+const requestConfig5 = [
+  'https://xxx.com/xxxx',
+  { url: 'https://xxx.com/xxxx', method: 'POST', data: { name: 'coderhxl' } },
+  'https://xxx.com/xxxx'
+]
+```
+
+It can be selected according to the actual situation.
+
 ## API

 ### x-crawl
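
The restored section documents intervalTime as either a number or a { max, min } object, but shows only the object form. A minimal sketch of the fixed-number form, assuming the same option on the same constructor (2000 is a placeholder value):

```ts
import xCrawl from 'x-crawl'

// Fixed interval: wait a constant 2000 ms before every request after the first
const myXCrawl = xCrawl({
  intervalTime: 2000
})
```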
@@ -356,9 +398,9 @@ function crawlData: <T = any>(

 ```js
 const requestConfig = [
-  { url: '/xxxx' },
-  { url: '/xxxx' },
-  { url: '/xxxx' }
+  { url: 'https://xxx.com/xxxx' },
+  { url: 'https://xxx.com/xxxx', method: 'POST', data: { name: 'coderhxl' } },
+  { url: 'https://xxx.com/xxxx' }
 ]

 myXCrawl.crawlData({ requestConfig }).then(res => {
387429
#### Example
388430

389431
```js
390-
const requestConfig = [
391-
{ url: '/xxxx' },
392-
{ url: '/xxxx' },
393-
{ url: '/xxxx' }
394-
]
432+
const requestConfig = [ 'https://xxx.com/xxxx', 'https://xxx.com/xxxx' ]
395433
396434
myXCrawl.crawlFile({
397435
requestConfig,
@@ -443,24 +481,33 @@ interface AnyObject extends Object {
443481
type Method = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
444482
```
445483

446-
### RequestBaseConfig
447-
448-
```ts
449-
interface RequestBaseConfig {
450-
url: string
451-
timeout?: number
452-
proxy?: string
453-
}
454-
```
455-
456-
### RequestConfig
484+
### RequestConfigObject
457485

458486
```ts
459-
interface RequestConfig extends RequestBaseConfig {
487+
interface RequestConfigObject {
488+
url: string
460489
method?: Method
461490
headers?: AnyObject
462491
params?: AnyObject
463492
data?: any
493+
timeout?: number
494+
proxy?: string
495+
}
496+
```
497+
498+
### RequestConfig
499+
500+
```ts
501+
type RequestConfig = string | RequestConfigObject
502+
```
503+
504+
### MergeRequestConfigObject
505+
506+
```ts
507+
interface MergeRequestConfigObject {
508+
url: string
509+
timeout?: number
510+
proxy?: string
464511
}
465512
```
466513

@@ -497,7 +544,7 @@ interface CrawlBaseConfigV1 {
 ### CrawlPageConfig

 ```ts
-type CrawlPageConfig = string | RequestBaseConfig
+type CrawlPageConfig = string | MergeRequestConfigObject
 ```

 ### CrawlDataConfig
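
CrawlPageConfig switches to the renamed MergeRequestConfigObject, so a page crawl is configured with either a bare URL string or an object limited to url, timeout, and proxy. A hedged sketch, assuming crawlPage accepts a CrawlPageConfig as its name suggests (URL and timeout values are placeholders):

```ts
// Either form should satisfy CrawlPageConfig after this commit:
myXCrawl.crawlPage('https://xxx.com/xxxx')
myXCrawl.crawlPage({ url: 'https://xxx.com/xxxx', timeout: 10000 })
```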
