- # <div id="en">x-crawl</div>
+ # x-crawl

- English | <a href="#cn" style="text-decoration: none">简体中文</a>
+ English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/document/cn.md)

XCrawl is a multifunctional Node.js crawler library. Crawl HTML, JSON, file resources, and more through simple configuration.

+ ## Highlights
+
+ - Call the API to crawl HTML, JSON, file resources, etc.
+ - Batch requests can be sent in asynchronous or synchronous mode
+
## Install

Take npm as an example:

@@ -33,13 +38,13 @@ docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
})
```

- ## Key concept
+ ## Core concepts

### XCrawl

Create a crawler instance via new XCrawl.

- - Type
+ #### Type

```ts
class XCrawl {
@@ -51,33 +56,42 @@ class XCrawl {
}
```

- - <div id="myXCrawl">Example</div>
+ #### <div id="myXCrawl">Example</div>

myXCrawl is the crawler instance used in the following examples.

```js
const myXCrawl = new XCrawl({
  baseUrl: 'https://xxx.com',
  timeout: 10000,
- // The interval of the next request , multiple requests are valid
+ // The interval between requests, effective only when multiple requests are made
  intervalTime: {
    max: 2000,
    min: 1000
  }
})
```
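The { max, min } pair above suggests each wait is drawn between those bounds. A minimal sketch of such a draw (a hypothetical helper, not the library's actual internals):

```js
// Hypothetical helper mirroring the { max, min } shape accepted by intervalTime.
// Draws a wait time in milliseconds uniformly from [min, max].
function randomInterval({ max, min = 0 }) {
  return min + Math.floor(Math.random() * (max - min + 1))
}

const wait = randomInterval({ max: 2000, min: 1000 })
console.log(wait >= 1000 && wait <= 2000) // prints true
```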
+ #### About the mode
+
+ The mode option defaults to async.
+
+ - async: in batch requests, the next request is sent without waiting for the current one to complete
+ - sync: in batch requests, each request must complete before the next one is sent
+
+ If an interval time is set, the request is sent only after the interval has elapsed.
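The two dispatch orders can be modeled with plain promises. The sketch below is only an illustration of the ordering (the batch helper is hypothetical, not x-crawl's internals), assuming each request is a promise-returning function:

```js
// Illustrative model of the two batch modes (hypothetical helper).
async function batch(tasks, mode = 'async') {
  if (mode === 'sync') {
    // sync: each request must settle before the next one starts
    const results = []
    for (const task of tasks) results.push(await task())
    return results
  }
  // async: start all requests immediately, collect results together
  return Promise.all(tasks.map((task) => task()))
}

const tasks = [() => Promise.resolve(1), () => Promise.resolve(2)]
batch(tasks, 'sync').then((res) => console.log(res)) // prints [ 1, 2 ]
```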
### fetchHTML

fetchHTML is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl HTML.

- - Type
+ #### Type

```ts
function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
```

- - Example
+ #### Example

```js
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
@@ -89,13 +103,13 @@ myXCrawl.fetchHTML('/xxx').then((jsdom) => {
fetchData is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl APIs and obtain JSON data.

- - Type
+ #### Type

```ts
function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
```

- - Example
+ #### Example

```js
const requestConifg = [
@@ -116,13 +130,13 @@ myXCrawl.fetchData({
fetchFile is a method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl files such as images and PDF files.

- - Type
+ #### Type

```ts
function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
```

- - Example
+ #### Example

```js
const requestConifg = [
@@ -202,7 +216,7 @@ type IFetchCommon<T> = {
- IFileInfo

```ts
- IFileInfo {
+ interface IFileInfo {
  fileName: string
  mimeType: string
  size: number
@@ -217,6 +231,7 @@ interface IXCrawlBaseConifg {
  baseUrl?: string
  timeout?: number
  intervalTime?: IIntervalTime
+ mode?: 'async' | 'sync' // default: 'async'
}
```
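With the new mode field, a base config opting into synchronous mode might look like this (a plain-object sketch; the option name and default come from the interface above):

```js
// Illustrative base config using the new mode option
const baseConfig = {
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  intervalTime: { max: 2000, min: 1000 },
  mode: 'sync' // omit to keep the default 'async'
}
console.log(baseConfig.mode) // prints sync
```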
@@ -246,256 +261,3 @@ interface IFetchFileConfig extends IFetchBaseConifg {
## More

If you have any **questions** or **needs**, please submit **Issues** at https://github.com/coder-hxl/x-crawl/issues .
-
-
- ---
-
-
- # <div id="cn">x-crawl</div>
-
- <a href="#en" style="text-decoration: none">English</a> | 简体中文
-
- XCrawl 是 Nodejs 多功能爬虫库。只需简单的配置即可抓取 HTML 、JSON、文件资源等等。
-
- ## 安装
-
- 以 NPM 为例:
-
- ```shell
- npm install x-crawl
- ```
-
- ## 示例
-
- 获取 https://docs.github.com/zh/get-started 的标题为例:
-
- ```js
- // 导入模块 ES/CJS
- import XCrawl from 'x-crawl'
-
- // 创建一个爬虫实例
- const docsXCrawl = new XCrawl({
-   baseUrl: 'https://docs.github.com',
-   timeout: 10000,
-   intervalTime: { max: 2000, min: 1000 }
- })
-
- // 调用 fetchHTML API 爬取
- docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
-   console.log(jsdom.window.document.querySelector('title')?.textContent)
- })
- ```
-
- ## 核心概念
-
- ### XCrawl
-
- 通过 new XCrawl 创建一个爬虫实例。
-
- - 类型
-
- ```ts
- class XCrawl {
-   private readonly baseConfig
-   constructor(baseConfig?: IXCrawlBaseConifg)
-   fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
-   fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
-   fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
- }
- ```
-
- - <div id="cn-myXCrawl" style="text-decoration: none">示例</div>
-
- myXCrawl 为后面示例的爬虫实例。
-
- ```js
- const myXCrawl = new XCrawl({
-   baseUrl: 'https://xxx.com',
-   timeout: 10000,
-   // 下次请求的间隔时间, 多个请求才有效
-   intervalTime: {
-     max: 2000,
-     min: 1000
-   }
- })
- ```
-
- ### fetchData
-
- fetchData 是上面 <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> 实例的方法,通常用于爬取 API ,可获取 JSON 数据等等。
-
- - 类型
-
- ```ts
- function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
- ```
-
- - 示例
-
- ```js
- const requestConifg = [
-   { url: '/xxxx', method: 'GET' },
-   { url: '/xxxx', method: 'GET' },
-   { url: '/xxxx', method: 'GET' }
- ]
-
- myXCrawl.fetchData({
-   requestConifg, // 请求配置, 可以是 IRequestConfig | IRequestConfig[]
-   intervalTime: 800 // 下次请求的间隔时间, 多个请求才有效
- }).then(res => {
-   console.log(res)
- })
- ```
-
- ### fetchHTML
-
- fetchHTML 是上面 <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> 实例的方法,通常用于爬取 HTML 。
-
- - 类型
-
- ```ts
- function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
- ```
-
- - 示例
-
- ```js
- myXCrawl.fetchHTML('/xxx').then((jsdom) => {
-   console.log(jsdom.window.document.querySelector('title')?.textContent)
- })
- ```
-
- ### fetchFile
-
- fetchFile 是上面 <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> 实例的方法,通常用于爬取文件,可获取图片、pdf 文件等等。
-
- - 类型
-
- ```ts
- function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
- ```
-
- - 示例
-
- ```js
- const requestConifg = [
-   { url: '/xxxx' },
-   { url: '/xxxx' },
-   { url: '/xxxx' }
- ]
-
- myXCrawl.fetchFile({
-   requestConifg,
-   fileConfig: {
-     storeDir: path.resolve(__dirname, './upload') // 存放文件夹
-   }
- }).then(fileInfos => {
-   console.log(fileInfos)
- })
- ```
-
- ## 类型
-
- - IAnyObject
-
- ```ts
- interface IAnyObject extends Object {
-   [key: string | number | symbol]: any
- }
- ```
-
- - IMethod
-
- ```ts
- type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
- ```
-
- - IRequestConfig
-
- ```ts
- interface IRequestConfig {
-   url: string
-   method?: IMethod
-   headers?: IAnyObject
-   params?: IAnyObject
-   data?: any
-   timeout?: number
- }
- ```
-
- - IIntervalTime
-
- ```ts
- type IIntervalTime = number | {
-   max: number
-   min?: number
- }
- ```
-
- - IFetchBaseConifg
-
- ```ts
- interface IFetchBaseConifg {
-   requestConifg: IRequestConfig | IRequestConfig[]
-   intervalTime?: IIntervalTime
- }
- ```
-
- - IFetchCommon
-
- ```ts
- type IFetchCommon<T> = {
-   id: number
-   statusCode: number | undefined
-   headers: IncomingHttpHeaders // node:http type
-   data: T
- }[]
- ```
-
- - IFileInfo
-
- ```ts
- interface IFileInfo {
-   fileName: string
-   mimeType: string
-   size: number
-   filePath: string
- }
- ```
-
- - IXCrawlBaseConifg
-
- ```ts
- interface IXCrawlBaseConifg {
-   baseUrl?: string
-   timeout?: number
-   intervalTime?: IIntervalTime
- }
- ```
-
- - IFetchHTMLConfig
-
- ```ts
- interface IFetchHTMLConfig extends IRequestConfig {}
- ```
-
- - IFetchDataConfig
-
- ```ts
- interface IFetchDataConfig extends IFetchBaseConifg {
- }
- ```
-
- - IFetchFileConfig
-
- ```ts
- interface IFetchFileConfig extends IFetchBaseConifg {
-   fileConfig: {
-     storeDir: string
-   }
- }
- ```
-
- ## 更多
-
- 如有 **问题** 或 **需求** 请在 https://github.com/coder-hxl/x-crawl/issues 中提 **Issues** 。