
Commit 82030e9

Commit message: 其他 (Misc.)

1 parent 4890a9b commit 82030e9

File tree

7 files changed: +360 −571 lines changed

README.md: +28 −266
@@ -1,9 +1,14 @@
-# <div id="en">x-crawl</div>
+# x-crawl
 
-English | <a href="#cn" style="text-decoration: none">简体中文</a>
+English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/document/cn.md)
 
 XCrawl is a Node.js multifunctional crawler library. Crawl HTML, JSON, file resources, etc. through simple configuration.
 
+## Highlights
+
+- Call the API to crawl HTML, JSON, file resources, etc.
+- Batch requests can be sent asynchronously or synchronously
+
 ## Install
 
 Take NPM as an example:
@@ -33,13 +38,13 @@ docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
 })
 ```
 
-## Key concept
+## Core concepts
 
 ### XCrawl
 
 Create a crawler instance via new XCrawl.
 
-- Type
+#### Type
 
 ```ts
 class XCrawl {
@@ -51,33 +56,42 @@ class XCrawl {
 }
 ```
 
-- <div id="myXCrawl">Example</div>
+#### <div id="myXCrawl">Example</div>
 
 myXCrawl is the crawler instance used in the following examples.
 
 ```js
 const myXCrawl = new XCrawl({
   baseUrl: 'https://xxx.com',
   timeout: 10000,
-  // The interval of the next request, multiple requests are valid
+  // The interval between requests; only valid for multiple requests
   intervalTime: {
     max: 2000,
     min: 1000
   }
 })
 ```
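The intervalTime option accepts either a fixed number of milliseconds or a { max, min? } range. A minimal sketch of how such a value could be resolved to a concrete delay (resolveIntervalTime is a hypothetical helper for illustration, not part of x-crawl's API):

```javascript
// Hypothetical helper: resolve an intervalTime value of shape
// number | { max: number, min?: number } into a delay in milliseconds.
function resolveIntervalTime(intervalTime) {
  if (typeof intervalTime === 'number') return intervalTime
  const { max, min = 0 } = intervalTime
  // Pick a random delay in [min, max)
  return min + Math.floor(Math.random() * (max - min))
}

resolveIntervalTime(800) // always 800
resolveIntervalTime({ max: 2000, min: 1000 }) // a random value in [1000, 2000)
```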
+#### About the mode
+
+The mode option defaults to 'async'.
+
+- async: in batch requests, the next request is sent without waiting for the current request to complete
+- sync: in batch requests, each request waits for the previous one to complete before it is sent
+
+If an interval time is set, the crawler also waits for the interval to elapse before sending the next request.
+
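The difference between the two modes can be sketched in plain JavaScript. This is a simplified model, not x-crawl's implementation; sendRequest is a stand-in for a real HTTP call:

```javascript
// Stand-in for a real HTTP request.
async function sendRequest(url) {
  return `response from ${url}`
}

// 'async' mode: dispatch every request without waiting for the previous one.
function batchAsync(urls) {
  return Promise.all(urls.map((url) => sendRequest(url)))
}

// 'sync' mode: wait for each request to complete before sending the next.
async function batchSync(urls) {
  const results = []
  for (const url of urls) {
    results.push(await sendRequest(url))
  }
  return results
}
```

Both return results in request order; they differ only in when each request is dispatched, which is why 'sync' combined with an interval puts the gentlest load on the target site.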
 ### fetchHTML
 
 fetchHTML is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl HTML.
 
-- Type
+#### Type
 
 ```ts
 function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
 ```
 
-- Example
+#### Example
 
 ```js
 myXCrawl.fetchHTML('/xxx').then((jsdom) => {
@@ -89,13 +103,13 @@ myXCrawl.fetchHTML('/xxx').then((jsdom) => {
 
 fetchData is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl APIs and obtain JSON data.
 
-- Type
+#### Type
 
 ```ts
 function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
 ```
 
-- Example
+#### Example
 
 ```js
 const requestConifg = [
@@ -116,13 +130,13 @@ myXCrawl.fetchData({
 
 fetchFile is a method of the <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl files such as images and PDF files.
 
-- Type
+#### Type
 
 ```ts
 function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
 ```
 
-- Example
+#### Example
 
 ```js
 const requestConifg = [
@@ -202,7 +216,7 @@ type IFetchCommon<T> = {
 - IFileInfo
 
 ```ts
-IFileInfo {
+interface IFileInfo {
   fileName: string
   mimeType: string
   size: number
@@ -217,6 +231,7 @@ interface IXCrawlBaseConifg {
   baseUrl?: string
   timeout?: number
   intervalTime?: IIntervalTime
+  mode?: 'async' | 'sync' // default: 'async'
 }
 ```
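How the documented default might be applied can be sketched as follows (normalizeBaseConfig is hypothetical; x-crawl's internals may differ):

```javascript
// Hypothetical normalization of an IXCrawlBaseConifg-shaped object:
// an explicit mode wins, otherwise the documented default 'async' applies.
function normalizeBaseConfig(baseConfig = {}) {
  return { mode: 'async', ...baseConfig }
}
```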
@@ -246,256 +261,3 @@ interface IFetchFileConfig extends IFetchBaseConifg {
 ## More
 
 If you have any **questions** or **needs**, please submit **Issues** at https://github.com/coder-hxl/x-crawl/issues.
-
-
----
-
-# <div id="cn">x-crawl</div>
-
-<a href="#en" style="text-decoration: none">English</a> | 简体中文
-
-XCrawl is a Node.js multifunctional crawler library. Crawl HTML, JSON, file resources, etc. through simple configuration.
-
-## Install
-
-Take NPM as an example:
-
-```shell
-npm install x-crawl
-```
-
-## Example
-
-Take getting the title of https://docs.github.com/zh/get-started as an example:
-
-```js
-// Import the module (ES/CJS)
-import XCrawl from 'x-crawl'
-
-// Create a crawler instance
-const docsXCrawl = new XCrawl({
-  baseUrl: 'https://docs.github.com',
-  timeout: 10000,
-  intervalTime: { max: 2000, min: 1000 }
-})
-
-// Call the fetchHTML API to crawl
-docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
-  console.log(jsdom.window.document.querySelector('title')?.textContent)
-})
-```
-
-## Core concepts
-
-### XCrawl
-
-Create a crawler instance via new XCrawl.
-
-- Type
-
-```ts
-class XCrawl {
-  private readonly baseConfig
-  constructor(baseConfig?: IXCrawlBaseConifg)
-  fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
-  fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
-  fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
-}
-```
-
-- <div id="cn-myXCrawl" style="text-decoration: none">Example</div>
-
-myXCrawl is the crawler instance used in the following examples.
-
-```js
-const myXCrawl = new XCrawl({
-  baseUrl: 'https://xxx.com',
-  timeout: 10000,
-  // The interval between requests; only valid for multiple requests
-  intervalTime: {
-    max: 2000,
-    min: 1000
-  }
-})
-```
-
-### fetchData
-
-fetchData is a method of the <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl APIs and obtain JSON data.
-
-- Type
-
-```ts
-function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
-```
-
-- Example
-
-```js
-const requestConifg = [
-  { url: '/xxxx', method: 'GET' },
-  { url: '/xxxx', method: 'GET' },
-  { url: '/xxxx', method: 'GET' }
-]
-
-myXCrawl.fetchData({
-  requestConifg, // request config, can be IRequestConfig | IRequestConfig[]
-  intervalTime: 800 // the interval between requests; only valid for multiple requests
-}).then(res => {
-  console.log(res)
-})
-```
-
-### fetchHTML
-
-fetchHTML is a method of the <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl HTML.
-
-- Type
-
-```ts
-function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
-```
-
-- Example
-
-```js
-myXCrawl.fetchHTML('/xxx').then((jsdom) => {
-  console.log(jsdom.window.document.querySelector('title')?.textContent)
-})
-```
-
-### fetchFile
-
-fetchFile is a method of the <a href="#cn-myXCrawl" style="text-decoration: none">myXCrawl</a> instance above, usually used to crawl files such as images and PDF files.
-
-- Type
-
-```ts
-function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
-```
-
-- Example
-
-```js
-const requestConifg = [
-  { url: '/xxxx' },
-  { url: '/xxxx' },
-  { url: '/xxxx' }
-]
-
-myXCrawl.fetchFile({
-  requestConifg,
-  fileConfig: {
-    storeDir: path.resolve(__dirname, './upload') // storage folder
-  }
-}).then(fileInfos => {
-  console.log(fileInfos)
-})
-```
-
-## Types
-
-- IAnyObject
-
-```ts
-interface IAnyObject extends Object {
-  [key: string | number | symbol]: any
-}
-```
-
-- IMethod
-
-```ts
-type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
-```
-
-- IRequestConfig
-
-```ts
-interface IRequestConfig {
-  url: string
-  method?: IMethod
-  headers?: IAnyObject
-  params?: IAnyObject
-  data?: any
-  timeout?: number
-}
-```
-
-- IIntervalTime
-
-```ts
-type IIntervalTime = number | {
-  max: number
-  min?: number
-}
-```
-
-- IFetchBaseConifg
-
-```ts
-interface IFetchBaseConifg {
-  requestConifg: IRequestConfig | IRequestConfig[]
-  intervalTime?: IIntervalTime
-}
-```
-
-- IFetchCommon
-
-```ts
-type IFetchCommon<T> = {
-  id: number
-  statusCode: number | undefined
-  headers: IncomingHttpHeaders // node:http type
-  data: T
-}[]
-```
-
-- IFileInfo
-
-```ts
-interface IFileInfo {
-  fileName: string
-  mimeType: string
-  size: number
-  filePath: string
-}
-```
-
-- IXCrawlBaseConifg
-
-```ts
-interface IXCrawlBaseConifg {
-  baseUrl?: string
-  timeout?: number
-  intervalTime?: IIntervalTime
-}
-```
-
-- IFetchHTMLConfig
-
-```ts
-interface IFetchHTMLConfig extends IRequestConfig {}
-```
-
-- IFetchDataConfig
-
-```ts
-interface IFetchDataConfig extends IFetchBaseConifg {
-}
-```
-
-- IFetchFileConfig
-
-```ts
-interface IFetchFileConfig extends IFetchBaseConifg {
-  fileConfig: {
-    storeDir: string
-  }
-}
-```
-
-## More
-
-If you have any **questions** or **needs**, please submit **Issues** at https://github.com/coder-hxl/x-crawl/issues.

0 commit comments
