Commit 30d7db6

1 parent 7c92774 commit 30d7db6

5 files changed: +228 -147 lines changed

README.md

+76-49
@@ -6,8 +6,9 @@ XCrawl is a Nodejs multifunctional crawler library. Crawl HTML, JSON, file resou
66

77
## highlights
88

9-
- Call the API to grab HTML, JSON, file resources, etc
10-
- Batch requests can choose the mode of sending asynchronously or sending synchronously
9+
- Simple configuration to grab HTML, JSON, file resources, etc.
10+
- Batch requests can be sent in asynchronous or synchronous mode
11+
- Human-like request intervals
1112

1213
## Install
1314

@@ -33,7 +34,8 @@ const docsXCrawl = new XCrawl({
3334
})
3435

3536
// Call fetchHTML API to crawl
36-
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
37+
docsXCrawl.fetchHTML('/zh/get-started').then((res) => {
38+
const { jsdom } = res.data
3739
console.log(jsdom.window.document.querySelector('title')?.textContent)
3840
})
3941
```
@@ -42,23 +44,21 @@ docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
4244
4345
### XCrawl
4446
45-
Create a crawler instance via new XCrawl.
47+
Create a crawler instance via new XCrawl. The request queue is maintained by each instance method itself and is not shared.
4648
4749
#### Type
4850
4951
```ts
5052
class XCrawl {
5153
private readonly baseConfig
5254
constructor(baseConfig?: IXCrawlBaseConifg)
53-
fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
55+
fetchHTML(config: IFetchHTMLConfig): Promise<IFetchHTML>
5456
fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
5557
fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
5658
}
5759
```
5860
59-
#### <div id="myXCrawl">Example</div>
60-
61-
myXCrawl is the crawler instance of the following example.
61+
#### Example
6262
6363
```js
6464
const myXCrawl = new XCrawl({
@@ -72,7 +72,11 @@ const myXCrawl = new XCrawl({
7272
})
7373
```
7474
75-
#### About the pattern
75+
The values passed in **baseConfig** are used as defaults by **fetchHTML/fetchData/fetchFile**.
76+
77+
**Note:** To avoid creating an instance repeatedly, the **fetchHTML/fetchData/fetchFile** examples below reuse the **myXCrawl** instance created here.
78+
79+
#### Mode
7680
7781
The mode option defaults to async.
7882
@@ -81,27 +85,37 @@ The mode option defaults to async .
8185
8286
If an interval time is set, each request waits for the interval to elapse before it is sent.
8387
88+
#### IntervalTime
89+
90+
The intervalTime option defaults to undefined. If a value is set, each request waits for that time before it is sent, which limits concurrency and reduces pressure on the server.
91+
92+
- number: A fixed time to wait before each request
93+
- Object: A time chosen at random between min and max, which looks more human-like
94+
95+
The first request does not trigger the interval.
96+
8497
### fetchHTML
8598
86-
fetchHTML is the method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, usually used to crawl HTML.
99+
fetchHTML is a method of the [myXCrawl](https://github.com/coder-hxl/x-crawl#Example-1) instance above, usually used to crawl HTML.
87100
88101
#### Type
89102
90103
```ts
91-
function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
104+
function fetchHTML(config: IFetchHTMLConfig): Promise<IFetchHTML>
92105
```
93106

94107
#### Example
95108

96109
```js
97-
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
110+
myXCrawl.fetchHTML('/xxx').then((res) => {
111+
const { jsdom } = res.data
98112
console.log(jsdom.window.document.querySelector('title')?.textContent)
99113
})
100114
```
101115
102116
### fetchData
103117
104-
fetchData is the method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, which is usually used to crawl APIs to obtain JSON data and so on.
118+
fetchData is a method of the [myXCrawl](https://github.com/coder-hxl/x-crawl#Example-1) instance above, usually used to crawl APIs that return JSON and similar data.
105119
106120
#### Type
107121
@@ -120,15 +134,15 @@ const requestConifg = [
120134

121135
myXCrawl.fetchData({
122136
requestConifg, // Request configuration, can be IRequestConfig | IRequestConfig[]
123-
intervalTime: 800 // Interval between next requests, multiple requests are valid
137+
intervalTime: { max: 5000, min: 1000 } // Used instead of the intervalTime passed in when creating myXCrawl
124138
}).then(res => {
125139
console.log(res)
126140
})
127141
```
128142
129143
### fetchFile
130144
131-
fetchFile is the method of the above <a href="#myXCrawl" style="text-decoration: none">myXCrawl</a> instance, which is usually used to crawl files, such as pictures, pdf files, etc.
145+
fetchFile is a method of the [myXCrawl](https://github.com/coder-hxl/x-crawl#Example-1) instance above, usually used to crawl files such as pictures and PDF files.
132146
133147
#### Type
134148
@@ -157,23 +171,23 @@ myXCrawl.fetchFile({
157171
158172
## Types
159173
160-
- IAnyObject
174+
#### IAnyObject
161175
162176
```ts
163177
interface IAnyObject extends Object {
164178
[key: string | number | symbol]: any
165179
}
166180
```
167181
168-
- IMethod
182+
#### IMethod
169183
170-
```ts
184+
```ts
171185
type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
172186
```
173187
174-
- IRequestConfig
188+
#### IRequestConfig
175189
176-
```ts
190+
```ts
177191
interface IRequestConfig {
178192
url: string
179193
method?: IMethod
@@ -184,7 +198,7 @@ interface IRequestConfig {
184198
}
185199
```
186200
187-
- IIntervalTime
201+
#### IIntervalTime
188202
189203
```ts
190204
type IIntervalTime = number | {
@@ -193,7 +207,7 @@ type IIntervalTime = number | {
193207
}
194208
```
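
As a rough illustration of how an IIntervalTime value could be turned into a concrete wait, here is a minimal sketch. `resolveIntervalTime` is a hypothetical helper, not part of x-crawl, and it assumes the object form carries numeric `max` and `min` fields measured in milliseconds.

```js
// Hypothetical helper, not part of x-crawl's API.
// Assumes the object form has numeric `max` and `min` fields (milliseconds).
function resolveIntervalTime(intervalTime, random = Math.random) {
  // undefined: no interval is configured, so there is nothing to wait for
  if (intervalTime === undefined) return 0
  // number: the same fixed wait before each request
  if (typeof intervalTime === 'number') return intervalTime
  // object: an integer picked at random from the inclusive range [min, max]
  const { max, min } = intervalTime
  return Math.floor(random() * (max - min + 1)) + min
}

console.log(resolveIntervalTime(800)) // fixed wait
console.log(resolveIntervalTime({ max: 5000, min: 1000 })) // random wait in [1000, 5000]
```

The `random` parameter exists only so the random branch can be exercised deterministically in tests.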
195209
196-
- IFetchBaseConifg
210+
#### IFetchBaseConifg
197211
198212
```ts
199213
interface IFetchBaseConifg {
@@ -202,58 +216,71 @@ interface IFetchBaseConifg {
202216
}
203217
```
204218
205-
- IFetchCommon
219+
#### IXCrawlBaseConifg
206220
207221
```ts
208-
type IFetchCommon<T> = {
209-
id: number
210-
statusCode: number | undefined
211-
headers: IncomingHttpHeaders // node:http type
212-
data: T
213-
}[]
222+
interface IXCrawlBaseConifg {
223+
baseUrl?: string
224+
timeout?: number
225+
intervalTime?: IIntervalTime
226+
mode?: 'async' | 'sync'
227+
}
214228
```
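
The `mode` and `intervalTime` fields interact when several requests are sent in one call. The sketch below is one way to picture that, not x-crawl's actual implementation. It assumes that `'async'` starts each request without waiting for the previous one to finish, that `'sync'` waits for each request to finish first, and that `intervalTime` is a fixed number of milliseconds. `runBatch`, `sleep`, and `send` are hypothetical names.

```js
// Hypothetical sketch, not x-crawl's implementation.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// `send` is any function that takes one request config and returns a Promise.
async function runBatch(requestConfigs, send, { mode = 'async', intervalTime = 0 } = {}) {
  if (mode === 'sync') {
    const results = []
    for (let i = 0; i < requestConfigs.length; i++) {
      // The first request goes out immediately; later ones wait for the interval.
      if (i > 0 && intervalTime > 0) await sleep(intervalTime)
      // Wait for this request to finish before starting the next one.
      results.push(await send(requestConfigs[i]))
    }
    return results
  }

  // 'async': stagger the start times by the interval, but do not wait
  // for earlier requests to finish before starting later ones.
  return Promise.all(
    requestConfigs.map((config, i) => sleep(i * intervalTime).then(() => send(config)))
  )
}
```

Under these assumptions, a slow response in `'sync'` mode delays every later request, while in `'async'` mode only the interval separates the start of consecutive requests.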
215229
216-
- IFileInfo
230+
#### IFetchHTMLConfig
217231
218232
```ts
219-
interface IFileInfo {
220-
fileName: string
221-
mimeType: string
222-
size: number
223-
filePath: string
233+
type IFetchHTMLConfig = string | IRequestConfig
234+
```
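
Because IFetchHTMLConfig accepts either a plain URL string or a full IRequestConfig, code that handles it may want to normalize the two forms first. A minimal sketch of that idea, using a hypothetical `toRequestConfig` helper that is not part of x-crawl:

```js
// Hypothetical helper, not part of x-crawl's API.
// Accepts either form of IFetchHTMLConfig and returns an
// IRequestConfig-shaped object.
function toRequestConfig(config) {
  // A bare string is treated as the URL of the request.
  if (typeof config === 'string') return { url: config }
  // An object is copied so the caller's config is never mutated.
  return { ...config }
}

console.log(toRequestConfig('/xxx'))
console.log(toRequestConfig({ url: '/xxx', method: 'GET' }))
```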
235+
236+
#### IFetchDataConfig
237+
238+
```ts
239+
interface IFetchDataConfig extends IFetchBaseConifg {
224240
}
225241
```
226242
227-
- IXCrawlBaseConifg
243+
#### IFetchFileConfig
228244
229245
```ts
230-
interface IXCrawlBaseConifg {
231-
baseUrl?: string
232-
timeout?: number
233-
intervalTime?: IIntervalTime
234-
mode?: 'async' | 'sync' // default: 'async'
246+
interface IFetchFileConfig extends IFetchBaseConifg {
247+
fileConfig: {
248+
storeDir: string
249+
}
235250
}
236251
```
237252
238-
- IFetchHTMLConfig
253+
#### IFetchCommon
239254
240255
```ts
241-
interface IFetchHTMLConfig extends IRequestConfig {}
256+
type IFetchCommon<T> = {
257+
id: number
258+
statusCode: number | undefined
259+
headers: IncomingHttpHeaders // node:http type
260+
data: T
261+
}[]
242262
```
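
Since IFetchCommon is an array with one entry per request, a caller will often want to separate the requests that succeeded from those that did not. The sketch below shows one way to do that. `partitionByStatus` is a hypothetical helper, the sample data is made up, and treating only 2xx status codes as success is an assumption.

```js
// Hypothetical helper, not part of x-crawl's API.
// Splits an IFetchCommon-shaped array by HTTP status code.
function partitionByStatus(results) {
  const ok = []
  const failed = []
  for (const item of results) {
    // statusCode may be undefined, which is counted as a failure here.
    const success =
      item.statusCode !== undefined && item.statusCode >= 200 && item.statusCode < 300
    if (success) {
      ok.push(item)
    } else {
      failed.push(item)
    }
  }
  return { ok, failed }
}

// Made-up sample in the shape of IFetchCommon<T>.
const sample = [
  { id: 1, statusCode: 200, headers: {}, data: { title: 'first' } },
  { id: 2, statusCode: 404, headers: {}, data: null },
  { id: 3, statusCode: undefined, headers: {}, data: null }
]

const { ok, failed } = partitionByStatus(sample)
console.log(ok.map((item) => item.id), failed.map((item) => item.id))
```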
243263
244-
- IFetchDataConfig
264+
#### IFileInfo
245265
246266
```ts
247-
interface IFetchDataConfig extends IFetchBaseConifg {
267+
interface IFileInfo {
268+
fileName: string
269+
mimeType: string
270+
size: number
271+
filePath: string
248272
}
249273
```
250274
251-
- IFetchFileConfig
275+
#### IFetchHTML
252276
253277
```ts
254-
interface IFetchFileConfig extends IFetchBaseConifg {
255-
fileConfig: {
256-
storeDir: string
278+
interface IFetchHTML {
279+
statusCode: number | undefined
280+
headers: IncomingHttpHeaders
281+
data: {
282+
raw: string // HTML String
283+
jsdom: JSDOM // HTML parsing using the jsdom library
257284
}
258285
}
259286
```
