
Commit 56c9a0e

Rename the fetchHTML API to fetchPage
1 parent 4f863bb commit 56c9a0e
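For callers, the rename is mechanical: fetchHTML becomes fetchPage, and the shape of the resolved result is unchanged. A minimal before/after sketch based on the README example in this diff; the target URL and the crawler options are placeholders:

```ts
import xCrawl from 'x-crawl'

// Placeholder options, mirroring the README example below
const myXCrawl = xCrawl({ timeout: 10000, intervalTime: { max: 3000, min: 2000 } })

// Before this commit: myXCrawl.fetchHTML('https://www.example.com/').then(...)
// After this commit:
myXCrawl.fetchPage('https://www.example.com/').then((res) => {
  const { jsdom } = res.data // still parsed with JSDOM by default
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```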

File tree

16 files changed: +166 -219 lines changed


README.md

+36-39
@@ -16,7 +16,7 @@ x-crawl is a Nodejs multifunctional crawler library.
 
 ## Relationship with puppeteer
 
-The fetchHTML API internally uses the [puppeteer ](https://github.com/puppeteer/puppeteer) library to crawl pages.
+The fetchPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to crawl pages.
 
 The following can be done:
 
@@ -34,7 +34,7 @@ The following can be done:
 + [Example](#Example-1)
 + [Mode](#Mode)
 + [IntervalTime](#IntervalTime)
-* [fetchHTML](#fetchHTML)
+* [fetchPage](#fetchPage)
 + [Type](#Type-2)
 + [Example](#Example-2)
 + [About page](#About-page)
@@ -50,19 +50,19 @@ The following can be done:
 - [Types](#Types)
 * [AnyObject](#AnyObject)
 * [Method](#Method)
+* [RequestBaseConfig](#RequestBaseConfig)
 * [RequestConfig](#RequestConfig)
 * [IntervalTime](#IntervalTime)
 * [XCrawlBaseConfig](#XCrawlBaseConfig)
 * [FetchBaseConfigV1](#FetchBaseConfigV1)
-* [FetchBaseConfigV2](#FetchBaseConfigV2)
-* [FetchHTMLConfig](#FetchHTMLConfig )
+* [FetchPageConfig](#FetchPageConfig )
 * [FetchDataConfig](#FetchDataConfig)
 * [FetchFileConfig](#FetchFileConfig)
 * [StartPollingConfig](#StartPollingConfig)
 * [FetchResCommonV1](#FetchResCommonV1)
 * [FetchResCommonArrV1](#FetchResCommonArrV1)
 * [FileInfo](#FileInfo)
-* [FetchHTML](#FetchHTML)
+* [FetchPage](#FetchPage)
 - [More](#More)
 
 ## Install
@@ -90,9 +90,9 @@ const myXCrawl = xCrawl({
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
 myXCrawl.startPolling({ d: 1 }, () => {
-  // Call fetchHTML API to crawl HTML
-  myXCrawl.fetchHTML('https://www.youtube.com/').then((res) => {
-    const { jsdom } = res.data // By default, the JSDOM library is used to parse HTML
+  // Call fetchPage API to crawl Page
+  myXCrawl.fetchPage('https://www.youtube.com/').then((res) => {
+    const { jsdom } = res.data // By default, the JSDOM library is used to parse Page
 
     // Get the cover image element of the Promoted Video
     const imgEls = jsdom.window.document.querySelectorAll(
@@ -124,7 +124,7 @@ running result:
 <img src="https://raw.githubusercontent.com/coder-hxl/x-crawl/main/assets/en/crawler-result.png" />
 </div>
 
-**Note:** Do not crawl randomly, here is just to demonstrate how to use XCrawl, and control the request frequency within 3000ms to 2000ms.
+**Note:** Do not crawl randomly, here is just to demonstrate how to use x-crawl, and control the request frequency within 3000ms to 2000ms.
 
 ## Core concepts
 
@@ -154,9 +154,9 @@ const myXCrawl = xCrawl({
 })
 ```
 
-Passing **baseConfig** is for **fetchHTML/fetchData/fetchFile** to use these values by default.
+Passing **baseConfig** is for **fetchPage/fetchData/fetchFile** to use these values by default.
 
-**Note:** To avoid repeated creation of instances in subsequent examples, **myXCrawl** here will be the crawler instance in the **fetchHTML/fetchData/fetchFile** example.
+**Note:** To avoid repeated creation of instances in subsequent examples, **myXCrawl** here will be the crawler instance in the **fetchPage/fetchData/fetchFile** example.
 
 #### Mode
 
@@ -176,26 +176,26 @@ The intervalTime option defaults to undefined . If there is a setting value, it
 
 The first request is not to trigger the interval.
 
-### fetchHTML
+### fetchPage
 
-fetchHTML is the method of the above [myXCrawl](https://github.com/coder-hxl/x-crawl#Example-1) instance, usually used to crawl page.
+fetchPage is the method of the above [myXCrawl](https://github.com/coder-hxl/x-crawl#Example-1) instance, usually used to crawl page.
 
 #### Type
 
-- Look at the [FetchHTMLConfig](#FetchHTMLConfig) type
-- Look at the [FetchHTML](#FetchHTML-2) type
+- Look at the [FetchPageConfig](#FetchPageConfig) type
+- Look at the [FetchPage](#FetchPage-2) type
 
 ```ts
-function fetchHTML: (
-  config: FetchHTMLConfig,
-  callback?: (res: FetchHTML) => void
-) => Promise<FetchHTML>
+function fetchPage: (
+  config: FetchPageConfig,
+  callback?: (res: FetchPage) => void
+) => Promise<FetchPage>
 ```
 
 #### Example
 
 ```js
-myXCrawl.fetchHTML('/xxx').then((res) => {
+myXCrawl.fetchPage('/xxx').then((res) => {
   const { jsdom } = res.data
   console.log(jsdom.window.document.querySelector('title')?.textContent)
 })
@@ -296,7 +296,7 @@ function startPolling(
 ```js
 myXCrawl.startPolling({ h: 1, m: 30 }, () => {
   // will be executed every one and a half hours
-  // fetchHTML/fetchData/fetchFile
+  // fetchPage/fetchData/fetchFile
 })
 ```
 
@@ -316,17 +316,24 @@ interface AnyObject extends Object {
 type Method = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
 ```
 
+### RequestBaseConfig
+
+```ts
+interface RequestBaseConfig {
+  url: string
+  timeout?: number
+  proxy?: string
+}
+```
+
 ### RequestConfig
 
 ```ts
-interface RequestConfig {
-  url: string
+interface RequestConfig extends RequestBaseConfig {
   method?: Method
   headers?: AnyObject
   params?: AnyObject
   data?: any
-  timeout?: number
-  proxy?: string
 }
 ```
 
@@ -360,20 +367,10 @@ interface FetchBaseConfigV1 {
 }
 ```
 
-### FetchBaseConfigV2
-
-```ts
-interface FetchBaseConfigV2 {
-  url: string
-  timeout?: number
-  proxy?: string
-}
-```
-
-### FetchHTMLConfig
+### FetchPageConfig
 
 ```ts
-type FetchHTMLConfig = string | FetchBaseConfigV2
+type FetchPageConfig = string | RequestBaseConfig
 ```
 
 ### FetchDataConfig
@@ -432,10 +429,10 @@ interface FileInfo {
 }
 ```
 
-### FetchHTML
+### FetchPage
 
 ```ts
-interface FetchHTML {
+interface FetchPage {
   httpResponse: HTTPResponse | null // The type of HTTPResponse in the puppeteer library
   data: {
     page: Page // The type of Page in the puppeteer library
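The type changes above fold the removed FetchBaseConfigV2 into the new RequestBaseConfig, so FetchPageConfig now accepts either a URL string or a RequestBaseConfig object. A small sketch of both call forms, using only the fields shown in this diff; the URLs and the proxy address are placeholders:

```ts
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// FetchPageConfig = string | RequestBaseConfig
myXCrawl.fetchPage('https://www.example.com/')

// Object form with the RequestBaseConfig fields url / timeout / proxy
myXCrawl.fetchPage({
  url: 'https://www.example.com/',
  timeout: 5000,
  proxy: 'http://localhost:8080' // placeholder proxy address
})
```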

assets/cn/crawler.png

-9.5 KB

assets/en/crawler.png

31.4 KB

docs/cn.md

+37-41
@@ -16,7 +16,7 @@ x-crawl is a Nodejs multifunctional crawler library.
 
 ## Relationship with puppeteer
 
-The fetchHTML API internally uses the [puppeteer ](https://github.com/puppeteer/puppeteer) library to crawl pages.
+The fetchPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to crawl pages.
 
 The following can be done:
 
@@ -34,7 +34,7 @@ The fetchHTML API internally uses the [puppeteer ](https://github.com/puppeteer/puppeteer)
 + [Example](#示例-1)
 + [Mode](#模式)
 + [IntervalTime](#间隔时间)
-* [fetchHTML](#fetchHTML)
+* [fetchPage](#fetchPage)
 + [Type](#类型-2)
 + [Example](#示例-2)
 * [fetchData](#fetchData)
@@ -54,15 +54,14 @@ The fetchHTML API internally uses the [puppeteer ](https://github.com/puppeteer/puppeteer)
 * [IntervalTime](#IntervalTime)
 * [XCrawlBaseConfig](#XCrawlBaseConfig)
 * [FetchBaseConfigV1](#FetchBaseConfigV1)
-* [FetchBaseConfigV2](#FetchBaseConfigV2)
-* [FetchHTMLConfig](#FetchHTMLConfig )
+* [FetchPageConfig](#FetchPageConfig )
 * [FetchDataConfig](#FetchDataConfig)
 * [FetchFileConfig](#FetchFileConfig)
 * [StartPollingConfig](#StartPollingConfig)
 * [FetchResCommonV1](#FetchResCommonV1)
 * [FetchResCommonArrV1](#FetchResCommonArrV1)
 * [FileInfo](#FileInfo)
-* [FetchHTML](#FetchHTML)
+* [FetchPage](#FetchPage)
 - [More](#更多)
 
 ## Install
@@ -83,16 +82,16 @@ import xCrawl from 'x-crawl'
 
 // 2.Create a crawler instance
 const myXCrawl = xCrawl({
-  timeout: 10000, // overtime time
-  intervalTime: { max: 3000, min: 2000 } // control request frequency
+  timeout: 10000, // request timeout
+  intervalTime: { max: 3000, min: 2000 } // control the request frequency
 })
 
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
 myXCrawl.startPolling({ d: 1 }, () => {
-  // Call fetchHTML API to crawl HTML
-  myXCrawl.fetchHTML('https://www.bilibili.com/guochuang/').then((res) => {
-    const { jsdom } = res.data // By default, the JSDOM library is used to parse HTML
+  // Call fetchPage API to crawl Page
+  myXCrawl.fetchPage('https://www.bilibili.com/guochuang/').then((res) => {
+    const { jsdom } = res.data // By default, the JSDOM library is used to parse Page
 
     // Get the carousel image elements
     const imgEls = jsdom.window.document.querySelectorAll('.carousel-wrapper .chief-recom-item img')
@@ -117,7 +116,7 @@ myXCrawl.startPolling({ d: 1 }, () => {
 <img src="https://raw.githubusercontent.com/coder-hxl/x-crawl/main/assets/cn/crawler-result.png" />
 </div>
 
-**Note:** Do not crawl randomly, here is just to demonstrate how to use XCrawl, and control the request frequency within 3000ms to 2000ms.
+**Note:** Do not crawl randomly, here is just to demonstrate how to use x-crawl, and control the request frequency within 3000ms to 2000ms.
 
 ## Core concepts
 
@@ -147,9 +146,9 @@ const myXCrawl = xCrawl({
 })
 ```
 
-Passing **baseConfig** is for **fetchHTML/fetchData/fetchFile** to use these values by default.
+Passing **baseConfig** is for **fetchPage/fetchData/fetchFile** to use these values by default.
 
-**Note:** To avoid repeated creation of instances in subsequent examples, **myXCrawl** here will be the crawler instance in the **fetchHTML/fetchData/fetchFile** example.
+**Note:** To avoid repeated creation of instances in subsequent examples, **myXCrawl** here will be the crawler instance in the **fetchPage/fetchData/fetchFile** example.
 
 #### Mode
 
@@ -169,26 +168,26 @@ The intervalTime option defaults to undefined . If there is a setting value, it will
 
 The first request is not to trigger the interval.
 
-### fetchHTML
+### fetchPage
 
-fetchHTML is the method of the [myXCrawl](https://github.com/coder-hxl/x-crawl/blob/main/document/cn.md#%E7%A4%BA%E4%BE%8B-1) instance, usually used to crawl page.
+fetchPage is the method of the [myXCrawl](https://github.com/coder-hxl/x-crawl/blob/main/document/cn.md#%E7%A4%BA%E4%BE%8B-1) instance, usually used to crawl page.
 
 #### Type
 
-- Look at the [FetchHTMLConfig](#FetchHTMLConfig) type
-- Look at the [FetchHTML](#FetchHTML-2) type
+- Look at the [FetchPageConfig](#FetchPageConfig) type
+- Look at the [FetchPage](#FetchPage-2) type
 
 ```ts
-function fetchHTML: (
-  config: FetchHTMLConfig,
-  callback?: (res: FetchHTML) => void
-) => Promise<FetchHTML>
+function fetchPage: (
+  config: FetchPageConfig,
+  callback?: (res: FetchPage) => void
+) => Promise<FetchPage>
 ```
 
 #### Example
 
 ```js
-myXCrawl.fetchHTML('/xxx').then((res) => {
+myXCrawl.fetchPage('/xxx').then((res) => {
   const { jsdom } = res.data
   console.log(jsdom.window.document.querySelector('title')?.textContent)
 })
@@ -289,7 +288,7 @@ function startPolling: (
 ```js
 myXCrawl.startPolling({ h: 1, m: 30 }, () => {
   // will be executed every one and a half hours
-  // fetchHTML/fetchData/fetchFile
+  // fetchPage/fetchData/fetchFile
 })
 ```
 
@@ -309,17 +308,24 @@ interface AnyObject extends Object {
 type Method = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
 ```
 
+### RequestBaseConfig
+
+```ts
+interface RequestBaseConfig {
+  url: string
+  timeout?: number
+  proxy?: string
+}
+```
+
 ### RequestConfig
 
 ```ts
-interface RequestConfig {
-  url: string
+interface RequestConfig extends RequestBaseConfig {
   method?: Method
   headers?: AnyObject
   params?: AnyObject
   data?: any
-  timeout?: number
-  proxy?: string
 }
 ```
 
@@ -353,20 +359,10 @@ interface FetchBaseConfigV1 {
 }
 ```
 
-### FetchBaseConfigV2
-
-```ts
-interface FetchBaseConfigV2 {
-  url: string
-  timeout?: number
-  proxy?: string
-}
-```
-
-### FetchHTMLConfig
+### FetchPageConfig
 
 ```ts
-type FetchHTMLConfig = string | FetchBaseConfigV2
+type FetchPageConfig = string | RequestBaseConfig
 ```
 
 ### FetchDataConfig
@@ -425,10 +421,10 @@ interface FileInfo {
 }
 ```
 
-### FetchHTML
+### FetchPage
 
 ```ts
-interface FetchHTML {
+interface FetchPage {
   httpResponse: HTTPResponse | null // The type of HTTPResponse in the puppeteer library
   data: {
     page: Page // The type of Page in the puppeteer library
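Both documents keep the optional callback parameter in the fetchPage signature alongside the returned promise. A minimal sketch of the callback form, assuming the FetchPage result shape shown above (httpResponse from puppeteer plus data.jsdom); the URL is a placeholder:

```ts
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// fetchPage(config, callback?) also resolves to a Promise<FetchPage>
myXCrawl.fetchPage('https://www.example.com/', (res) => {
  console.log(res.httpResponse?.status()) // HTTPResponse | null from puppeteer
  console.log(res.data.jsdom.window.document.title)
})
```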

package.json

+1-1
@@ -1,7 +1,7 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "2.2.1",
+  "version": "2.3.0",
   "author": "coderHXL",
   "description": "XCrawl is a Nodejs multifunctional crawler library.",
   "license": "MIT",
