Commit 1c66d05

Update: Docs
1 parent 0744a15 commit 1c66d05

5 files changed: +119 -45 lines changed


README.md (+37 -11)
@@ -2,7 +2,7 @@
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)

-x-crawl is a flexible nodejs crawler library. Used to crawl pages, batch network requests, and batch download file resources. Crawl data in asynchronous or synchronous mode, 3 ways to get results, and 5 ways to write requestConfig. Runs on nodejs, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control page interactions, perform batch network requests, and batch-download file resources. It supports crawling data in asynchronous/synchronous mode. It runs on nodejs, is flexible and simple to use, and is friendly to JS/TS developers.

 If you find it helpful, you can support the [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.

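For orientation before the feature-list changes below, here is a minimal end-to-end sketch assembled from the crawlPage examples elsewhere in this commit (the URL is a placeholder):

```js
import xCrawl from 'x-crawl'

// Create a crawler application instance (timeout in milliseconds)
const myXCrawl = xCrawl({ timeout: 10000 })

// Crawl a page; per the docs below, the result exposes a jsdom instance
// parsed from the page content by default
myXCrawl.crawlPage('https://www.example.com').then((res) => {
  const { jsdom } = res
  console.log(jsdom.window.document.querySelector('title').textContent)
})
```
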
@@ -11,8 +11,8 @@ If you find it helpful, you can support the [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.
 - Crawls data in asynchronous/synchronous ways.
 - Three ways of obtaining results: Promise, Callback, and Promise + Callback.
 - requestConfig has 5 ways of writing.
-- The anthropomorphic request interval time.
-- In a simple configuration, you can capture pages, JSON, file resources, and so on.
+- Flexible request interval (see the sketch after this hunk).
+- Crawling pages, batch network requests, and batch downloading of file resources can all be performed with simple configuration.
 - Polling function for crawling at regular intervals.
 - Built-in Puppeteer crawls the page, and the jsdom library is used to parse it; you can also parse it yourself.
 - Written in TypeScript, with type hints and generics provided.
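As a concrete illustration of the interval and requestConfig items above — a hedged sketch; the array form of `requestConfig` and the `intervalTime: { max, min }` shape are assumptions based on the API sections referenced in this README's table of contents, so verify them against your installed version:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// Batch requests with a spaced-out interval.
// Assumed option shapes: requestConfig as an array of targets, and
// intervalTime as { max, min } — a random delay in that range between requests.
myXCrawl
  .crawlData({
    requestConfig: [
      'https://www.example.com/api/one',
      'https://www.example.com/api/two'
    ],
    intervalTime: { max: 3000, min: 1000 }
  })
  .then((res) => console.log(res))
```
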
@@ -214,38 +214,64 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {

 #### jsdom instance

-Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+It is an instance object of [JSDOM](https://github.com/jsdom/jsdom); refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+
+**Note:** The jsdom instance only parses the content of the [page instance](#page-instance). If you use the page instance for event operations, you may need to parse the latest content yourself; for details, see the self-parsing example under [page instance](#page-instance).

 #### browser instance

-The browser instance is a headless browser without a UI shell. What he does is to bring **all modern network platform functions** provided by the browser rendering engine to the code.
+It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser). For specific usage, refer to [Browser](https://pptr.dev/api/puppeteer.browser).

-**Purpose of calling close:** The browser instance will always be running internally, causing the file not to be terminated. Do not call [crawlPage](#crawlPage) or [page](#page) if you need to use it later. When you modify the properties of a browser instance, it will affect the browser instance inside the crawlPage API of the crawler instance, the page instance that returns the result, and the browser instance, because the browser instance is shared within the crawlPage API of the same crawler instance.
+The browser instance is a headless browser without a UI shell. It brings **all modern web platform features** provided by the browser rendering engine to the code.

-Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
+**Note:** The browser instance keeps an internal event loop running, so the process will not terminate on its own; execute browser.close() to stop it. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) afterwards. Also, modifying the properties of the browser instance affects the browser instance inside the crawlPage API of the crawler instance as well as the page and browser instances in returned results, because the browser instance is shared within the crawlPage API of the same crawler instance.

 #### page instance

+It is an instance object of [Page](https://pptr.dev/api/puppeteer.page). The instance can also perform interactive operations such as events; for specific usage, refer to [page](https://pptr.dev/api/puppeteer.page).
+
+**Parse the page by yourself**
+
+Take the jsdom library as an example:
+
+```js
+import xCrawl from 'x-crawl'
+import { JSDOM } from 'jsdom'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => {
+  const { page } = res
+
+  // Get the latest page content
+  const content = await page.content()
+
+  // Use the jsdom library to parse it yourself
+  const jsdom = new JSDOM(content)
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+})
+```

 **Take Screenshot**

 ```js
 import xCrawl from 'x-crawl'

-const testXCrawl = xCrawl({ timeout: 10000 })
+const myXCrawl = xCrawl({ timeout: 10000 })

-testXCrawl
+myXCrawl
   .crawlPage('https://xxx.com')
   .then(async (res) => {
     const { page } = res

+    // Get a screenshot of the rendered page
     await page.screenshot({ path: './upload/page.png' })

     console.log('Screen capture is complete')
   })
 ```

-The page instance can also perform interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
-
 ### Crawl interface

 Crawl interface data through [crawlData()](#crawlData)
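To make the Note above about the browser's event loop concrete, a minimal sketch — it assumes, per the crawlPage result shape shown in these docs, that res exposes the shared browser instance, and that no further crawlPage or page work is needed afterwards:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://www.example.com').then(async (res) => {
  const { jsdom, browser } = res

  console.log(jsdom.window.document.querySelector('title').textContent)

  // The shared browser keeps the process alive; close it only once no further
  // crawlPage/page work is needed (it is shared by every crawlPage call of the
  // same crawler instance, so closing also invalidates later page instances).
  await browser.close()
})
```
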

docs/cn.md (+40 -19)
@@ -2,7 +2,7 @@
 [English](https://github.com/coder-hxl/x-crawl#x-crawl) | Simplified Chinese

-x-crawl is a flexible nodejs crawler library. It is used to crawl pages, make batch network requests, and batch-download file resources. It crawls data in asynchronous or synchronous mode, with 3 ways of getting results and 5 ways of writing requestConfig. It runs on nodejs and is friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control page interactions, perform batch network requests, and batch-download file resources. It supports crawling data in asynchronous/synchronous mode. It runs on nodejs, is flexible and simple to use, and is friendly to JS/TS developers.

 If you find it helpful, you can support the [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.

@@ -11,8 +11,8 @@ x-crawl is a flexible nodejs crawler library.
 - Supports crawling data in asynchronous/synchronous mode.
 - Supports Promise, Callback, and Promise + Callback — 3 ways of getting results.
 - requestConfig has 5 ways of writing.
-- Anthropomorphic request interval time
-- Pages, JSON, file resources, and so on can be crawled with only simple configuration
+- Flexible request interval time
+- Crawling pages, batch network requests, and batch downloading of file resources can be done with only simple configuration
 - Polling function, crawl at regular intervals.
 - Built-in puppeteer crawls the page, and the jsdom library is used to parse it; you can also parse it yourself.
 - Written in TypeScript, with type hints and generics.
@@ -30,9 +30,7 @@ The crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
 # Table of Contents

 - [Installation](#安装)
-
 - [Examples](#示例)
-
 - [Core Concepts](#核心概念)
   * [Create an application](#创建应用)
     + [A crawler application instance](#一个爬虫应用实例)
@@ -41,14 +39,13 @@ The crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
   * [Crawl a page](#爬取页面)
     + [jsdom instance](#jsdom-实例)
     + [browser instance](#browser-实例)
-    + [page-instance](#page-实例)
+    + [page instance](#page-实例)
   * [Crawl an interface](#爬取接口)
   * [Crawl files](#爬取文件)
   * [Start polling](#启动轮询)
   * [Request interval time](#请求间隔时间)
   * [Multiple ways of writing the requestConfig option](#requestConfig-选项的多种写法)
   * [Multiple ways of getting results](#获取结果的多种方式)
-
 - [API](#API)
   * [xCrawl](#xCrawl)
     + [Types](#类型-1)
@@ -65,7 +62,6 @@ The crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
   * [startPolling](#startPolling)
     + [Types](#类型-5)
     + [Examples](#示例-5)
-
 - [Types](#类型-6)
   * [AnyObject](#AnyObject)
   * [Method](#Method)
@@ -82,8 +78,7 @@ The crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
   * [CrawlResCommonV1](#CrawlResCommonV1)
   * [CrawlResCommonArrV1](#CrawlResCommonArrV1)
   * [FileInfo](#FileInfo)
-  * [CrawlPage](#CrawlPage)
-
+  * [CrawlPage](#CrawlPage)
 - [More](#更多)

 ## Installation
@@ -113,7 +108,7 @@ const myXCrawl = xCrawl({
 myXCrawl.startPolling({ d: 1 }, () => {
   // Call the crawlPage API to crawl the page
   myXCrawl.crawlPage('https://www.bilibili.com/guochuang/').then((res) => {
-    const { browser, jsdom } = res // The JSDOM library is used to parse the page by default
+    const { jsdom } = res // The JSDOM library is used to parse the page by default

     // Get the carousel image elements
     const imgEls = jsdom.window.document.querySelectorAll('.chief-recom-item img')
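In the hunk above, `{ d: 1 }` is the polling interval. A hedged sketch of the same pattern with a placeholder URL — reading `d`/`h`/`m` as day/hour/minute is an assumption based on the option name, so check the startPolling types in the API section:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// { d: 1 } is read here as "poll every 1 day"; d/h/m for day/hour/minute
// is an assumption — verify against the startPolling types.
myXCrawl.startPolling({ d: 1 }, () => {
  myXCrawl.crawlPage('https://www.example.com').then((res) => {
    const { jsdom } = res
    console.log(jsdom.window.document.querySelector('title').textContent)
  })
})
```
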
@@ -211,38 +206,64 @@

 #### jsdom instance

-Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+It is an instance object of [JSDOM](https://github.com/jsdom/jsdom); refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+
+**Note:** The jsdom instance only parses the content of the [page instance](#page-实例). If you have performed event operations on the page instance, you may need to parse the latest page content yourself; for details, see the self-parsing example under [page instance](#page-实例).

 #### browser instance

-The browser instance is a headless browser without a UI shell. It brings **all modern web platform features** provided by the browser rendering engine to the code.
+It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser); refer to [Browser](https://pptr.dev/api/puppeteer.browser) for specific usage.

-**Purpose of calling close:** The browser instance keeps running internally, so the file will not terminate. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) later. When you modify the properties of the browser instance, it affects the browser instance inside the crawlPage API of the crawler instance as well as the page and browser instances in returned results, because the browser instance is shared within the crawlPage API of the same crawler instance.
+The browser instance is a headless browser without a UI shell. It brings **all modern web platform features** provided by the browser rendering engine to the code.

-Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
+**Note:** The browser instance keeps an internal event loop running, so the file will not terminate on its own; execute browser.close() to stop it. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) afterwards, because modifying the properties of the browser instance affects the browser instance inside the crawlPage API of the crawler instance as well as the page and browser instances in returned results — the browser instance is shared within the crawlPage API of the same crawler instance.

 #### page instance

+It is an instance object of [Page](https://pptr.dev/api/puppeteer.page). The instance can also perform interactive operations such as events; refer to [page](https://pptr.dev/api/puppeteer.page) for specific usage.
+
+**Parse the page by yourself**
+
+Take the jsdom library as an example:
+
+```js
+import xCrawl from 'x-crawl'
+import { JSDOM } from 'jsdom'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => {
+  const { page } = res
+
+  // Get the latest page content
+  const content = await page.content()
+
+  // Use the jsdom library to parse it yourself
+  const jsdom = new JSDOM(content)
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+})
+```

 **Take Screenshot**

 ```js
 import xCrawl from 'x-crawl'

-const testXCrawl = xCrawl({ timeout: 10000 })
+const myXCrawl = xCrawl({ timeout: 10000 })

-testXCrawl
+myXCrawl
   .crawlPage('https://xxx.com')
   .then(async (res) => {
     const { page } = res

+    // Get a screenshot of the rendered page
     await page.screenshot({ path: './upload/page.png' })

     console.log('Screen capture is complete')
   })
 ```

-The page instance can also perform interactive operations such as events; refer to [page](https://pptr.dev/api/puppeteer.page) for specific usage.
-
 ### Crawl an interface

 Crawl interface data through [crawlData()](#crawlData)
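Since both README diffs note that the page instance supports event-style interactions, here is a small sketch using standard Puppeteer Page methods (`page.type` and `page.click` are Puppeteer APIs); the URL and selectors are placeholders:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://www.example.com').then(async (res) => {
  const { page } = res

  // Interact with the rendered page via standard Puppeteer Page methods
  // (placeholder selectors — adapt them to the target page)
  await page.type('#search-input', 'x-crawl')
  await page.click('#search-button')

  // After interacting, re-read the content — as the Note on the jsdom
  // instance advises — and parse it yourself if needed
  const content = await page.content()
  console.log(content.length)
})
```
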

package.json (+2 -2)
@@ -1,9 +1,9 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "3.2.4",
+  "version": "3.2.5",
   "author": "coderHXL",
-  "description": "x-crawl is a flexible nodejs crawler library. ",
+  "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
   "main": "src/index.ts",
   "scripts": {

publish/README.md (+37 -11)
@@ -2,7 +2,7 @@
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)

-x-crawl is a flexible nodejs crawler library. Used to crawl pages, batch network requests, and batch download file resources. Crawl data in asynchronous or synchronous mode, 3 ways to get results, and 5 ways to write requestConfig. Runs on nodejs, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control page interactions, perform batch network requests, and batch-download file resources. It supports crawling data in asynchronous/synchronous mode. It runs on nodejs, is flexible and simple to use, and is friendly to JS/TS developers.

 If you find it helpful, you can support the [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.

@@ -11,8 +11,8 @@ If you find it helpful, you can support the [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.
 - Crawls data in asynchronous/synchronous ways.
 - Three ways of obtaining results: Promise, Callback, and Promise + Callback.
 - requestConfig has 5 ways of writing.
-- The anthropomorphic request interval time.
-- In a simple configuration, you can capture pages, JSON, file resources, and so on.
+- Flexible request interval.
+- Crawling pages, batch network requests, and batch downloading of file resources can all be performed with simple configuration.
 - Polling function for crawling at regular intervals.
 - Built-in Puppeteer crawls the page, and the jsdom library is used to parse it; you can also parse it yourself.
 - Written in TypeScript, with type hints and generics provided.
@@ -214,38 +214,64 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {

 #### jsdom instance

-Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+It is an instance object of [JSDOM](https://github.com/jsdom/jsdom); refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
+
+**Note:** The jsdom instance only parses the content of the [page instance](#page-instance). If you use the page instance for event operations, you may need to parse the latest content yourself; for details, see the self-parsing example under [page instance](#page-instance).

 #### browser instance

-The browser instance is a headless browser without a UI shell. What he does is to bring **all modern network platform functions** provided by the browser rendering engine to the code.
+It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser). For specific usage, refer to [Browser](https://pptr.dev/api/puppeteer.browser).

-**Purpose of calling close:** The browser instance will always be running internally, causing the file not to be terminated. Do not call [crawlPage](#crawlPage) or [page](#page) if you need to use it later. When you modify the properties of a browser instance, it will affect the browser instance inside the crawlPage API of the crawler instance, the page instance that returns the result, and the browser instance, because the browser instance is shared within the crawlPage API of the same crawler instance.
+The browser instance is a headless browser without a UI shell. It brings **all modern web platform features** provided by the browser rendering engine to the code.

-Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
+**Note:** The browser instance keeps an internal event loop running, so the process will not terminate on its own; execute browser.close() to stop it. Do not call it if you still need [crawlPage](#crawlPage) or [page](#page) afterwards. Also, modifying the properties of the browser instance affects the browser instance inside the crawlPage API of the crawler instance as well as the page and browser instances in returned results, because the browser instance is shared within the crawlPage API of the same crawler instance.

 #### page instance

+It is an instance object of [Page](https://pptr.dev/api/puppeteer.page). The instance can also perform interactive operations such as events; for specific usage, refer to [page](https://pptr.dev/api/puppeteer.page).
+
+**Parse the page by yourself**
+
+Take the jsdom library as an example:
+
+```js
+import xCrawl from 'x-crawl'
+import { JSDOM } from 'jsdom'
+
+const myXCrawl = xCrawl({ timeout: 10000 })
+
+myXCrawl.crawlPage('https://www.xxx.com').then(async (res) => {
+  const { page } = res
+
+  // Get the latest page content
+  const content = await page.content()
+
+  // Use the jsdom library to parse it yourself
+  const jsdom = new JSDOM(content)
+
+  console.log(jsdom.window.document.querySelector('title').textContent)
+})
+```

 **Take Screenshot**

 ```js
 import xCrawl from 'x-crawl'

-const testXCrawl = xCrawl({ timeout: 10000 })
+const myXCrawl = xCrawl({ timeout: 10000 })

-testXCrawl
+myXCrawl
   .crawlPage('https://xxx.com')
   .then(async (res) => {
     const { page } = res

+    // Get a screenshot of the rendered page
     await page.screenshot({ path: './upload/page.png' })

     console.log('Screen capture is complete')
   })
 ```

-The page instance can also perform interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
-
 ### Crawl interface

 Crawl interface data through [crawlData()](#crawlData)

publish/package.json (+3 -2)
@@ -1,6 +1,6 @@
 {
   "name": "x-crawl",
-  "version": "3.2.4",
+  "version": "3.2.5",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
@@ -9,7 +9,8 @@
     "typescript",
     "crawl",
     "crawler",
-    "spider"
+    "spider",
+    "flexible"
   ],
   "main": "dist/index.js",
   "types": "dist/index.d.ts",
