Commit f9a9184

Bugfix: the page in the crawlPage API result reported an error. Expose the browser and let the user decide when to close it
1 parent cb1fe49 commit f9a9184

11 files changed

+175
-102
lines changed

README.md

+36-13
Original file line number | Diff line number | Diff line change
@@ -36,6 +36,9 @@ We can do the following:
3636
+ [Choose crawling mode](#Choose-crawling-mode)
3737
+ [Multiple crawler application instances](#Multiple-crawler-application-instances)
3838
* [Crawl page](#Crawl-page)
39+
+ [jsdom](#jsdom)
40+
+ [browser](#browser)
41+
+ [page](#page)
3942
* [Crawl interface](#Crawl-interface)
4043
* [Crawl files](#Crawl-files)
4144
* [Start polling](#Start-polling)
@@ -49,7 +52,6 @@ We can do the following:
4952
* [crawlPage](#crawlPage)
5053
+ [Type](#Type-2)
5154
+ [Example](#Example-2)
52-
+ [About page](#About-page)
5355
* [crawlData](#crawlData)
5456
+ [Type](#Type-3)
5557
+ [Example](#Example-3)
@@ -107,7 +109,7 @@ const myXCrawl = xCrawl({
107109
myXCrawl.startPolling({ d: 1 }, () => {
108110
// Call crawlPage API to crawl Page
109111
myXCrawl.crawlPage('https://www.youtube.com/').then((res) => {
110-
const { jsdom } = res.data // By default, the JSDOM library is used to parse Page
112+
const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
111113

112114
// Get the cover image element of the Promoted Video
113115
const imgEls = jsdom.window.document.querySelectorAll(
@@ -127,6 +129,9 @@ myXCrawl.startPolling({ d: 1 }, () => {
127129
requestConfig,
128130
fileConfig: { storeDir: path.resolve(__dirname, './upload') }
129131
})
132+
133+
// Close the browser
134+
browser.close()
130135
})
131136
})
132137
```
@@ -206,10 +211,27 @@ const myXCrawl = xCrawl({
206211
})
207212

208213
myXCrawl.crawlPage('https://xxx.com').then(res => {
209-
const { jsdom, page } = res.data
214+
const { jsdom, browser, page } = res
215+
216+
// Close the browser
217+
browser.close()
210218
})
211219
```
212220

221+
#### jsdom
222+
223+
Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
224+
225+
#### browser
226+
227+
**Purpose of calling close:** the browser keeps running after the crawl, so the process will not terminate on its own. Do not call close if you still need [crawlPage](#crawlPage) or [page](#page) later. The browser is shared within the crawlPage API of a crawler instance, so modifying properties of the browser object affects the browser used inside crawlPage as well as the returned page and the returned browser.
228+
229+
Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
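
Because the browser is shared across crawlPage calls, one way to follow the advice above is to finish every crawl first and close the browser once at the end. The sketch below is only an illustration under stated assumptions: `ClosableBrowser`, `PageResult`, `crawlAllThenClose`, `fakeBrowser` and `fakeCrawl` are hypothetical names that are not part of x-crawl, and the fake crawler stands in for `myXCrawl.crawlPage` so the snippet runs without launching a real browser.

```typescript
// Minimal stand-in for the puppeteer Browser; only close() matters here.
interface ClosableBrowser {
  close(): Promise<void>
}

// Hypothetical, simplified result shape (assumption, not the real CrawlPage type).
interface PageResult {
  browser: ClosableBrowser
  title: string
}

// Hypothetical helper: run every crawl first, then close the shared browser
// exactly once, even if one of the crawls throws.
async function crawlAllThenClose(
  urls: string[],
  crawl: (url: string) => Promise<PageResult>
): Promise<string[]> {
  let shared: ClosableBrowser | undefined
  try {
    const titles: string[] = []
    for (const url of urls) {
      const res = await crawl(url)
      shared = res.browser // the same shared browser comes back each time
      titles.push(res.title)
    }
    return titles
  } finally {
    if (shared) await shared.close()
  }
}

// Fake crawler used only to demonstrate the call order.
let closeCalls = 0
const fakeBrowser: ClosableBrowser = {
  close: async () => {
    closeCalls += 1
  }
}
const fakeCrawl = async (url: string): Promise<PageResult> => ({
  browser: fakeBrowser,
  title: `title of ${url}`
})
```

Closing inside `finally` keeps the process from hanging on an open browser when a crawl fails, while still leaving the browser available to every crawl that runs before it.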
230+
231+
#### page
232+
233+
The page attribute can be used for interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
234+
213235
### Crawl interface
214236

215237
Crawl interface data through [crawlData()](#crawlData)
@@ -276,7 +298,10 @@ myXCrawl.startPolling({ h: 2, m: 30 }, (count, stopPolling) => {
276298
// will be executed every two and a half hours
277299
// crawlPage/crawlData/crawlFile
278300
myXCrawl.crawlPage('https://xxx.com').then(res => {
279-
const { jsdom, page } = res.data
301+
const { jsdom, browser, page } = res
302+
303+
// Close the browser
304+
browser.close()
280305
})
281306
})
282307
```
@@ -480,15 +505,14 @@ const myXCrawl = xCrawl({ timeout: 10000 })
480505
481506
// crawlPage API
482507
myXCrawl.crawlPage('https://xxx.com/xxxx').then((res) => {
483-
const { jsdom, page } = res.data
508+
const { jsdom, browser, page } = res
484509
console.log(jsdom.window.document.querySelector('title')?.textContent)
510+
511+
// Close the browser
512+
browser.close()
485513
})
486514
```
487515

488-
#### About page
489-
490-
The page attribute can be used for interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
491-
492516
### crawlData
493517

494518
crawlData is a method of the crawler instance, typically used to crawl APIs and obtain JSON data.
@@ -771,10 +795,9 @@ interface FileInfo {
771795
```ts
772796
interface CrawlPage {
773797
httpResponse: HTTPResponse | null // The type of HTTPResponse in the puppeteer library
774-
data: {
775-
page: Page // The type of Page in the puppeteer library
776-
jsdom: JSDOM // The type of JSDOM in the jsdom library
777-
}
798+
browser: Browser // The type of Browser in the puppeteer library
799+
page: Page // The type of Page in the puppeteer library
800+
jsdom: JSDOM // The type of JSDOM in the jsdom library
778801
}
779802
```
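
The interface change above moves `page` and `jsdom` from `res.data` to the top level of the result and adds `browser`. For code that has to work on both sides of this change, a small adapter is one option. The following is a sketch only: `OldShape`, `NewShape` and `readPageResult` are hypothetical names, and the generic parameters stand in for the puppeteer and jsdom types so the snippet stays self-contained.

```typescript
// Result layout before this commit: page and jsdom nested under data.
interface OldShape<P, J> {
  data: { page: P; jsdom: J }
}

// Result layout after this commit: flat, with the browser exposed.
interface NewShape<P, J, B> {
  browser: B
  page: P
  jsdom: J
}

// Hypothetical adapter: read page and jsdom from either layout.
// browser is only present when the new layout is passed in.
function readPageResult<P, J, B>(
  res: OldShape<P, J> | NewShape<P, J, B>
): { page: P; jsdom: J; browser?: B } {
  if ('data' in res) {
    return { page: res.data.page, jsdom: res.data.jsdom }
  }
  return { page: res.page, jsdom: res.jsdom, browser: res.browser }
}

// Usage with placeholder values instead of real puppeteer/jsdom objects.
const fromOld = readPageResult({ data: { page: 'page', jsdom: 'jsdom' } })
const fromNew = readPageResult({ browser: 'browser', page: 'page', jsdom: 'jsdom' })
```

Callers can then check `browser` for `undefined` before calling `close()`, since the old layout never exposed it.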
780803

docs/cn.md

+42-13
@@ -29,19 +29,27 @@ crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
2929
# Table of contents
3030

3131
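- [Install](#安装)
32+
3233
- [Example](#示例)
34+
3335
- [Core concepts](#核心概念)
3436
* [Create application](#创建应用)
3537
+ [A crawler application instance](#一个爬虫应用实例)
3638
+ [Choose crawling mode](#选择爬取模式)
3739
+ [Multiple crawler application instances](#多个爬虫应用实例)
3840
* [Crawl page](#爬取页面)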
41+
42+
+ [jsdom](#jsdom)
43+
+ [browser](#browser)
44+
+ [page](#page)
45+
3946
* [Crawl interface](#爬取接口)
4047
* [Crawl files](#爬取文件)
4148
* [Start polling](#启动轮询)
4249
* [Request interval time](#请求间隔时间)
4350
* [Multiple ways of writing requestConfig options](#requestConfig-选项的多种写法)
4451
* [Multiple ways to get results](#获取结果的多种方式)
52+
4553
- [API](#API)
4654
* [xCrawl](#xCrawl)
4755
+ [Type](#类型-1)
@@ -52,13 +60,13 @@ crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
5260
* [crawlData](#crawlData)
5361
+ [Type](#类型-3)
5462
+ [Example](#示例-3)
55-
+ [About page](#关于-page)
5663
* [crawlFile](#crawlFile)
5764
+ [Type](#类型-4)
5865
+ [Example](#示例-4)
5966
* [startPolling](#startPolling)
6067
+ [Type](#类型-5)
6168
+ [Example](#示例-5)
69+
6270
- [Types](#类型-6)
6371
* [AnyObject](#AnyObject)
6472
* [Method](#Method)
@@ -76,6 +84,7 @@ crawlPage API internally uses [puppeteer](https://github.com/puppeteer/puppeteer)
7684
* [CrawlResCommonArrV1](#CrawlResCommonArrV1)
7785
* [FileInfo](#FileInfo)
7886
* [CrawlPage](#CrawlPage)
87+
7988
- [More](#更多)
8089

8190
## Install
@@ -106,7 +115,7 @@ const myXCrawl = xCrawl({
106115
myXCrawl.startPolling({ d: 1 }, () => {
107116
// Call crawlPage API to crawl Page
108117
myXCrawl.crawlPage('https://www.bilibili.com/guochuang/').then((res) => {
109-
const { jsdom } = res.data // By default, the JSDOM library is used to parse Page
118+
const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
110119
111120
// Get the carousel image elements
112121
const imgEls = jsdom.window.document.querySelectorAll('.chief-recom-item img')
@@ -120,6 +129,9 @@ myXCrawl.startPolling({ d: 1 }, () => {
120129
requestConfig,
121130
fileConfig: { storeDir: path.resolve(__dirname, './upload') }
122131
})
132+
133+
// Close the browser
134+
browser.close()
123135
})
124136
})
125137
```
@@ -198,10 +210,27 @@ import xCrawl from 'x-crawl'
198210
const myXCrawl = xCrawl({ timeout: 10000 })
199211
200212
myXCrawl.crawlPage('https://xxx.com').then(res => {
201-
const { jsdom, page } = res.data
213+
const { jsdom, browser, page } = res
214+
215+
// Close the browser
216+
browser.close()
202217
})
203218
```
204219
220+
#### jsdom
221+
222+
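Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
223+
224+
#### browser
225+
226+
**Purpose of calling close:** the browser keeps running after the crawl, so the process will not terminate on its own. Do not call close if you still need [crawlPage](#crawlPage) or [page](#page) later. The browser is shared within the crawlPage API of a crawler instance, so modifying properties of the browser object affects the browser used inside crawlPage as well as the returned page and the returned browser.
227+
228+
Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
229+
230+
#### page
231+
232+
The page attribute can be used for interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).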
233+
205234
### 爬取接口
206235
207236
Crawl interface data through [crawlData()](#crawlData)
@@ -268,7 +297,9 @@ myXCrawl.startPolling({ h: 2, m: 30 }, (count, stopPolling) => {
268297
// will be executed every two and a half hours
269298
// crawlPage/crawlData/crawlFile
270299
myXCrawl.crawlPage('https://xxx.com').then(res => {
271-
const { jsdom, page } = res.data
300+
const { jsdom, browser, page } = res
301+
302+
browser.close()
272303
})
273304
})
274305
```
@@ -471,15 +502,14 @@ const myXCrawl = xCrawl({ timeout: 10000 })
471502
472503
// crawlPage API
473504
myXCrawl.crawlPage('https://xxx.com/xxx').then((res) => {
474-
const { jsdom, page } = res.data
505+
const { jsdom, browser, page } = res
475506
console.log(jsdom.window.document.querySelector('title')?.textContent)
507+
508+
// Close the browser
509+
browser.close()
476510
})
477511
```
478512
479-
#### About page
480-
481-
The page attribute can be used for interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
482-
483513
### crawlData
484514
485515
crawlData is a method of the crawler instance, typically used to crawl APIs and obtain JSON data.
@@ -764,10 +794,9 @@ interface FileInfo {
764794
```ts
765795
interface CrawlPage {
766796
httpResponse: HTTPResponse | null // The type of HTTPResponse in the puppeteer library
767-
data: {
768-
page: Page // The type of Page in the puppeteer library
769-
jsdom: JSDOM // The type of JSDOM in the jsdom library
770-
}
797+
browser: Browser // The type of Browser in the puppeteer library
798+
page: Page // The type of Page in the puppeteer library
799+
jsdom: JSDOM // The type of JSDOM in the jsdom library
771800
}
772801
```
773802

package.json

+3-3
@@ -1,7 +1,7 @@
11
{
22
"private": true,
33
"name": "x-crawl",
4-
"version": "3.1.2",
4+
"version": "3.2.0",
55
"author": "coderHXL",
66
"description": "x-crawl is a flexible nodejs crawler library. ",
77
"license": "MIT",
@@ -11,8 +11,8 @@
1111
"build-dts": "tsc && prettier --write ./publish/src",
1212
"build-strict": "pnpm test-dev && pnpm build && pnpm test-pro",
1313
"start": "rollup --config script/start.mjs",
14-
"test-dev": "jest test/modal/test.ts dev",
15-
"test-pro": "jest test/modal/test.ts pro",
14+
"test-dev": "jest test/modal/test.ts dev --detectOpenHandles",
15+
"test-pro": "jest test/modal/test.ts pro --detectOpenHandles",
1616
"prettier": "prettier --write ."
1717
},
1818
"dependencies": {
