
Commit 82d1b4f

Update: Docs
1 parent 9aa386f commit 82d1b4f

9 files changed: +122 −167


README.md (+22 −30)
@@ -2,24 +2,24 @@

 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)

-x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, crawl on a polling schedule, and more. It supports crawling data in asynchronous or synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages in batches, make network requests in batches, download file resources in batches, crawl on a polling schedule, and more. It supports crawling in asynchronous or synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.

 > If you find it useful, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it; your Star will be the motivation for my updates.

 ## Features

-- Supports crawling data in asynchronous or synchronous mode.
-- Flexible writing: multiple ways to write request configurations and obtain crawling results.
-- Flexible crawling intervals: no interval/fixed interval/random interval, so you can use or avoid high-concurrency crawling.
-- Simple configuration for crawling pages, batch network requests, batch downloading of file resources, polling crawls, etc.
-- Crawls SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parses the content with the jsdom library, and also supports parsing it yourself.
-- Form submission, keystrokes, event actions, screenshots of generated pages, etc.
-- Captures and records the success and failure of crawling, and highlights the reminders.
-- Written in TypeScript; has its own types and provides generics.
+- **🔥 Asynchronous/Synchronous** - Supports batch crawling in asynchronous or synchronous mode.
+- **⚙️ Multiple functions** - Batch crawling of pages, batch network requests, batch downloading of file resources, polling crawls, etc.
+- **🖋️ Flexible writing style** - Multiple ways to write crawling configurations and obtain crawling results.
+- **⏱️ Interval crawling** - No interval/fixed interval/random interval, so you can use or avoid high-concurrency crawling.
+- **☁️ Crawl SPA** - Batch crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR" (Server Side Rendering)).
+- **⚒️ Control pages** - The headless browser can submit forms, simulate keystrokes, trigger event actions, take screenshots of pages, etc.
+- **🧾 Capture record** - Captures and records crawling results, and highlights the reminders.
+- **🦾 TypeScript** - Ships its own types and implements complete typing through generics.

 ## Relationship with puppeteer

-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages and exposes the Browser instance and Page instance, making it more flexible.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages and exposes the Browser instance and Page instance.

 # Table of Contents

@@ -31,7 +31,6 @@ The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/p
 - [Choose crawling mode](#Choose-crawling-mode)
 - [Multiple crawler application instances](#Multiple-crawler-application-instances)
 - [Crawl page](#Crawl-page)
-  - [jsdom instance](#jsdom-instance)
   - [browser instance](#browser-instance)
   - [page instance](#page-instance)
 - [Crawl interface](#Crawl-interface)
@@ -130,7 +129,6 @@ running result:
 <div align="center">
   <img src="https://raw.githubusercontent.com/coder-hxl/x-crawl/main/assets/en/crawler-result.png" />
 </div>
-
 **Note:** Do not crawl at will; you can check the **robots.txt** protocol before crawling. This is just a demonstration of how to use x-crawl.

 ## Core concepts
@@ -196,19 +194,13 @@ const myXCrawl = xCrawl({
 })

 myXCrawl.crawlPage('https://xxx.com').then((res) => {
-  const { jsdom, browser, page } = res
+  const { browser, page } = res

   // Close the browser
   browser.close()
 })
 ```

-#### jsdom instance
-
-It is an instance object of [JSDOM](https://github.com/jsdom/jsdom); please refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
-
-**Note:** The jsdom instance only parses the content of the [page instance](#page-instance). If you use the page instance for event operations, you may need to parse the latest content yourself; for details, see the self-parsing section of the [page instance](#page-instance).
-
 #### browser instance

 It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser). For specific usage, please refer to [Browser](https://pptr.dev/api/puppeteer.browser).
@@ -327,7 +319,7 @@ const myXCrawl = xCrawl({
 myXCrawl.startPolling({ h: 2, m: 30 }, async (count, stopPolling) => {
   // will be executed every two and a half hours
   // crawlPage/crawlData/crawlFile
-  const { jsdom, browser, page } = await myXCrawl.crawlPage('https://xxx.com')
+  const { browser, page } = await myXCrawl.crawlPage('https://xxx.com')
   page.close()
 })
 ```
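
The polling callback also receives the run count and a stopPolling handle, so a poll can end itself; a minimal sketch (the cutoff of ten runs is an arbitrary example, not part of this commit):

```ts
myXCrawl.startPolling({ h: 2, m: 30 }, async (count, stopPolling) => {
  const { page } = await myXCrawl.crawlPage('https://xxx.com')
  page.close()

  // End the poll after ten runs
  if (count >= 10) {
    stopPolling()
  }
})
```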
@@ -521,10 +513,10 @@ crawlPage is the method of the crawler instance, usually used to crawl a page.
 - Look at the [CrawlPage](#CrawlPage-1) type

 ```ts
-function crawlPage: (
-  config: CrawlPageConfig,
-  callback?: (res: CrawlPage) => void
-) => Promise<CrawlPage>
+function crawlPage<T extends CrawlPageConfig = CrawlPageConfig>(
+  config: T,
+  callback?: ((res: CrawlPage) => void) | undefined
+): Promise<T extends string[] | CrawlBaseConfigV1[] ? CrawlPage[] : CrawlPage>
 ```
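
The change above makes crawlPage's return type depend on the config argument: a single URL or config resolves to one CrawlPage, while an array resolves to CrawlPage[]. A minimal sketch of both call shapes (placeholder URLs, not part of the commit):

```ts
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

// Single config: resolves to one CrawlPage
myXCrawl.crawlPage('https://xxx.com').then((res) => {
  res.page.close()
})

// Array config: resolves to CrawlPage[]
myXCrawl.crawlPage(['https://xxx.com/1', 'https://xxx.com/2']).then((results) => {
  for (const { page } of results) {
    page.close()
  }
})
```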
#### Example
@@ -536,8 +528,7 @@ const myXCrawl = xCrawl({ timeout: 10000 })

 // crawlPage API
 myXCrawl.crawlPage('https://xxx.com/xxxx').then((res) => {
-  const { jsdom, browser, page } = res
-  console.log(jsdom.window.document.querySelector('title')?.textContent)
+  const { browser, page } = res

   // Close the browser
   browser.close()
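
With the jsdom instance gone from the result, the title lookup deleted above can instead be done with puppeteer's evaluation helpers on the exposed page instance; a hedged sketch of the equivalent (not part of this commit):

```ts
myXCrawl.crawlPage('https://xxx.com/xxxx').then(async (res) => {
  const { browser, page } = res

  // Read the document title with puppeteer instead of jsdom
  const title = await page.$eval('title', (el) => el.textContent)
  console.log(title)

  // Close the browser
  browser.close()
})
```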
@@ -801,10 +792,12 @@ interface StartPollingConfig {

 ```js
 interface XCrawlInstance {
-  crawlPage: (
-    config: CrawlPageConfig,
+  crawlPage: <T extends CrawlPageConfig = CrawlPageConfig>(
+    config: T,
     callback?: (res: CrawlPage) => void
-  ) => Promise<CrawlPage>
+  ) => Promise<
+    T extends string[] | CrawlBaseConfigV1[] ? CrawlPage[] : CrawlPage
+  >

   crawlData: <T = any>(
     config: CrawlDataConfig,
@@ -847,7 +840,6 @@ interface CrawlPage {
   httpResponse: HTTPResponse | null // The HTTPResponse is from the puppeteer library
   browser: Browser // The Browser is from the puppeteer library
   page: Page // The Page is from the puppeteer library
-  jsdom: JSDOM // The JSDOM is from the jsdom library
 }
 ```
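
For reference, the callback parameter shown in XCrawlInstance receives this same CrawlPage shape; a minimal sketch of the callback style, assuming nothing beyond the signatures above:

```ts
// Callback style: res has the CrawlPage shape above
// ({ httpResponse, browser, page })
myXCrawl.crawlPage('https://xxx.com', (res) => {
  res.page.close()
})
```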

docs/cn.md (+24 −31)
@@ -2,24 +2,24 @@

 [English](https://github.com/coder-hxl/x-crawl#x-crawl) | 简体中文

-x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, crawl on a polling schedule, and more. It supports crawling data in asynchronous or synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages in batches, make network requests in batches, download file resources in batches, crawl on a polling schedule, and more. It supports crawling in asynchronous or synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.

 > If you find it useful, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it; your Star will be the motivation for my updates.

 ## Features

-- Supports crawling data in asynchronous or synchronous mode
-- Flexible writing: multiple ways to write request configurations and obtain crawling results
-- Flexible crawling intervals: no interval/fixed interval/random interval, so you can use or avoid high-concurrency crawling
-- Simple configuration for crawling pages, batch network requests, batch downloading of file resources, polling crawls, etc.
-- Crawls SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parses the content with the jsdom library, and also supports parsing it yourself
-- Form submission, keyboard input, event actions, screenshots of generated pages, etc.
-- Captures and records the success and failure of crawling, and highlights the reminders.
-- Written in TypeScript; has its own types and provides generics
+- **🔥 Asynchronous/Synchronous** - Supports batch crawling in asynchronous or synchronous mode
+- **⚙️ Multiple functions** - Batch crawling of pages, batch network requests, batch downloading of file resources, polling crawls, etc.
+- **🖋️ Flexible writing style** - Multiple ways to write crawling configurations and obtain crawling results
+- **⏱️ Interval crawling** - No interval/fixed interval/random interval, so you can use or avoid high-concurrency crawling
+- **☁️ Crawl SPA** - Batch crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR" (Server Side Rendering)).
+- **⚒️ Control pages** - The headless browser can submit forms, simulate keystrokes, trigger event actions, take screenshots of pages, etc.
+- **🧾 Capture record** - Captures and records crawling results, and highlights the reminders.
+- **🦾 TypeScript** - Ships its own types and implements complete typing through generics

 ## Relationship with puppeteer

-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages and exposes the Browser instance and Page instance, making it more flexible.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages and exposes the Browser instance and Page instance.

 # Table of Contents

@@ -31,7 +31,6 @@ The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/
 - [Choose crawling mode](#选择爬取模式)
 - [Multiple crawler application instances](#多个爬虫应用实例)
 - [Crawl page](#爬取页面)
-  - [jsdom instance](#jsdom-实例)
   - [browser instance](#browser-实例)
   - [page instance](#page-实例)
 - [Crawl interface](#爬取接口)
@@ -105,7 +104,7 @@ myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
   // Call the crawlPage API to crawl the page
   const { page } = await myXCrawl.crawlPage('https://www.bilibili.com/guochuang/')

-  // Get the URLs of the carousel image elements and set the request config
+  // Set the request config from the URLs of the carousel images
   const requestConfig = await page.$$eval('.chief-recom-item img', (imgEls) =>
     imgEls.map((item) => item.src)
   )
@@ -191,19 +190,13 @@ import xCrawl from 'x-crawl'
 const myXCrawl = xCrawl({ timeout: 10000 })

 myXCrawl.crawlPage('https://xxx.com').then((res) => {
-  const { jsdom, browser, page } = res
+  const { browser, page } = res

   // Close the browser
   browser.close()
 })
 ```

-#### jsdom instance
-
-It is an instance object of [JSDOM](https://github.com/jsdom/jsdom); please refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage
-
-**Note:** The jsdom instance only parses the content of the [page instance](#page-实例). If you use the page instance for event operations, you may need to parse the latest page content yourself; for details, see the self-parsing section of the [page instance](#page-实例).
-
 #### browser instance

 It is an instance object of [Browser](https://pptr.dev/api/puppeteer.browser); please refer to [Browser](https://pptr.dev/api/puppeteer.browser) for specific usage
@@ -321,7 +314,7 @@ const myXCrawl = xCrawl({
 myXCrawl.startPolling({ h: 2, m: 30 }, async (count, stopPolling) => {
   // will be executed every two and a half hours
   // crawlPage/crawlData/crawlFile
-  const { jsdom, browser, page } = await myXCrawl.crawlPage('https://xxx.com')
+  const { browser, page } = await myXCrawl.crawlPage('https://xxx.com')
   page.close()
 })
 ```
@@ -513,10 +506,10 @@ crawlPage is a method of the crawler instance, usually used to crawl a page.
 - Look at the [CrawlPage](#CrawlPage-1) type

 ```ts
-function crawlPage: (
-  config: CrawlPageConfig,
-  callback?: (res: CrawlPage) => void
-) => Promise<CrawlPage>
+function crawlPage<T extends CrawlPageConfig = CrawlPageConfig>(
+  config: T,
+  callback?: ((res: CrawlPage) => void) | undefined
+): Promise<T extends string[] | CrawlBaseConfigV1[] ? CrawlPage[] : CrawlPage>
 ```
#### Example
@@ -528,8 +521,7 @@ const myXCrawl = xCrawl({ timeout: 10000 })

 // crawlPage API
 myXCrawl.crawlPage('https://xxx.com/xxx').then((res) => {
-  const { jsdom, browser, page } = res
-  console.log(jsdom.window.document.querySelector('title')?.textContent)
+  const { browser, page } = res

   // Close the browser
   browser.close()
@@ -760,7 +752,7 @@ interface CrawlBaseConfigV2 {
 ### CrawlPageConfig

 ```ts
-type CrawlPageConfig = string | CrawlBaseConfigV1
+type CrawlPageConfig = string | string[] | CrawlBaseConfigV1 | CrawlBaseConfigV1[]
 ```

 ### CrawlDataConfig
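
CrawlPageConfig now also accepts arrays, which is what selects the CrawlPage[] branch of crawlPage's conditional return type. A hedged sketch of an object-array call, assuming CrawlBaseConfigV1 carries at least a url field (its full shape is not shown in this diff):

```ts
// Each entry is a CrawlBaseConfigV1; url is the only field assumed here
const configs = [{ url: 'https://xxx.com/a' }, { url: 'https://xxx.com/b' }]

// An array config resolves to CrawlPage[]
myXCrawl.crawlPage(configs).then((results) => {
  results.forEach(({ page }) => page.close())
})
```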
@@ -794,10 +786,12 @@ interface StartPollingConfig {

 ```js
 interface XCrawlInstance {
-  crawlPage: (
-    config: CrawlPageConfig,
+  crawlPage: <T extends CrawlPageConfig = CrawlPageConfig>(
+    config: T,
     callback?: (res: CrawlPage) => void
-  ) => Promise<CrawlPage>
+  ) => Promise<
+    T extends string[] | CrawlBaseConfigV1[] ? CrawlPage[] : CrawlPage
+  >

   crawlData: <T = any>(
     config: CrawlDataConfig,
@@ -840,7 +834,6 @@ interface CrawlPage {
   httpResponse: HTTPResponse | null // The HTTPResponse type from the puppeteer library
   browser: Browser // The Browser type from the puppeteer library
   page: Page // The Page type from the puppeteer library
-  jsdom: JSDOM // The JSDOM type from the jsdom library
 }
 ```

package.json (+1 −1)
@@ -1,7 +1,7 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "3.3.0",
+  "version": "4.0.0",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
