Commit e9f4dba

Update: Docs

1 parent 1cb1835

7 files changed: +46 -74 lines changed

README.md (+19 -30)

@@ -2,30 +2,24 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. You can crawl pages and control operations such as pages, batch network requests, and batch downloads of file resources. Support asynchronous/synchronous mode crawling data. Running on nodejs, the usage is flexible and simple, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you feel good, you can give [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it, your Star will be the motivation for my update.
 
 ## Features
 
 - Support asynchronous/synchronous way to crawl data.
-- Flexible writing, support a variety of ways to write request configuration and obtain crawl results.
-- Flexible crawling interval, up to you to use/avoid high concurrent crawling.
-- With simple configuration, operations such as crawling pages, batch network requests, and batch download of file resources can be performed.
-- Possess polling function to crawl data regularly.
-- The built-in puppeteer crawls the page, and uses the jsdom library to analyze the content of the page, and also supports self-analysis.
-- Capture the success and failure of the climb and highlight the reminder.
+- Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
+- Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
-## Relationship with puppeteer
+## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents
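Given the updated wording that crawlPage exposes the Browser and Page instances, a minimal usage sketch follows; `browser` and `jsdom` appear in this commit's own snippets, while the `page` field and the `timeout` option are assumptions:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 }) // `timeout` is an assumed option

myXCrawl.crawlPage('https://example.com').then(async (res) => {
  // `browser` and `jsdom` are confirmed by this diff; `page` is assumed
  // from the "exposes Browser and Page instances" wording.
  const { browser, page, jsdom } = res

  // Parse the rendered document with jsdom (the default, per the example below)
  console.log(jsdom.window.document.title)

  // Use the underlying puppeteer instances directly
  await page.screenshot({ path: './example.png' })
  await browser.close()
})
```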

@@ -91,7 +85,7 @@ npm install x-crawl
 
 ## Example
 
-Regular crawling: Get the recommended pictures of the youtube homepage every other day as an example:
+Timed crawling: take automatically crawling the cover images of Airbnb Plus listings every day as an example:
 
 ```js
 // 1.Import module ES/CJS
@@ -105,23 +99,18 @@ const myXCrawl = xCrawl({
 
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
-myXCrawl.startPolling({ d: 1 }, () => {
-  // Call crawlPage API to crawl Page
-  myXCrawl.crawlPage('https://www.youtube.com/').then((res) => {
-    const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
+myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
+  myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes').then((res) => {
+    const { jsdom } = res // By default, the JSDOM library is used to parse Page
 
-    // Get the cover image element of the Promoted Video
-    const imgEls = jsdom.window.document.querySelectorAll(
-      '.yt-core-image--fill-parent-width'
-    )
+    // Get the cover image elements for Plus listings
+    const imgEls = jsdom.window.document
+      .querySelector('.a1stauiv')
+      ?.querySelectorAll('picture img')
 
     // set request configuration
-    const requestConfig = []
-    imgEls.forEach((item) => {
-      if (item.src) {
-        requestConfig.push(item.src)
-      }
-    })
+    const requestConfig: string[] = []
+    imgEls?.forEach((item) => requestConfig.push(item.src))
 
     // Call the crawlFile API to crawl pictures
     myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
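The new callback signature `(count, stopPolling)` implies the poll can be stopped from inside the callback, and the updated features list names no/fixed/random crawl intervals. A minimal sketch under those assumptions; `intervalTime` with `{ max, min }` is assumed rather than confirmed by this diff:

```js
import xCrawl from 'x-crawl'

// Assumed option: a random 1-3 s pause between batch requests
const myXCrawl = xCrawl({ intervalTime: { max: 3000, min: 1000 } })

// Poll once a day and stop after the seventh run
myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
  console.log(`polling run #${count}`)
  if (count >= 7) stopPolling()
})
```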

assets/en/crawler-result.png (148 KB)

assets/en/crawler.png (28.3 KB)

docs/cn.md (+6 -12)

@@ -2,30 +2,24 @@
 
 [English](https://github.com/coder-hxl/x-crawl#x-crawl) | 简体中文
 
-x-crawl is a flexible nodejs crawler library. It can crawl pages and control pages, make batch network requests, batch-download file resources, and so on. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you like it, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it; your Star will be the motivation for my updates.
 
 ## Features
 
 - Support crawling data asynchronously/synchronously.
 - Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
-- Flexible crawl interval; it is up to you to use or avoid high-concurrency crawling.
-- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, and so on.
-- Has a polling function to crawl data at regular intervals.
-- Built-in puppeteer crawls the page, with the jsdom library used to parse the page content; parsing it yourself is also supported.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
 - Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
 ## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl SPAs (single-page applications) and generate pre-rendered content (i.e. "SSR" (server-side rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents

package.json (+1 -1)

@@ -1,7 +1,7 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "3.2.9",
+  "version": "3.2.10",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",

publish/README.md (+19 -30)

@@ -2,30 +2,24 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. You can crawl pages and control operations such as pages, batch network requests, and batch downloads of file resources. Support asynchronous/synchronous mode crawling data. Running on nodejs, the usage is flexible and simple, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you feel good, you can give [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it, your Star will be the motivation for my update.
 
 ## Features
 
 - Support asynchronous/synchronous way to crawl data.
-- Flexible writing, support a variety of ways to write request configuration and obtain crawl results.
-- Flexible crawling interval, up to you to use/avoid high concurrent crawling.
-- With simple configuration, operations such as crawling pages, batch network requests, and batch download of file resources can be performed.
-- Possess polling function to crawl data regularly.
-- The built-in puppeteer crawls the page, and uses the jsdom library to analyze the content of the page, and also supports self-analysis.
-- Capture the success and failure of the climb and highlight the reminder.
+- Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
+- Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
-## Relationship with puppeteer
+## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents
 
@@ -91,7 +85,7 @@ npm install x-crawl
 
 ## Example
 
-Regular crawling: Get the recommended pictures of the youtube homepage every other day as an example:
+Timed crawling: take automatically crawling the cover images of Airbnb Plus listings every day as an example:
 
 ```js
 // 1.Import module ES/CJS
 
@@ -105,23 +99,18 @@ const myXCrawl = xCrawl({
 
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
-myXCrawl.startPolling({ d: 1 }, () => {
-  // Call crawlPage API to crawl Page
-  myXCrawl.crawlPage('https://www.youtube.com/').then((res) => {
-    const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
+myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
+  myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes').then((res) => {
+    const { jsdom } = res // By default, the JSDOM library is used to parse Page
 
-    // Get the cover image element of the Promoted Video
-    const imgEls = jsdom.window.document.querySelectorAll(
-      '.yt-core-image--fill-parent-width'
-    )
+    // Get the cover image elements for Plus listings
+    const imgEls = jsdom.window.document
+      .querySelector('.a1stauiv')
+      ?.querySelectorAll('picture img')
 
     // set request configuration
-    const requestConfig = []
-    imgEls.forEach((item) => {
-      if (item.src) {
-        requestConfig.push(item.src)
-      }
-    })
+    const requestConfig: string[] = []
+    imgEls?.forEach((item) => requestConfig.push(item.src))
 
     // Call the crawlFile API to crawl pictures
     myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })

publish/package.json (+1 -1)

@@ -1,6 +1,6 @@
 {
   "name": "x-crawl",
-  "version": "3.2.9",
+  "version": "3.2.10",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
