Commit e9f4dba

Update: Docs

1 parent 1cb1835

7 files changed: +46 -74 lines changed

README.md (+19 -30)

@@ -2,30 +2,24 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. You can crawl pages and control operations such as pages, batch network requests, and batch downloads of file resources. Support asynchronous/synchronous mode crawling data. Running on nodejs, the usage is flexible and simple, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you feel good, you can give [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it, your Star will be the motivation for my update.
 
 ## Features
 
 - Support asynchronous/synchronous way to crawl data.
-- Flexible writing, support a variety of ways to write request configuration and obtain crawl results.
-- Flexible crawling interval, up to you to use/avoid high concurrent crawling.
-- With simple configuration, operations such as crawling pages, batch network requests, and batch download of file resources can be performed.
-- Possess polling function to crawl data regularly.
-- The built-in puppeteer crawls the page, and uses the jsdom library to analyze the content of the page, and also supports self-analysis.
-- Capture the success and failure of the climb and highlight the reminder.
+- Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
+- Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
-## Relationship with puppeteer
+## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents
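Given the updated wording that crawlPage exposes the Browser and Page instances, a minimal usage sketch follows; `browser` and `jsdom` appear in this commit's own snippets, while the `page` field and the `timeout` option are assumptions:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 }) // `timeout` is an assumed option

myXCrawl.crawlPage('https://example.com').then(async (res) => {
  // `browser` and `jsdom` are confirmed by this diff; `page` is assumed
  // from the "exposes Browser and Page instances" wording.
  const { browser, page, jsdom } = res

  // Parse the rendered document with jsdom (the default, per the example below)
  console.log(jsdom.window.document.title)

  // Use the underlying puppeteer instances directly
  await page.screenshot({ path: './example.png' })
  await browser.close()
})
```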

@@ -91,7 +85,7 @@ npm install x-crawl
 
 ## Example
 
-Regular crawling: Get the recommended pictures of the youtube homepage every other day as an example:
+Timed crawling: take automatically crawling the cover images of Airbnb Plus listings every day as an example:
 
 ```js
 // 1.Import module ES/CJS
@@ -105,23 +99,18 @@ const myXCrawl = xCrawl({
 
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
-myXCrawl.startPolling({ d: 1 }, () => {
-  // Call crawlPage API to crawl Page
-  myXCrawl.crawlPage('https://www.youtube.com/').then((res) => {
-    const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
+myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
+  myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes').then((res) => {
+    const { jsdom } = res // By default, the JSDOM library is used to parse Page
 
-    // Get the cover image element of the Promoted Video
-    const imgEls = jsdom.window.document.querySelectorAll(
-      '.yt-core-image--fill-parent-width'
-    )
+    // Get the cover image elements for Plus listings
+    const imgEls = jsdom.window.document
+      .querySelector('.a1stauiv')
+      ?.querySelectorAll('picture img')
 
     // set request configuration
-    const requestConfig = []
-    imgEls.forEach((item) => {
-      if (item.src) {
-        requestConfig.push(item.src)
-      }
-    })
+    const requestConfig: string[] = []
+    imgEls?.forEach((item) => requestConfig.push(item.src))
 
     // Call the crawlFile API to crawl pictures
     myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
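The new callback signature `(count, stopPolling)` implies the poll can be stopped from inside the callback, and the updated features list names no/fixed/random crawl intervals. A minimal sketch under those assumptions; `intervalTime` with `{ max, min }` is assumed rather than confirmed by this diff:

```js
import xCrawl from 'x-crawl'

// Assumed option: a random 1-3 s pause between batch requests
const myXCrawl = xCrawl({ intervalTime: { max: 3000, min: 1000 } })

// Poll once a day and stop after the seventh run
myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
  console.log(`polling run #${count}`)
  if (count >= 7) stopPolling()
})
```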

assets/en/crawler-result.png (148 KB)

assets/en/crawler.png (28.3 KB)

docs/cn.md (+6 -12)

@@ -2,30 +2,24 @@
 
 [English](https://github.com/coder-hxl/x-crawl#x-crawl) | 简体中文
 
-x-crawl is a flexible nodejs crawler library. It can crawl pages and control pages, make batch network requests, batch-download file resources, and so on. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you like it, you can give the [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it; your Star will be the motivation for my updates.
 
 ## Features
 
 - Support crawling data asynchronously/synchronously.
 - Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
-- Flexible crawl interval; it is up to you to use or avoid high-concurrency crawling.
-- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, and so on.
-- Has a polling function to crawl data at regular intervals.
-- Built-in puppeteer crawls the page, with the jsdom library used to parse the page content; parsing it yourself is also supported.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
 - Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
 ## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl SPAs (single-page applications) and generate pre-rendered content (i.e. "SSR" (server-side rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents

package.json (+1 -1)

@@ -1,7 +1,7 @@
 {
   "private": true,
   "name": "x-crawl",
-  "version": "3.2.9",
+  "version": "3.2.10",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",

publish/README.md (+19 -30)

@@ -2,30 +2,24 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-x-crawl is a flexible nodejs crawler library. You can crawl pages and control operations such as pages, batch network requests, and batch downloads of file resources. Support asynchronous/synchronous mode crawling data. Running on nodejs, the usage is flexible and simple, friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, control pages, make batch network requests, batch-download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous/synchronous mode. Running on nodejs, it is flexible and simple to use, and friendly to JS/TS developers.
 
 > If you feel good, you can give [x-crawl repository](https://github.com/coder-hxl/x-crawl) a Star to support it, your Star will be the motivation for my update.
 
 ## Features
 
 - Support asynchronous/synchronous way to crawl data.
-- Flexible writing, support a variety of ways to write request configuration and obtain crawl results.
-- Flexible crawling interval, up to you to use/avoid high concurrent crawling.
-- With simple configuration, operations such as crawling pages, batch network requests, and batch download of file resources can be performed.
-- Possess polling function to crawl data regularly.
-- The built-in puppeteer crawls the page, and uses the jsdom library to analyze the content of the page, and also supports self-analysis.
-- Capture the success and failure of the climb and highlight the reminder.
+- Flexible writing, supporting multiple ways to write request configurations and obtain crawl results.
+- Flexible crawl interval (no interval/fixed interval/random interval); it is up to you to use or avoid high-concurrency crawling.
+- With simple configuration you can crawl pages, make batch network requests, batch-download file resources, poll and crawl, and more.
+- Crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR" (server-side rendering)), parse the content with the jsdom library, or parse it yourself.
+- Form submission, keyboard input, event actions, screenshots of the generated page, and more.
+- Capture and record crawl successes and failures, with highlighted reminders.
 - Written in TypeScript, has types, provides generics.
 
-## Relationship with puppeteer
+## Relationship with puppeteer
 
-The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages.
-
-The return value of the crawlPage API will be able to do the following:
-
-- Generate screenshots and PDFs of pages.
-- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
-- Automate form submission, UI testing, keyboard input, etc.
+The crawlPage API internally uses the [puppeteer](https://github.com/puppeteer/puppeteer) library to help us crawl pages, and exposes the Browser and Page instances, making it more flexible.
 
 # Table of Contents
 
@@ -91,7 +85,7 @@ npm install x-crawl
 
 ## Example
 
-Regular crawling: Get the recommended pictures of the youtube homepage every other day as an example:
+Timed crawling: take automatically crawling the cover images of Airbnb Plus listings every day as an example:
 
 ```js
 // 1.Import module ES/CJS
 
@@ -105,23 +99,18 @@ const myXCrawl = xCrawl({
 
 // 3.Set the crawling task
 // Call the startPolling API to start the polling function, and the callback function will be called every other day
-myXCrawl.startPolling({ d: 1 }, () => {
-  // Call crawlPage API to crawl Page
-  myXCrawl.crawlPage('https://www.youtube.com/').then((res) => {
-    const { browser, jsdom } = res // By default, the JSDOM library is used to parse Page
+myXCrawl.startPolling({ d: 1 }, (count, stopPolling) => {
+  myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes').then((res) => {
+    const { jsdom } = res // By default, the JSDOM library is used to parse Page
 
-    // Get the cover image element of the Promoted Video
-    const imgEls = jsdom.window.document.querySelectorAll(
-      '.yt-core-image--fill-parent-width'
-    )
+    // Get the cover image elements for Plus listings
+    const imgEls = jsdom.window.document
+      .querySelector('.a1stauiv')
+      ?.querySelectorAll('picture img')
 
     // set request configuration
-    const requestConfig = []
-    imgEls.forEach((item) => {
-      if (item.src) {
-        requestConfig.push(item.src)
-      }
-    })
+    const requestConfig: string[] = []
+    imgEls?.forEach((item) => requestConfig.push(item.src))
 
     // Call the crawlFile API to crawl pictures
     myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })

publish/package.json (+1 -1)

@@ -1,6 +1,6 @@
 {
   "name": "x-crawl",
-  "version": "3.2.9",
+  "version": "3.2.10",
   "author": "coderHXL",
   "description": "x-crawl is a flexible nodejs crawler library.",
   "license": "MIT",
