@@ -2,7 +2,7 @@
 
 English | [简体中文](https://github.com/coder-hxl/x-crawl/blob/main/docs/cn.md)
 
-X-Crawl is a flexible Nodejs reptile bank. Used to crawl pages, batch network requests, and download file resources in batches. There are 5 kinds of RequestConfig writing, 3 ways to obtain results, and crawl data asynchronous or synchronized mode. Run on Nodejs and be friendly to JS/TS developers.
+x-crawl is a flexible nodejs crawler library. It can crawl pages, batch network requests, and batch download file resources. It crawls data in asynchronous or synchronous mode, offers 3 ways to get results, and supports 5 ways to write requestConfig. It runs on nodejs and is friendly to JS/TS developers.
 
 If you feel good, you can support [x-crawl repository](https://github.com/coder-hxl/x-crawl) with a Star.
 
@@ -37,9 +37,9 @@ We can do the following:
   + [Choose crawling mode](#Choose-crawling-mode)
   + [Multiple crawler application instances](#Multiple-crawler-application-instances)
 * [Crawl page](#Crawl-page)
-  + [jsdom](#jsdom)
-  + [browser](#browser)
-  + [page](#page)
+  + [jsdom instance](#jsdom-instance)
+  + [browser instance](#browser-instance)
+  + [page instance](#page-instance)
 * [Crawl interface](#Crawl-interface)
 * [Crawl files](#Crawl-files)
 * [Start polling](#Start-polling)
@@ -212,19 +212,39 @@ myXCrawl.crawlPage('https://xxx.com').then(res => {
 })
 ```
 
-#### jsdom
+#### jsdom instance
 
 Refer to [jsdom](https://github.com/jsdom/jsdom) for specific usage.
 
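For instance, a minimal sketch of using that instance, assuming (per this section) that the crawlPage result exposes a jsdom instance alongside the page; the URL is the placeholder used throughout these docs:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://xxx.com').then((res) => {
  // Assumption: res carries a jsdom instance of the crawled page.
  const { jsdom } = res

  // Query the parsed document through the standard jsdom API.
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})
```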
-#### browser
+#### browser instance
 
-**Purpose of calling close: **browser will keep running, so the file will not be terminated. Do not call [crawlPage](#crawlPage) or [page](#page) if you need to use it later. When you modify the properties of the browser object, it will affect the browser inside the crawlPage of the crawler instance, the returned page, and the browser, because the browser is shared within the crawlPage API of the crawler instance.
+The browser instance is a headless browser without a UI shell. What it does is bring **all modern web platform features** provided by the browser rendering engine to your code.
+
+**Purpose of calling close:** the browser instance keeps running internally, which prevents the process from exiting. Do not call close if you still need to use [crawlPage](#crawlPage) or the [page](#page) instance later. When you modify the properties of the browser instance, it affects the browser instance inside the crawlPage API of the crawler instance and the page instances returned as results, because the browser instance is shared within the crawlPage API of the same crawler instance.
 
 Refer to [browser](https://pptr.dev/api/puppeteer.browser) for specific usage.
 
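The close behavior described above can be sketched as follows; a minimal example assuming the placeholder URL from these docs and that the crawlPage result exposes the shared browser instance alongside the page:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
  // Assumption: res exposes both the page and the shared browser instance.
  const { page, browser } = res

  // ... use the page instance here ...

  // Close the shared browser instance only once nothing else needs
  // crawlPage or page; otherwise it keeps running and the process
  // will not exit.
  await browser.close()
})
```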
-#### page
+#### page instance
+
+**Take Screenshot**
+
+```js
+import xCrawl from 'x-crawl'
+
+const testXCrawl = xCrawl({ timeout: 10000 })
+
+testXCrawl
+  .crawlPage('https://xxx.com')
+  .then(async (res) => {
+    const { page } = res
+
+    await page.screenshot({ path: './upload/page.png' })
+
+    console.log('Screen capture is complete')
+  })
+```
 
-The page attribute can be used for interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
+The page instance can also perform interactive operations such as events. For details, refer to [page](https://pptr.dev/api/puppeteer.page).
 
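As a sketch of such interactions, assuming the page instance exposes the standard puppeteer Page API; the URL and the CSS selectors below are hypothetical placeholders, not part of the x-crawl docs:

```js
import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ timeout: 10000 })

myXCrawl.crawlPage('https://xxx.com').then(async (res) => {
  const { page } = res

  // Hypothetical selectors; substitute ones that exist on the target page.
  await page.type('#search-input', 'x-crawl') // type into an input field
  await page.click('#search-button') // trigger a click event
  await page.waitForSelector('#results') // wait for the resulting content
})
```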
 ### Crawl interface
 