Commit 19707d6

l2ysho and claude committed
Merge remote-tracking branch 'origin/3409-align-enqueuelinksoptions-with-crawlee-python'
Reconcile the local v4 merge with the rebased remote feature branch. The remote branch is based on an older v4 (IRequestList / RequestProvider / SitemapRequestList), while the local branch carries the latest v4 tip (IRequestLoader / IRequestManager / SitemapRequestLoader). Resolved all conflicts in favor of the newer v4 API, since the include/exclude feature logic is identical on both sides:

- enqueue_links.ts, click-elements.ts (pw/pptr): keep IRequestManager import
- sitemap_request_loader.ts: keep SitemapRequestLoader rename + IRequestLoader
- sitemap_request_loader.test.ts: keep SitemapRequestLoader / new method names

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 parents 43e932f + 2b65dc4 commit 19707d6

25 files changed

Lines changed: 29 additions & 25 deletions


docs/examples/crawl_some_links.mdx

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
 import ApiLink from '@site/src/components/ApiLink';
 import CrawlSource from '!!raw-loader!roa-loader!./crawl_some_links.ts';

-This <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> example uses the <ApiLink to="core/interface/EnqueueLinksOptions#globs">`globs`</ApiLink> property in the <ApiLink to="cheerio-crawler/interface/CheerioCrawlingContext#enqueueLinks">`enqueueLinks()`</ApiLink> method to only add links to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink> queue if they match the specified pattern.
+This <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> example uses the <ApiLink to="core/interface/EnqueueLinksOptions#include">`include`</ApiLink> property in the <ApiLink to="cheerio-crawler/interface/CheerioCrawlingContext#enqueueLinks">`enqueueLinks()`</ApiLink> method to only add links to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink> queue if they match the specified pattern.

 <RunnableCodeBlock className="language-js" type="cheerio">
 {CrawlSource}

docs/introduction/03-adding-urls.mdx

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ await enqueueLinks({

 ### Filter URLs with patterns

-For even more control, you can use `globs`, `regexps` and `pseudoUrls` to filter the URLs. Each of those arguments is always an `Array`, but the contents can take on many forms. <ApiLink to="core/interface/EnqueueLinksOptions">See the reference</ApiLink> for more information about them as well as other options.
+For even more control, you can use `include` and `exclude` to filter the URLs. Each accepts an `Array` of glob pattern strings, `{ glob: string }` objects, `RegExp` instances, or `{ regexp: RegExp }` objects. <ApiLink to="core/interface/EnqueueLinksOptions">See the reference</ApiLink> for more information about them as well as other options.

 :::caution Defaults override
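
The four pattern forms named in the updated paragraph above (glob strings, `{ glob }` objects, `RegExp` instances, and `{ regexp }` objects) can be illustrated with a small standalone sketch. This is not Crawlee's implementation: `UrlPattern`, `globToRegExp`, `toRegExp`, and `filterUrls` are hypothetical names, the glob conversion only understands `*` and `**`, and the rules that an empty `include` accepts everything and that `exclude` wins over `include` are assumptions made for the illustration.

```typescript
// Hypothetical union mirroring the four pattern forms listed in the docs change.
type UrlPattern = string | { glob: string } | RegExp | { regexp: RegExp };

// Simplified glob conversion (assumption: only `*` and `**` are supported;
// a real glob matcher handles much more syntax).
function globToRegExp(glob: string): RegExp {
    // Escape regex metacharacters, but leave `*` alone for the next step.
    const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    const source = escaped
        .replace(/\*\*/g, '\u0000') // placeholder so `**` is not treated as two `*`
        .replace(/\*/g, '[^/]*') // `*` matches within a single path segment
        .replace(/\u0000/g, '.*'); // `**` matches across path segments
    return new RegExp(`^${source}$`);
}

// Normalize any of the four pattern forms into a RegExp.
function toRegExp(pattern: UrlPattern): RegExp {
    if (typeof pattern === 'string') return globToRegExp(pattern);
    if (pattern instanceof RegExp) return pattern;
    if ('glob' in pattern) return globToRegExp(pattern.glob);
    return pattern.regexp;
}

// Keep URLs that match at least one `include` pattern (or all URLs when
// `include` is empty) and match no `exclude` pattern. Both rules are assumptions.
function filterUrls(
    urls: string[],
    options: { include?: UrlPattern[]; exclude?: UrlPattern[] },
): string[] {
    const include = (options.include ?? []).map(toRegExp);
    const exclude = (options.exclude ?? []).map(toRegExp);
    const matches = (url: string, res: RegExp[]) => res.some((re) => url.search(re) !== -1);
    return urls.filter((url) => {
        const included = include.length === 0 || matches(url, include);
        return included && !matches(url, exclude);
    });
}

const kept = filterUrls(
    [
        'https://crawlee.dev/js/docs/introduction',
        'https://crawlee.dev/js/docs/examples/crawl-some-links',
        'https://crawlee.dev/js/api/core',
        'https://crawlee.dev/blog',
    ],
    {
        include: ['https://crawlee.dev/js/docs/**', { regexp: /\/blog$/ }],
        exclude: [{ glob: '**/examples/*' }],
    },
);
console.log(kept);
// [ 'https://crawlee.dev/js/docs/introduction', 'https://crawlee.dev/blog' ]
```

Normalizing every pattern to a `RegExp` up front keeps the filter itself to two `some()` checks. For the actual matching behavior of `enqueueLinks()`, the `EnqueueLinksOptions` reference linked above is the authority.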

test/e2e/adaptive-playwright-default/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ const crawler = new AdaptivePlaywrightCrawler({
         await context.pushData({ url, heading, requestHandlerMode });

         await context.enqueueLinks({
-            globs: ['**/next/examples/*'],
+            include: ['**/next/examples/*'],
         });
     },
 });

test/e2e/adaptive-playwright-robots-file/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ crawler.router.addDefaultHandler(async ({ log, request, enqueueLinks, pushData }
     log.info(`Processing ${request.loadedUrl}`);
     await enqueueLinks({
         // '/cart' is disallowed by robots.txt
-        globs: ['**/cart', '**/collections/*'],
+        include: ['**/cart', '**/collections/*'],
     });
     await pushData({ url: request.url, loadedUrl: request.loadedUrl });
 });

test/e2e/cheerio-default-ts/actor/main.ts

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ const crawler = new CheerioCrawler();
 crawler.router.addDefaultHandler(async ({ $, enqueueLinks, request, log }) => {
     const { url } = request;
     await enqueueLinks({
-        globs: ['https://crawlee.dev/js/docs/**'],
+        include: ['https://crawlee.dev/js/docs/**'],
     });

     const pageTitle = $('title').first().text();

test/e2e/cheerio-default/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ await Actor.main(async () => {
         async requestHandler({ $, enqueueLinks, request, log }) {
             const { url } = request;
             await enqueueLinks({
-                globs: ['https://crawlee.dev/js/docs/**'],
+                include: ['https://crawlee.dev/js/docs/**'],
             });

             const pageTitle = $('title').first().text();

test/e2e/cheerio-enqueue-links-base/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ await Actor.main(async () => {
             await Dataset.pushData({ url, loadedUrl, pageTitle });

             await enqueueLinks({
-                globs: [
+                include: [
                     'https://www.jamesallen.com/about-us/**',
                     'https://www.jamesallen.com/terms-of-use/**',
                     'https://www.jamesallen.com/guarantee/**',

test/e2e/cheerio-ignore-ssl-errors/actor/main.js

Lines changed: 2 additions & 1 deletion
@@ -21,7 +21,8 @@ await Actor.main(async () => {
             if (label === 'START') {
                 log.info('Bad ssl page opened!');
                 await enqueueLinks({
-                    globs: [{ glob: 'https://*.badssl.com/', userData: { label: 'DETAIL' } }],
+                    include: ['https://*.badssl.com/'],
+                    label: 'DETAIL',
                     selector: '.group a.bad',
                 });
             } else if (label === 'DETAIL') {

test/e2e/cheerio-page-info/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ const router = createCheerioRouter();
 router.addHandler('START', async ({ enqueueLinks }) => {
     await enqueueLinks({
         label: 'DETAIL',
-        globs: ['**/examples/accept-user-input'],
+        include: ['**/examples/accept-user-input'],
     });
 });

test/e2e/cheerio-request-queue-v2/actor/main.js

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ await Actor.main(async () => {
         async requestHandler({ $, enqueueLinks, request, log }) {
             const { url } = request;
             await enqueueLinks({
-                globs: ['https://crawlee.dev/js/docs/**'],
+                include: ['https://crawlee.dev/js/docs/**'],
             });

             const pageTitle = $('title').first().text();
