Commit 0dcc856

l2ysho and claude committed
docs: align enqueueLinks upgrade guide with unified include/exclude API
Update the v4 upgrade guide and sitemap example to reflect the globs/regexps/pseudoUrls -> include collapse, removal of PseudoUrl and per-pattern request options, and corrected transformRequestFunction precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 19707d6 commit 0dcc856

2 files changed

Lines changed: 27 additions & 4 deletions

docs/guides/request_loaders_sitemap_basic.ts

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ import { SitemapRequestLoader } from 'crawlee';
 const sitemapRequestLoader = await SitemapRequestLoader.open({
     sitemapUrls: ['https://crawlee.dev/sitemap.xml'],
     // Optionally filter the URLs read from the sitemap:
-    // globs: ['https://crawlee.dev/docs/**'],
+    // include: ['https://crawlee.dev/docs/**'],
 });
 
 for await (const request of sitemapRequestLoader) {

docs/upgrading/upgrading_v4.md

Lines changed: 26 additions & 3 deletions
@@ -897,14 +897,37 @@ await enqueueLinks({ urls, requestQueue });
 await enqueueLinks({ urls, requestManager });
 ```
 
+### `globs`, `regexps`, and `pseudoUrls` replaced by `include`
+
+To align with the Crawlee for Python API, the separate `globs`, `regexps`, and `pseudoUrls` URL-filtering options of `enqueueLinks()`, the click-elements enqueue helpers, and `SitemapRequestLoader` have been collapsed into a single `include` option (mirroring the already-unified `exclude` option). Each entry of `include`/`exclude` can be a glob string, a `RegExp`, or a `{ glob }` / `{ regexp }` object.
+
+The `PseudoUrl` class is no longer exported and the `@apify/pseudo_url` dependency has been dropped. Rewrite any pseudo-URL patterns as globs or regular expressions.
+
+Per-pattern request options (`label`, `userData`, `method`, `payload`, `headers` set directly on a pattern object) are no longer supported. Use the top-level `label` / `userData` options, or `transformRequestFunction`, to set request options for the enqueued requests.
+
+**Before:**
+```typescript
+await enqueueLinks({
+    globs: ['https://crawlee.dev/docs/**'],
+    regexps: [/\/blog\//],
+    pseudoUrls: ['https://crawlee.dev/[.*]'],
+});
+```
+
+**After:**
+```typescript
+await enqueueLinks({
+    include: ['https://crawlee.dev/docs/**', /\/blog\//, 'https://crawlee.dev/**'],
+});
+```
+
 ## `transformRequestFunction` precedence in `enqueueLinks`
 
-The `transformRequestFunction` callback in `enqueueLinks` now runs **after** URL pattern filtering (`globs`, `regexps`, `pseudoUrls`) instead of before. This means it has the highest priority and can overwrite any request options set by patterns or the global `label` option.
+The `transformRequestFunction` callback in `enqueueLinks` now runs **after** URL pattern filtering (`include`, `exclude`) instead of before. This means it has the highest priority and can overwrite any request options set by the global `label` / `userData` options.
 
 The priority order is now (lowest to highest):
 1. Global `label` / `userData` options
-2. Pattern-specific options from `globs`, `regexps`, or `pseudoUrls` objects
-3. `transformRequestFunction`
+2. `transformRequestFunction`
 
 The `transformRequestFunction` callback receives a `RequestOptions` object and can return either:
 - The modified `RequestOptions` object
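
Note on the pseudo-URL rewrite: the guide asks readers to rewrite pseudo-URL patterns as globs or regular expressions but does not show a mechanical way to do it. The sketch below is one possible approach, not part of Crawlee. `pseudoUrlToRegExp` is a hypothetical helper, and it assumes the usual pseudo-URL semantics: text inside top-level `[...]` is a regular expression, everything else is literal, and the whole URL is matched case-insensitively.

```typescript
// Hypothetical helper (not a Crawlee API): converts a pseudo-URL string into a RegExp.
// Assumption: `[...]` at the top level wraps a regex fragment, the rest is literal text.
function pseudoUrlToRegExp(purl: string): RegExp {
    const source = purl.trim();
    let pattern = '';
    let depth = 0;

    for (let i = 0; i < source.length; i++) {
        const ch = source[i];
        if (ch === '[') {
            // The outermost '[' opens a regex directive; nested ones belong to the regex itself.
            pattern += depth === 0 ? '(' : ch;
            depth++;
        } else if (ch === ']' && depth > 0) {
            depth--;
            pattern += depth === 0 ? ')' : ch;
        } else if (depth > 0) {
            // Inside a directive: copy the regex fragment unchanged.
            pattern += ch;
        } else {
            // Outside a directive: escape regex metacharacters so the text matches literally.
            pattern += ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        }
    }

    // Assumption: pseudo-URLs match the whole URL, ignoring case.
    return new RegExp(`^${pattern}$`, 'i');
}

const docsPattern = pseudoUrlToRegExp('https://crawlee.dev/[.*]');
console.log(docsPattern.test('https://crawlee.dev/docs/intro')); // true
console.log(docsPattern.test('https://example.com/'));           // false
```

The resulting `RegExp` could then be passed as an `include` entry, since the guide states that `include` accepts `RegExp` values. For simple patterns such as the one above, a glob like `https://crawlee.dev/**` (as used in the guide's "After" example) is likely easier to read.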

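Note on the new precedence: the guide describes the order as filtering first, then global `label` / `userData`, then `transformRequestFunction`. The following is a simplified, self-contained model of that order, meant only to illustrate how a former per-pattern `label` could move into `transformRequestFunction`. It does not call Crawlee; `RequestOptionsSketch` and `buildRequests` are hypothetical names, and the model only covers the case where the callback returns a modified options object.

```typescript
// Simplified stand-in for Crawlee's RequestOptions; only the fields used in this sketch.
interface RequestOptionsSketch {
    url: string;
    label?: string;
    userData?: Record<string, unknown>;
}

type TransformFn = (options: RequestOptionsSketch) => RequestOptionsSketch;

// Hypothetical model of the documented order:
// 1. filter URLs by the include patterns, 2. apply global options, 3. run the transform last.
function buildRequests(
    urls: string[],
    include: RegExp[],
    globals: { label?: string; userData?: Record<string, unknown> },
    transformRequestFunction?: TransformFn,
): RequestOptionsSketch[] {
    return urls
        .filter((url) => include.some((re) => re.test(url)))
        .map((url): RequestOptionsSketch => ({ url, ...globals }))
        .map((options) => (transformRequestFunction ? transformRequestFunction(options) : options));
}

const requests = buildRequests(
    ['https://crawlee.dev/docs/intro', 'https://crawlee.dev/blog/post', 'https://example.com/'],
    [/crawlee\.dev\/(docs|blog)\//],
    { label: 'DEFAULT' },
    // Replaces what used to be a per-pattern label on a blog pattern object.
    (options) => (options.url.includes('/blog/') ? { ...options, label: 'BLOG' } : options),
);

console.log(requests.map((r) => `${r.label} ${r.url}`));
```

In this model the blog URL ends up with the label `BLOG` because the transform runs last and overrides the global `DEFAULT`, while the URL on the other host is dropped by the filter before the transform ever sees it.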