fix(enqueueLinks)!: Default strategy now SameDomain #3095

Closed
SalvadorN323 wants to merge 17 commits into apify:master from axmanalad:switch-default-strat

SalvadorN323 commented Jul 24, 2025

Summary

This PR introduces a significant change to the default behavior of the enqueueLinks function, aligning it with more common crawling patterns and robustly handling common website redirections. The default enqueueStrategy has been changed from SameHostname to SameDomain.

Key Changes

  • Code Change (Breaking Change):
    • The enqueueLinks function's default strategy is now EnqueueStrategy.SameDomain when no globs, regexps, or pseudoUrls are explicitly provided.
  • Unit Test Updates:
    • Tests that previously expected SameHostname filtering by default now correctly assert SameDomain behavior (e.g., links across www. and apex domains are now included by default).
  • Documentation Updates:
    • Updated "Adding more URLs" Guide: The guide now explicitly states that enqueueLinks defaults to SameDomain (including subdomains).

Breaking Change Details

The default enqueueLinks strategy has shifted from EnqueueStrategy.SameHostname to EnqueueStrategy.SameDomain.

Before:

options.strategy ??= EnqueueStrategy.SameHostname

After:

options.strategy ??= EnqueueStrategy.SameDomain
  • Mitigation: To maintain the previous strict SameHostname behavior, users must now explicitly set strategy: EnqueueStrategy.SameHostname in their enqueueLinks options.
// If you want the old strict 'same-hostname' behavior after this update:
await enqueueLinks({
    strategy: 'same-hostname',
});
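
To illustrate the difference between the two strategies, here is a minimal standalone sketch. It is not Crawlee's actual implementation: the helper names (`naiveDomain`, `sameHostname`, `sameDomain`) are hypothetical, and the "last two labels" rule is a simplifying assumption that a real implementation would replace with a public-suffix lookup.

```typescript
// Naive guess at the registrable domain: the last two labels of the hostname.
// Assumption for illustration only; this is wrong for suffixes such as
// "co.uk", where a public-suffix list is needed.
function naiveDomain(hostname: string): string {
    return hostname.split('.').slice(-2).join('.');
}

// 'same-hostname' semantics: the link must have exactly the page's hostname.
function sameHostname(pageUrl: string, linkUrl: string): boolean {
    return new URL(pageUrl).hostname === new URL(linkUrl).hostname;
}

// 'same-domain' semantics: any subdomain of the same registrable domain passes.
function sameDomain(pageUrl: string, linkUrl: string): boolean {
    return naiveDomain(new URL(pageUrl).hostname) === naiveDomain(new URL(linkUrl).hostname);
}

// The crawler was redirected to the www subdomain, but the page links to the apex domain.
const page = 'https://www.example.com/';
const link = 'https://example.com/about';

const keptBySameHostname = sameHostname(page, link); // false: hostnames differ
const keptBySameDomain = sameDomain(page, link); // true: both resolve to example.com
```

Under the strict comparison the apex link is dropped after a www redirect, which is the situation reported in #2513; the looser comparison keeps it.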

Contributors:

Closes #2513

SalvadorN and others added 17 commits July 17, 2025 13:14
…me-domain

This change updates the default behavior of the enqueueLinks function to use EnqueueStrategy.SameDomain instead of EnqueueStrategy.SameHostname. This resolves common issues where links to apex domains (e.g. example.com) were filtered out when the crawler landed on a www subdomain (e.g. www.example.com), and vice versa, aligning with typical website structures and expected crawling scope.

Closes apify#2513

BREAKING CHANGE: The default enqueueLinks strategy has shifted from SameHostname to SameDomain. Users who relied on the previous strict SameHostname behavior (only matching the exact hostname of the current page) must now explicitly set strategy: EnqueueStrategy.SameHostname in their enqueueLinks options to maintain that strictness. This change will cause crawlers to explore a broader set of URLs by default (including all subdomains of the root domain).
Co-authored-by: Alexander Manalad <154474635+axmanalad@users.noreply.github.com>

pselvana commented Aug 7, 2025

Hi @barjin, the team has implemented your suggestion for #2513 in this PR. Could you tell us whether it meets your expectations and what the next steps might be, given that this is a breaking change and would need to wait for your release process to cut a 4.x-based main?

The team also implemented another variant with a new strategy, same-domain, in case a breaking change is unwanted: see #3098.

@pselvana

Hi @barjin, pinging you one more time in the hope that you have a few cycles to look over this PR and #3098 and provide some guidance or next steps. If you aren't the right person, we'd appreciate it if you could recommend who would be. The team working on this officially disbands tomorrow but would like to see this through if possible. Thank you for your consideration.

barjin (Member) commented Aug 26, 2025

Hello, and sorry for the delay in our response.

After discussing this further with the team, we decided we'd like to keep SameHostname as the default strategy for Crawlee v4. The main argument against switching to SameDomain is avoiding accidental crawler runaway: if the user starts with e.g. blog.apify.com, they likely don't want to crawl console.apify.com and other similar subdomains.

I'll close this PR now. Thank you for your contribution and your understanding!
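
The runaway concern can be made concrete with a small standalone sketch. This is not Crawlee code: `filterLinks` is a hypothetical helper, the link list and its paths are invented for illustration (only blog.apify.com and console.apify.com come from the comment above), and the root domain is again approximated as the last two hostname labels.

```typescript
type Strategy = 'same-hostname' | 'same-domain';

// Hypothetical helper: keep only the links the given strategy would allow
// when the crawl starts at startUrl.
function filterLinks(startUrl: string, links: string[], strategy: Strategy): string[] {
    const startHost = new URL(startUrl).hostname;
    // Assumption: root domain = last two labels (no public-suffix handling).
    const root = startHost.split('.').slice(-2).join('.');
    return links.filter((link) => {
        const host = new URL(link).hostname;
        if (strategy === 'same-hostname') return host === startHost;
        return host === root || host.endsWith(`.${root}`);
    });
}

const start = 'https://blog.apify.com/';
const found = [
    'https://blog.apify.com/post-1',
    'https://console.apify.com/actors',
    'https://docs.apify.com/platform',
    'https://example.org/',
];

// Only the blog link survives the strict strategy.
const strict = filterLinks(start, found, 'same-hostname');
// The blog, console, and docs links all survive the domain-wide strategy.
const broad = filterLinks(start, found, 'same-domain');
```

With the strict default the crawl stays on the blog; with a domain-wide default every sibling subdomain becomes reachable unless the user narrows the scope explicitly.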

barjin closed this Aug 26, 2025

Development

Successfully merging this pull request may close these issues.

Bug? PlaywrightCrawler enqueueLinks fails after WWW redirect.

4 participants