fix(enqueueLinks)!: Default strategy now SameDomain #3095
SalvadorN323 wants to merge 17 commits into apify:master
…ddress apify#2513 redirection bug.
…me-domain

This change updates the default behavior of the enqueueLinks function to use EnqueueStrategy.SameDomain instead of EnqueueStrategy.SameHostname. This resolves common issues where links to apex domains (e.g. example.com) were filtered out when the crawler landed on a www subdomain (e.g. www.example.com), and vice versa, aligning with typical website structures and expected crawling scope.

Closes apify#2513

BREAKING CHANGE: The default enqueueLinks strategy has shifted from SameHostname to SameDomain. Users who relied on the previous strict SameHostname behavior (only matching the exact hostname of the current page) must now explicitly set strategy: EnqueueStrategy.SameHostname in their enqueueLinks options to maintain that strictness. This change will cause crawlers to explore a broader set of URLs by default (including all subdomains of the root domain).
Co-authored-by: Alexander Manalad <154474635+axmanalad@users.noreply.github.com>
Hi @barjin - The team has implemented your suggestion for #2513 in this PR. Could you provide some guidance on whether this meets your expectations and on any potential next steps (considering that this is a breaking change and that we need to wait for your release process for cutting a 4.x based main)? The team also implemented another variant with a new strategy
Hi @barjin - pinging you one more time with the hope that you have a few cycles to look this and #3098 over to provide some guidance/next steps. If you aren't the right person, we'd appreciate if you could recommend who would be. The team working on this officially disbands tomorrow but would like to see this through if possible. Thank you for your consideration.
Hello, and sorry for the delay in our response. After discussing this further with the team, we decided we'd like to keep I'll close this PR now. Thank you for your contribution and your understanding!
Summary
This PR introduces a significant change to the default behavior of the enqueueLinks function, aligning it with more common crawling patterns and robustly handling common website redirections. The default enqueueStrategy has been changed from SameHostname to SameDomain.
Key Changes
- The enqueueLinks function's default strategy is now EnqueueStrategy.SameDomain when no globs, regexps, or pseudoUrls are explicitly provided.
- Tests that expected SameHostname filtering by default now correctly assert SameDomain behavior (e.g., links across www. and apex domains are now included by default).
- enqueueLinks defaults to SameDomain (including subdomains).
Breaking Change Details
The default enqueueLinks strategy has shifted from EnqueueStrategy.SameHostname to EnqueueStrategy.SameDomain.
Before:
After:
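For illustration only (this is not the PR's own before/after snippet), the practical difference between the two strategies can be sketched in a few self-contained lines. The helpers naiveDomain, sameHostname, and sameDomain are hypothetical names, not Crawlee APIs, and the "last two labels" rule is a simplifying assumption that breaks on suffixes such as co.uk; a real implementation would need a public-suffix list.

```typescript
// Hypothetical helpers for illustration only; these are not Crawlee APIs.

/**
 * Naive root domain: the last two labels of the hostname.
 * Assumption: wrong for suffixes like "co.uk"; real code needs a public-suffix list.
 */
function naiveDomain(hostname: string): string {
    return hostname.split('.').slice(-2).join('.');
}

/** SameHostname-style check: the hostnames must match exactly. */
function sameHostname(pageUrl: string, linkUrl: string): boolean {
    return new URL(pageUrl).hostname === new URL(linkUrl).hostname;
}

/** SameDomain-style check: any subdomain of the same root domain matches. */
function sameDomain(pageUrl: string, linkUrl: string): boolean {
    return naiveDomain(new URL(pageUrl).hostname) === naiveDomain(new URL(linkUrl).hostname);
}

// The crawler landed on the www subdomain and found these links.
const page = 'https://www.example.com/start';
const links = [
    'https://www.example.com/about',
    'https://example.com/pricing',
    'https://blog.example.com/post',
    'https://other.org/',
];

const keptByHostname = links.filter((link) => sameHostname(page, link));
const keptByDomain = links.filter((link) => sameDomain(page, link));

console.log(keptByHostname); // only the www.example.com link
console.log(keptByDomain); // the three example.com links, but not other.org
```

Under the exact-hostname check the apex link example.com/pricing is dropped, which is the situation described in #2513; the same-domain check keeps it, at the cost of also following blog.example.com.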
To retain the previous SameHostname behavior, users must now explicitly set strategy: EnqueueStrategy.SameHostname in their enqueueLinks options.
Contributors:
Closes #2513