
Conversation

@abhishekgarg18
Contributor

…da timeouts

This commit introduces batch processing for the crawl-based broken internal links detection to prevent AWS Lambda 15-minute timeout issues on sites with 100+ pages.

Key Changes:

  • Batch Processing: Split crawl detection into batches of 30 pages per Lambda invocation
  • S3 State Management: Store batch results and URL caches in S3 for persistence across invocations
  • SQS Continuation: Chain Lambda invocations via SQS messages for seamless batch processing
  • Timeout Handling: Detect and handle HTTP request timeouts gracefully (skip GET fallback on HEAD timeout)
  • Increased Capacity: Raise MAX_URLS_TO_PROCESS from 100 to 500 to support larger audits
  • 100% Test Coverage: Comprehensive test suite covering all batching logic and edge cases
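
The batching step above can be sketched as a simple chunking helper. This is a minimal illustration only; `BATCH_SIZE` and `splitIntoBatches` are assumed names, not the PR's actual identifiers:

```javascript
// Illustrative sketch: split the crawled page list into fixed-size batches
// of 30, one batch per Lambda invocation. Names are assumptions.
const BATCH_SIZE = 30;

function splitIntoBatches(urls, batchSize = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}
```

With the raised `MAX_URLS_TO_PROCESS` of 500, this yields at most 17 batches per audit.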

Architecture:

  • batch-state.js: S3-based state management utilities
  • crawl-detection.js: Batch processing logic with URL caching
  • handler.js: Batch orchestration with SQS self-looping
  • helpers.js: Enhanced timeout detection for link validation
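
To make the role of `batch-state.js` concrete, here is a hedged sketch of how per-batch results might be folded into the state object persisted in S3 between invocations. The field names (`nextBatchIndex`, `brokenLinks`, the cache shapes) are illustrative assumptions, not the PR's actual schema:

```javascript
// Illustrative state shape only; the real batch-state.js schema may differ.
function mergeBatchResult(state, batchResult) {
  return {
    ...state,
    nextBatchIndex: state.nextBatchIndex + 1,
    brokenLinks: [...state.brokenLinks, ...batchResult.brokenLinks],
    // Carry the URL caches forward so later batches skip links already checked
    brokenUrlCache: { ...state.brokenUrlCache, ...batchResult.brokenUrlCache },
    workingUrlCache: { ...state.workingUrlCache, ...batchResult.workingUrlCache },
  };
}

function hasMoreBatches(state) {
  return state.nextBatchIndex < state.totalBatches;
}
```

Keeping the merge pure like this makes the state logic trivially unit-testable, which is consistent with the 100% coverage claim.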

Testing:

  • Added batch-state.test.js for S3 state management tests
  • Added crawl-detection.test.js for batch processing tests
  • Updated handler.test.js and helpers.test.js for full coverage
  • All tests passing with 100% line and branch coverage

This implementation ensures audits complete successfully even for large sites (500+ pages) by processing them in manageable batches across multiple Lambda invocations, with each invocation staying well under the 15-minute timeout limit.
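
The SQS self-looping described above could look roughly like the following: when batches remain, the handler enqueues a continuation message for itself that points at the persisted state. All field names here are hypothetical, not the actual message schema in `handler.js`:

```javascript
// Hypothetical continuation message. The handler would send this to its own
// SQS queue when more batches remain, so the next Lambda invocation resumes
// from the state stored in S3 rather than starting over.
function buildContinuationMessage({ auditId, siteId, nextBatchIndex, stateKey }) {
  return {
    type: 'broken-internal-links-continuation',
    auditId,
    siteId,
    batchIndex: nextBatchIndex,
    // S3 key under which the accumulated batch results were persisted
    stateKey,
  };
}
```

Each invocation then only ever processes one batch before either enqueuing the next message or finalizing the audit, keeping every invocation far below the 15-minute limit.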

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity have been updated or added, please update the wiki for the same opportunity.

Related Issues

Thanks for contributing!

Abhishek Garg added 2 commits January 22, 2026 00:13
…da timeouts

@github-actions

This PR will trigger a minor release when merged.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Abhishek Garg added 2 commits January 22, 2026 07:54
…da timeouts

- Implements S3-based batch state management to handle large page sets
- Adds SQS self-looping for continuation across multiple Lambda invocations
- Processes 30 pages per batch to prevent 15-minute Lambda timeout
- Increases MAX_URLS_TO_PROCESS from 100 to 500
- Reuses broken/working URL caches across batches for efficiency
- Optimizes HTTP request timeouts (skips redundant GET after HEAD timeout)
- Enables scraping cache by default (removed allowCache: false)
- Adds deployment verification logs to confirm batching code is active
- Achieves 100% test coverage for all internal-links modules
…ecat-audit-worker into feat/crawl-detection-batching
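
The HEAD/GET optimization mentioned above (skipping the redundant GET after a HEAD timeout) can be illustrated like this. `isTimeoutError` and `checkUrl` are assumed names, and the real `helpers.js` logic may differ:

```javascript
// Illustrative sketch only. If the HEAD request times out, the server is
// unresponsive rather than merely rejecting HEAD, so the GET fallback is
// skipped. Error codes and names here are assumptions.
function isTimeoutError(err) {
  return err?.name === 'AbortError' || err?.code === 'ETIMEDOUT';
}

async function checkUrl(url, doRequest) {
  try {
    const head = await doRequest(url, 'HEAD');
    if (head.ok) return { url, broken: false };
  } catch (err) {
    // Skip the GET fallback on timeout: retrying would only burn batch time.
    if (isTimeoutError(err)) return { url, broken: true, reason: 'timeout' };
  }
  // HEAD failed for a non-timeout reason (e.g. 405 Method Not Allowed):
  // fall back to GET before declaring the link broken.
  try {
    const get = await doRequest(url, 'GET');
    return { url, broken: !get.ok };
  } catch {
    return { url, broken: true };
  }
}
```

Treating a timed-out URL as broken immediately is what keeps each 30-page batch's worst-case runtime bounded.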