
Conversation

@abhishekgarg18
Contributor

…da timeouts

This commit introduces batch processing for the crawl-based broken internal links detection to prevent AWS Lambda 15-minute timeout issues on sites with 100+ pages.

Key Changes:

  • Batch Processing: Split crawl detection into batches of 30 pages per Lambda invocation
  • S3 State Management: Store batch results and URL caches in S3 for persistence across invocations
  • SQS Continuation: Chain Lambda invocations via SQS messages for seamless batch processing
  • Timeout Handling: Detect and handle HTTP request timeouts gracefully (skip GET fallback on HEAD timeout)
  • Increased Capacity: Raise MAX_URLS_TO_PROCESS from 100 to 500 to support larger audits
  • 100% Test Coverage: Comprehensive test suite covering all batching logic and edge cases
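
The batching step above can be sketched as a simple chunking helper. This is a minimal illustration only; `BATCH_SIZE` and `splitIntoBatches` are assumed names, not the PR's actual identifiers:

```javascript
// Illustrative sketch: split the crawled page list into fixed-size batches
// of 30, one batch per Lambda invocation. Names are assumptions.
const BATCH_SIZE = 30;

function splitIntoBatches(urls, batchSize = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}
```

With the raised `MAX_URLS_TO_PROCESS` of 500, this yields at most 17 batches per audit.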

Architecture:

  • batch-state.js: S3-based state management utilities
  • crawl-detection.js: Batch processing logic with URL caching
  • handler.js: Batch orchestration with SQS self-looping
  • helpers.js: Enhanced timeout detection for link validation
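
To make the role of `batch-state.js` concrete, here is a hedged sketch of how per-batch results might be folded into the state object persisted in S3 between invocations. The field names (`nextBatchIndex`, `brokenLinks`, the cache shapes) are illustrative assumptions, not the PR's actual schema:

```javascript
// Illustrative state shape only; the real batch-state.js schema may differ.
function mergeBatchResult(state, batchResult) {
  return {
    ...state,
    nextBatchIndex: state.nextBatchIndex + 1,
    brokenLinks: [...state.brokenLinks, ...batchResult.brokenLinks],
    // Carry the URL caches forward so later batches skip links already checked
    brokenUrlCache: { ...state.brokenUrlCache, ...batchResult.brokenUrlCache },
    workingUrlCache: { ...state.workingUrlCache, ...batchResult.workingUrlCache },
  };
}

function hasMoreBatches(state) {
  return state.nextBatchIndex < state.totalBatches;
}
```

Keeping the merge pure like this makes the state logic trivially unit-testable, which is consistent with the 100% coverage claim.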

Testing:

  • Added batch-state.test.js for S3 state management tests
  • Added crawl-detection.test.js for batch processing tests
  • Updated handler.test.js and helpers.test.js for full coverage
  • All tests passing with 100% line and branch coverage

This implementation ensures audits complete successfully even for large sites (500+ pages) by processing them in manageable batches across multiple Lambda invocations, with each invocation staying well under the 15-minute timeout limit.
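
The SQS self-looping described above could look roughly like the following: when batches remain, the handler enqueues a continuation message for itself that points at the persisted state. All field names here are hypothetical, not the actual message schema in `handler.js`:

```javascript
// Hypothetical continuation message. The handler would send this to its own
// SQS queue when more batches remain, so the next Lambda invocation resumes
// from the state stored in S3 rather than starting over.
function buildContinuationMessage({ auditId, siteId, nextBatchIndex, stateKey }) {
  return {
    type: 'broken-internal-links-continuation',
    auditId,
    siteId,
    batchIndex: nextBatchIndex,
    // S3 key under which the accumulated batch results were persisted
    stateKey,
  };
}
```

Each invocation then only ever processes one batch before either enqueuing the next message or finalizing the audit, keeping every invocation far below the 15-minute limit.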

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity have been updated or added, please update the wiki for the same opportunity.

Related Issues

Thanks for contributing!

Abhishek Garg added 2 commits January 22, 2026 00:13
…da timeouts

@github-actions

This PR will trigger a minor release when merged.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Abhishek Garg added 2 commits January 22, 2026 07:54
…da timeouts

- Implements S3-based batch state management to handle large page sets
- Adds SQS self-looping for continuation across multiple Lambda invocations
- Processes 30 pages per batch to prevent 15-minute Lambda timeout
- Increases MAX_URLS_TO_PROCESS from 100 to 500
- Reuses broken/working URL caches across batches for efficiency
- Optimizes HTTP request timeouts (skips redundant GET after HEAD timeout)
- Enables scraping cache by default (removed allowCache: false)
- Adds deployment verification logs to confirm batching code is active
- Achieves 100% test coverage for all internal-links modules
…ecat-audit-worker into feat/crawl-detection-batching
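
The HEAD/GET optimization mentioned above (skipping the redundant GET after a HEAD timeout) can be illustrated like this. `isTimeoutError` and `checkUrl` are assumed names, and the real `helpers.js` logic may differ:

```javascript
// Illustrative sketch only. If the HEAD request times out, the server is
// unresponsive rather than merely rejecting HEAD, so the GET fallback is
// skipped. Error codes and names here are assumptions.
function isTimeoutError(err) {
  return err?.name === 'AbortError' || err?.code === 'ETIMEDOUT';
}

async function checkUrl(url, doRequest) {
  try {
    const head = await doRequest(url, 'HEAD');
    if (head.ok) return { url, broken: false };
  } catch (err) {
    // Skip the GET fallback on timeout: retrying would only burn batch time.
    if (isTimeoutError(err)) return { url, broken: true, reason: 'timeout' };
  }
  // HEAD failed for a non-timeout reason (e.g. 405 Method Not Allowed):
  // fall back to GET before declaring the link broken.
  try {
    const get = await doRequest(url, 'GET');
    return { url, broken: !get.ok };
  } catch {
    return { url, broken: true };
  }
}
```

Treating a timed-out URL as broken immediately is what keeps each 30-page batch's worst-case runtime bounded.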