feat: token bucket rate limiting #1094

stevenh · 2025-05-10T09:31:20Z

Summary

Replace the existing rate limiting mechanism with a token bucket algorithm to improve request handling efficiency and control.

The previous implementation used a simple last request and current delay calculation, which could lead to uneven request distribution when multiple requests were made in quick succession. The result of this was the more links discovered by as single request the more likely it was that we would trigger 429 and 503 response codes, which combined with max_retries would cause the crawler to fail to successfully process all pages.

The new implementation uses a token bucket algorithm, which allows for more flexible and efficient rate limiting. The token bucket algorithm allows for bursts of requests while maintaining an average rate over time. This combined with active rate limit header processing results in a much higher success rate for crawls.

In addition the option to disable max retry limit, the new default, has been tested to confirm 100% success rate for crawls on a site which performs active rate limiting and previously resulted in a large number of failures.

Finally the rate limiter starts with default rate which will be adjusted per domain based on the rate limit headers returned by the server.

How Has This Been Tested?

New set of tests added to tests/test_async_dispatcher.py

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Fix the handling of file downloads in AsyncPlaywrightCrawlerStrategy which wasn't waiting for the download to complete before returning, which resulted in race conditions and incomplete or missing downloads. Fix incorrect field use for headers and user_agent in AsyncPlaywrightCrawlerStrategy. Fix the type used for the logger in AsyncPlaywrightCrawlerStrategy to the base class Logger. This allows the logger to be used correctly and prevents type errors when using the logger. Add missing abstract methods to AsyncCrawlerStrategy and implement in its subclasses. Fix the extraction of the error message when processing console error log in log_console. Fix the handling of js_code in AsyncPlaywrightCrawlerStrategy to ensure that the code is executed correctly and the result is returned when the script starts with a variable or constant declaration. Remove unused time variables and commented out code to improve readability. Remove duplicate Exception handling. Fix the handling of raw requests in AsyncHTTPCrawlerStrategy which was previously using the parsed path corrupting the request. Fix the browser manager to ensure that it always returns a valid browser instance. This eliminates the class level playwright instance which wasn't being cleaned up on close, resulting in subsequent callers receiving a closed instance and causing errors. Fix the storage of links, media and metadata to ensure that the correct values are stored and returned. This prevents incorrect results when using the cached results. Use Field for default values in Media, Links and ScrapingResult pydantic models to prevent invalid results. Handle undefined string passed to MediaItem width. Add parameters to the CrawlResult.__init__ method to provide type hints for the fields, which provides better type checking. Fix database query retry handling for the case where the database table is missing when using the SQLite database. This prevents the crawler from failing when the table is not present and allows it to continue crawling. This prevents failing tests after one which drops the database table. Remove commit call for query's which don't need it. Sync table definition in legacy code so prevent mixed legacy and new code tests from conflicting with each other. Fix the caching of markdown field in DB / files which was only storing the single value, which caused failures when using cached results. Export the markdown field in StringCompatibleMarkdown, so we don't need to use a private field to ensure that the value is serialised correctly. Correctly initialise BrowserConfig and CrawlerRunConfig from kwargs to ensure legacy parameters work as expected.

Add relevant tests for the following: - markdown generator - http crawler strategy

Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.

Fix a large number of issues with the core including config serialisation and crawl result handling, which were all identified during work to ensure that all the existing tests pass consistently. This includes: ------- fix: config serialisation Fix config serialisation by creating a new Serialisable type and adding missing module imports for ScoringStats and Logger. This allows the config to be serialised and deserialised correctly. Add missing initialisation for ScoringStats. Add missing stats parameter to URLScorer and all its subclasses to ensure that the stats are serialisable. ------- fix: parameter definitions, type hints and defaults Fix parameter definitions by adding missing Optional hit to those which default to None. Add type hints to improve linting validation and IDE support. Set default values for parameters which were missing them, ensuring type safety and preventing runtime errors. ------- fix: various comment typos Fix various typos in comments and doc strings to improve clarity. ------- fix: BaseDispatcher missing abstract methods Add missing abstract methods to BaseDispatcher and implement in subclasses. ------- fix: crawl result handling Fix the handling of crawl results, which were using inconsistent types. This now uses CrawlResultContainer for all crawl results, unwrapping as needed when performing deep crawls. This moves CrawlResultContainer into models ensuring it can be imported where needed, avoiding circular imports. Refactor CrawlResultContainer to subclass CrawlResult to provide type hinting in the single result case and ensure consistent handling of both synchronous and asynchronous results. ------- feat: implement run_urls_stream for SemaphoreDispatcher Implement run_urls_stream for SemaphoreDispatcher to allow streaming of URLs to be crawled. ------- chore: translate non english comments Translate non english comments to english to improve readability and maintainability for non native speakers. ------- chore: additional example for arun Add examples for arun to demonstrate usage of streamed and batch processing modes, clarifying the impact of deep crawl on the results. ------- fix: handling of CrawlerRunConfig Fix the handling of CrawlerRunConfig to ensure that the config is correctly initialised from legacy kwargs. ------- fix: invalid screenshot argument to aprocess_html Fix failure caused by generating a screenshot due to unsupported argument to aprocess_html. ------- fix: structured content extraction Fix structured content extraction not being run when NoExtractionStrategy is used. This ensures that the content is extracted correctly and returned in the crawl result. ------- fix: aclear_cache Fix aclear_cache, previously it was just clean up, which just closed active connections. It now calls aclear_db to remove all entries from the cache table. ------- fix: unused imports Remove unused imports to improve readability and reduce clutter. ------- fix: undefined filter_conf Fix use of undefined filter_conf in crawl_cmd when the config is not passed in and neither markdown-fix nor md-fit are specified as output options. ------- fix: bs4 imports Fix bs4 imports to use the correct module name and ensure that the correct classes are imported. This addresses linter errors. ------- fix: BM25Okapi idf calculation Fix the idf calculation in BM25Okapi to use the correct formula and ensure that the idf is calculated correctly. This prevents missing results when using BM25Okapi caused by zero idf values. Removed commented out code to improve readability. Eliminate unnecessary tag_weight calculation when score is 0. ------- fix: relevant content filter Fix the extraction of headers from the response, previously this only handled h1 tags, this now handles all header tags. Fix boundary check for the relevant content filter to ensure that the content is not excluded when the end aligns with 150 character limit. ------- fix: return type for extract_text_chunks Fix the return type for extract_text_chunks to be include the right types and values, so consumers know what to expect. ------- fix: invalid markdown parameter to ScrapingResult Remove all references to markdown for ScrapingResult as the source never returns markdown, so its pointless to include it in the result. ------- fix: closest parent description Clean unnecessary white space the description returned by find_closest_parent_with_useful_text so that the two different strategies return consistent results. ------- fix: potential sources for element extraction Fix the potential sources for element extraction to ensure that the srcset and data-lazy-src are processed instead of srcssetdata-lazy-src due to missing comma. ------- fix: data validation in web scraping Add missing data validation, ensuring that the correct types and only set values are processed. ------- fix: missing json import for AmazonProductCrawler Add missing json import for AmazonProductCrawler. ------- fix: GoogleSearchCrawler checks and errors Correct the result field used to report the error message when arun fails. This ensures that the error message is correctly reported. Add missing check for result.js_execution_result before use. Add missing check for blank cleaned_html. Add validation of cleaned_html to ensure that the value is correctly converted from bytes to str. ------- fix: abstract method definitions Fix the abstract method definitions to ensure that the correct types are used for async generators and iterators. This lint errors for incompatible overrides by their implementers. This is cause be the type being adjusted when a yield is present. ------- fix: invalid use of infinity for int values Fix the use of infinity for int values flagged by the linter. Instead we use -1 to indicate that the value is not set. Fix use of unset url parameter, reported by linter. ------- chore: remove unneeded loop on batch results Remove unnecessary duplicate loop on batch results in _arun_batch to calculate _pages_crawled. ------- fix: invalid use of lambda to define async function Replace the use of lambda to define an async function with a normal function definition. ------- fix: validate use of deep_crawl_strategy before use Validate the type of deep_crawl_strategy before use to ensure that the correct type is used and preventing lint errors on methods calls using it. ------- fix: unprocessed FilterChain tasks Ensure that all tasks in the FilterChain are either processed and returned or cancelled if not needed, preventing runtime warnings. ------- feat: add support for transport to Crawl4aiDockerClient Add the ability to provide a custom transport to Crawl4aiDockerClient which allows easy testing. Set base_url on the httpx.AsyncClient to avoid need for local variable and calculations, simplifying the code. ------- fix: Crawl4aiDockerClient.crawl results on error Correctly handle the async data returned by the crawl method in Crawl4aiDockerClient when the stream results in a non 200 response. ------- fix: linter errors for lxml imports Fix linter errors for lxml imports by using importing the required methods directly. ------- fix: use of unset llm_config Fix the use of unset llm_config in generate_schema to ensure that we don't get a runtime error. ------- fix: missing meta field for hub crawlers Support both ways that meta data is defined by hub crawlers. ------- fix: correct return type for get_cached_url Correct the type and number of returned values for get_cached_url. ------- fix: DocsManager generation Fix the generation performed by DocsManager now docs have a hierarchical structure. ------- fix: linting errors in calculate_semaphore_count Ensure we set a default for the resource values returned by os methods. ------- fix: port handling in get_base_domain Maintain the port for non default ports in get_base_domain to ensure that the correct domain is returned. This prevents local links being incorrectly classified as external. ------- fix: get_title method type Add static method decorator to get_title method to ensure that the method is correctly identified as a static method and not an instance method. This prevents lint errors and ensures that the method is correctly called. ------- chore: add project settings to improve developer experience Add details to pyproject.toml to improve the developer experience including: * Configuring pytest test timeouts, ignoring external warnings and asyncio scope. * Disabling ruff formatting * Creating developer package targets: dev, docker and test * Leverage chaining to simplify maintenance of the all group ------- fix: WebScrapingStrategy ascrap Fix the ascrap command by calling scrap instead of _scrap which misses a lot of the functionality. ------- fix: scraping local sites Fix the scraping of local sites by removing the check for dot in parsed network location. ------- fix: monitor support on pseudo terminals Fix monitor support on pseudo terminals by setting the default for enable_ui to the value of sys.stdin.isatty(), this ensures it works under pytest if not specifically set. ------- fix: imports Add missing and remove unused imports as well as eliminating the use of wildcard imports. ------- fix: dependencies Add missing optional dependencies. ------- chore: remove deprecated licence tags Update setup.py and pyproject.toml to remove deprecated license tags, bumping the dependency to support the new method. The eliminates toml lint warning in pyproject.toml. ------- fix: delay on browser close Wait for spawned browser process, preventing it from becoming defunct and delaying the clean up until max retries have been reached. ------- fix: invalid use of firefox via cdp Remove firefox from the list of supported internal browsers as it doesn't support CDP via playwright. ------- feat: add support for ephemeral debug port Add support for using an ephemeral debug port for the browser, eliminating the need to set a specific port and preventing port conflicts. ------- core: add TODOs for issues noticed Add TODOs for issues noticed during implementation of these fixes but not needed to address test failures of the code.

Fix tests to run fully under pytest, this includes: - Fixing filenames, removing dots and correct typos - Removing __init__ methods, which are not supported by pytest - Implement parametrisation so tests can be run individually - Added timeouts so tests can't run forever - Replacing print and logging with assertions to prevent false successes - Removing unused and add missing imports - Mark tests with @pytest.mark.asyncio where appropriate - Use http constants to avoid magic numbers - Add type hints to improve linting and identify issues - Use local server for API tests to improve debugging and eliminate docker dependency - Call pytest in __main__ to allow running tests from command line - Skip broken tests - Fix out of date logic and invalid method parameters - Re-enable disabled and commented out tests after fixing them - Added missing test data - Updated tests that depend on altered external css or html structure - Automatically skip if tests if API key is not set If you need to debug a test, which will take time, you will need to comment out the default timeout in pyproject.toml under [tool.pytest.ini_options].

Re-raise RateLimitError if retries are exhausted in perform_completion_with_backoff, which fixes issues with callers requesting usage information when the model is rate limited. Also eliminate unnecessary catch of Exception and clean up the flow of the function. Fixes: unclecode#989

Fix BFSDeepCrawlStrategy processing URLs that vary based on base domain or port so they only process once. The common case for this is www.example.com vs example.com but it also addresses https://example.com vs https://example.com:443. Fixes unclecode#843

Add missing field end_time to TraversalStats what was being used by callers.

Add type hints to StringCompatibleMarkdown to improve code completion and readability.

Fix typo in LLMExtractionStrategy comment.

Improve exception reporting by raising from the original exception. Remove unnecessary catch and re-raise. Use traceback.format_exc to get the original exception message instead of a custom method which resulted in important information being lost.

Add the option to set the dispatcher for deep crawl strategies allowing it to be customized for different use cases.

Fix external URLs being returned in deep crawl results when included_external is set to False. This was caused by sites that redirect from internal URLs to external ones.

Fix max_pages being exceeded in deep crawl strategies. This was caused by in flight requests not being counted towards the max_pages limit.

Make crawl_url private to avoid exposing it in the public API eliminating warnings about conflicting method definitions.

Fix the retry logic in the RateLimiter to actually retry the request when the rate limit is hit. The previous implementation was not actually retrying the request, which could lead to unnecessary failures. Eliminate retry placeholders from the final results of deep crawls as they are not useful to the caller. Add doc comments to RateLimiter.

Handle common ratelimit headers instead of using a fixed exponential backoff. This prevents failures when the server returns a 429 or 503 status codes and headers indicating when the request can be retried and that is after we would have calculated with the exponential backoff.

Handle monitor methods in BaseDispatcher to avoid having to check if the monitor is set everywhere.

Disable UserWarning for code we don't control, which was making it hard to see our own warnings when running tests.

Replace the existing rate limiting mechanism with a token bucket algorithm to improve request handling efficiency and control. The previous implementation used a simple last request and current delay calculation, which could lead to uneven request distribution when multiple requests were made in quick succession. The result of this was the more links discovered by as single request the more likely it was that we would trigger 429 and 503 response codes, which combined with max_retries would cause the crawler to fail to successfully process all pages. The new implementation uses a token bucket algorithm, which allows for more flexible and efficient rate limiting. The token bucket algorithm allows for bursts of requests while maintaining an average rate over time. This combined with active rate limit header processing results in a much higher success rate for crawls. In addition the option to disable max retry limit, the new default, has been tested to confirm 100% success rate for crawls on a site which performs active rate limiting and previously resulted in a large number of failures. Finally the rate limiter starts with default rate which will be adjusted per domain based on the rate limit headers returned by the server.

coderabbitai · 2025-05-10T09:31:26Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

stevenh · 2025-05-10T09:32:37Z

This builds on top of other changes, which haven't yet been merged so would need that done first, hence leaving as draft.

stevenh added 20 commits April 10, 2025 14:58

chore: add relevant tests

f8da894

Add relevant tests for the following: - markdown generator - http crawler strategy

chore: add ruff settings to pyproject.toml

712b033

Add ruff settings to pyproject.toml to prevent deprecation warnings beyond our control and to disable formatting for now to avoid the noise it would generate.

fix: add end_time to TraversalStats

101d70f

Add missing field end_time to TraversalStats what was being used by callers.

chore: add hints to StringCompatibleMarkdown

b98d3a2

Add type hints to StringCompatibleMarkdown to improve code completion and readability.

chore: fix comment typo

2f95084

Fix typo in LLMExtractionStrategy comment.

chore: improve exception reporting

67bb96c

Improve exception reporting by raising from the original exception. Remove unnecessary catch and re-raise. Use traceback.format_exc to get the original exception message instead of a custom method which resulted in important information being lost.

feat: configurable dispatcher

ba63573

Add the option to set the dispatcher for deep crawl strategies allowing it to be customized for different use cases.

fix: externals in deep crawl results when disabled

0f4a4bc

Fix external URLs being returned in deep crawl results when included_external is set to False. This was caused by sites that redirect from internal URLs to external ones.

fix: max_pages in deep crawl

8bc8f84

Fix max_pages being exceeded in deep crawl strategies. This was caused by in flight requests not being counted towards the max_pages limit.

fix: crawl_url method definition warnings

cc27444

Make crawl_url private to avoid exposing it in the public API eliminating warnings about conflicting method definitions.

chore: simplify dispatcher monitor calls

170b333

Handle monitor methods in BaseDispatcher to avoid having to check if the monitor is set everywhere.

chore: disable warnings for code we don't control

1a30ecb

Disable UserWarning for code we don't control, which was making it hard to see our own warnings when running tests.

stevenh mentioned this pull request May 10, 2025

[Bug]: RateLimiter provides ineffective protection against failures #1095

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: token bucket rate limiting #1094

feat: token bucket rate limiting #1094

stevenh commented May 10, 2025 •

edited

Loading

coderabbitai bot commented May 10, 2025

Review skipped

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

stevenh commented May 10, 2025

feat: token bucket rate limiting #1094

Are you sure you want to change the base?

feat: token bucket rate limiting #1094

Conversation

stevenh commented May 10, 2025 • edited Loading

Summary

How Has This Been Tested?

Checklist:

coderabbitai bot commented May 10, 2025

Review skipped

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

stevenh commented May 10, 2025

stevenh commented May 10, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)