Improved delay handling #57
Open
EgbertW wants to merge 1 commit into yasserg:master from
Note: I recently discovered that you moved from Google Code to GitHub. I've been using and modifying crawler4j for a while, and I can now submit pull requests so that you can merge my changes. I rebased all my (relevant) patches onto the new master, so they should be easy to merge. I hope you find some or all of them useful.
Description of this patch:
The original version applied the politeness delay per CrawlController / PageFetcher instance, which means that any two requests are at least 200 ms apart regardless of their target. Such a delay only makes sense for requests to the same host. The new approach stores the last fetch time for each hostname and maintains a list of recently visited hosts. This ensures that requests to the same host are at least 200 ms apart, while multiple requests to different hosts can be issued within the same millisecond. To make optimal use of this, getNextURLs should return pages from the frontier that are spread over the various hosts. For this, a method has been added that returns the best URL to visit now, given a selection of URLs. It checks the delay that each available URL would incur; if any URL has a delay of 0, that URL is returned. If every URL would incur a delay, the one with the lowest delay is returned. This reduces the delay significantly.
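For readers who want a concrete picture, here is a minimal sketch of the idea described above. It is not the code from this patch: the class `HostDelayTracker` and its methods `delayFor`, `reserve` and `bestUrl` are hypothetical names, the current time is passed in explicitly to keep the example deterministic, and the list of recently visited hosts (presumably used to expire old entries) is left out.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a per-host politeness delay; not the patch's actual code. */
class HostDelayTracker {
    private final long politenessDelayMs;
    // Host name -> time (ms) of the last fetch scheduled for that host.
    private final Map<String, Long> lastFetch = new HashMap<>();

    HostDelayTracker(long politenessDelayMs) {
        this.politenessDelayMs = politenessDelayMs;
    }

    /** Milliseconds a request to this URL would have to wait if issued at nowMs. */
    synchronized long delayFor(String url, long nowMs) {
        Long last = lastFetch.get(hostOf(url));
        return last == null ? 0 : Math.max(0, last + politenessDelayMs - nowMs);
    }

    /** Records a fetch for the URL's host and returns how long the caller must wait first. */
    synchronized long reserve(String url, long nowMs) {
        long wait = delayFor(url, nowMs);
        lastFetch.put(hostOf(url), nowMs + wait);
        return wait;
    }

    /** Returns the first candidate with zero delay, else the one with the lowest delay. */
    synchronized String bestUrl(List<String> candidates, long nowMs) {
        String best = null;
        long bestDelay = Long.MAX_VALUE;
        for (String url : candidates) {
            long delay = delayFor(url, nowMs);
            if (delay == 0) {
                return url; // no waiting needed, take it immediately
            }
            if (delay < bestDelay) {
                bestDelay = delay;
                best = url;
            }
        }
        return best; // null only when the candidate list is empty
    }

    // URI.create throws IllegalArgumentException for malformed URLs;
    // URLs without a host component all share the empty-string key.
    private static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }
}
```

A fetcher thread would call `bestUrl` on its batch of pending URLs, then `reserve` on the chosen one, and sleep for the returned number of milliseconds before issuing the request. Requests to different hosts never wait on each other in this scheme; only repeated requests to the same host are spaced out.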