Improved delay handling #57

Open
EgbertW wants to merge 1 commit into yasserg:master from EgbertW:delay-handling

EgbertW commented May 20, 2015

Note: I recently discovered that you moved from Google Code to GitHub. I've been using and modifying crawler4j for a while, and now I can submit pull requests so that you can merge my changes. I rebased all my (relevant) patches on the new master, so they should be easy to merge. I hope you find some or all of them useful.

Description of this patch:
The original version had a single politeness delay per CrawlController / PageFetcher instance, which means that any two requests are at least 200 ms apart, regardless of host. This only makes sense for requests to the same host. The new approach stores the last fetch time for each hostname and maintains a list of recently visited hosts, making sure that requests to the same host are at least 200 ms apart while allowing multiple requests within the same millisecond to different hosts. To make optimal use of this, the URLs returned by getNextURLs from the frontier should be spread over the various hosts. For this, a method has been added that returns the best URL to visit now, given a selection of URLs. It checks the delay that each available URL would incur, and if any URL has a delay of 0, that URL is returned. If every URL would incur a delay, the one with the lowest delay is returned. This reduces the delay significantly.
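
To illustrate the idea, here is a minimal sketch of per-host delay tracking and best-URL selection. It is not the code from this patch: the class `HostDelayTracker` and its methods `delayFor`, `recordFetch` and `selectBestUrl` are hypothetical names, the current time is passed in explicitly to keep the example easy to test, and the pruning of the recently-visited host list is left out.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the class added by this patch. */
final class HostDelayTracker {
    private final long politenessDelayMs;
    // Last fetch time (in ms) per hostname. A real implementation would also
    // prune hosts that have not been visited recently.
    private final Map<String, Long> lastFetchMs = new HashMap<>();

    HostDelayTracker(long politenessDelayMs) {
        this.politenessDelayMs = politenessDelayMs;
    }

    /** Milliseconds to wait before this host may be fetched again (0 if none). */
    synchronized long delayFor(String host, long nowMs) {
        Long last = lastFetchMs.get(host);
        if (last == null) {
            return 0L; // host not seen yet, no delay needed
        }
        return Math.max(0L, last + politenessDelayMs - nowMs);
    }

    /** Remembers that the host was fetched at the given time. */
    synchronized void recordFetch(String host, long nowMs) {
        lastFetchMs.put(host, nowMs);
    }

    /**
     * Returns the first URL that can be fetched without delay; otherwise the
     * URL with the lowest delay. Returns null for an empty list.
     */
    synchronized String selectBestUrl(List<String> urls, long nowMs) {
        String best = null;
        long bestDelay = Long.MAX_VALUE;
        for (String url : urls) {
            String host = URI.create(url).getHost();
            long delay = delayFor(host, nowMs);
            if (delay == 0L) {
                return url;
            }
            if (delay < bestDelay) {
                bestDelay = delay;
                best = url;
            }
        }
        return best;
    }
}
```

With a 200 ms delay, a fetch from one host does not hold back a URL on another host, and when every candidate host was fetched recently the URL with the shortest remaining wait is chosen.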

From the commit message: spreading the URLs returned by getNextURLs from the frontier over the various hosts is still a todo.