Description
Problem description
While crawling, some sites require such a long crawl delay that YaCy
pauses crawling altogether and prints
HostQueue * waiting for www.irozhlas.cz: 8 seconds remaining...
to the log. This happens even when the queue is loaded with many other
(faster) hosts, which degrades the overall crawl speed. The crawler
waits for the slow host instead of crawling the faster ones in the meantime.
Desired behavior
The HostBalancer could skip the slow site for as long as its crawl delay
requires and, instead of waiting, crawl the remaining 'fast' domains. The
queue would then no longer be blocked.
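A minimal sketch of the idea, not YaCy's actual code: the class `DelayAwareBalancer`, its methods, and the `slow.example`/`fast.example` hosts are hypothetical names used only for illustration, and the real HostBalancer/HostQueue classes may be structured quite differently. Each host remembers the earliest time it may be fetched again, and `poll` skips hosts that are still waiting instead of sleeping on them.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not YaCy's HostBalancer API.
class DelayAwareBalancer {

    // Per-host state: pending URLs, required delay, earliest next fetch time.
    private static final class HostState {
        final long delayMillis;
        final Deque<String> urls = new ArrayDeque<>();
        long nextAllowedMillis = 0;

        HostState(long delayMillis) {
            this.delayMillis = delayMillis;
        }
    }

    private final Map<String, HostState> hosts = new LinkedHashMap<>();

    // Queue a URL; the delay is fixed when the host is first seen.
    void add(String host, long delayMillis, String url) {
        hosts.computeIfAbsent(host, h -> new HostState(delayMillis)).urls.add(url);
    }

    // Return a URL from a host whose delay has elapsed, preferring the host
    // that became ready earliest. Returns null if every pending host is waiting.
    String poll(long nowMillis) {
        HostState best = null;
        for (HostState s : hosts.values()) {
            if (s.urls.isEmpty() || s.nextAllowedMillis > nowMillis) {
                continue; // skip the waiting host instead of sleeping on it
            }
            if (best == null || s.nextAllowedMillis < best.nextAllowedMillis) {
                best = s;
            }
        }
        if (best == null) {
            return null;
        }
        best.nextAllowedMillis = nowMillis + best.delayMillis;
        return best.urls.poll();
    }

    // Time to sleep after poll() returned null; -1 if nothing is pending.
    long millisUntilNextReady(long nowMillis) {
        long min = -1;
        for (HostState s : hosts.values()) {
            if (s.urls.isEmpty()) {
                continue;
            }
            long wait = Math.max(0, s.nextAllowedMillis - nowMillis);
            if (min < 0 || wait < min) {
                min = wait;
            }
        }
        return min;
    }
}
```

With this shape the crawler would only sleep when `poll` returns null, and then only for `millisUntilNextReady`, so a single host with an 8 second delay no longer stalls the fast hosts.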
Probably also described in the forum
by @smokingwheels.
Might also be related to #524 and #638.
Example sites
irozhlas.cz
globalvoices.org
democracynow.org
eff.org
www.americanscientist.org
bangkoknews.net
...
Even my own web host sometimes triggers this mode, probably because it uses
the Apache module mod_evasive.
That is the host-side part of the picture.