Skip to content

Crawler - HostBalancer - Restrictive sites slow the multi-domain crawling #753

@okybaca

Description

@okybaca

Problem description

While crawling, some sites do require so much crawler delay, that YaCy
pauses the crawling, printing
HostQueue * waiting for www.irozhlas.cz: 8 seconds remaining...
to the log. This is done even when the queue is loaded with many other
(faster) hosts, resulting in degrading of overall crawl speed. Crawler is
waiting for the slow host, instead of crawling the faster in meantime.

Desired behavior

The HostBalancer could skip the slow site as many times as required by crawl
delay, and instead of waiting, crawl the rest of 'fast' domains. The queue
wouldn't be blocked then.

Described probably also in the forum
by @smokingwheels.

Might be related also to #524 and #638.

Example sites

irozhlas.cz
globalvoices.org
democracynow.org
eff.org
www.americanscientist.org
bangkoknews.net
...

Even my own webhost sometimes falls in this mode, probably because using
apache module mod_evasive.
Which is host-side part of perspective.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugIndicates an unexpected problem or unintended behaviorcrawler

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions