oh-bugimporters should do per-domain backoff #81

@ghost

Description

Comment by paulproteus:

Some bug trackers (openhatch.org/bugs/ especially...) respond with HTTP 504 Gateway Timeout if you request more than 1-2 bugs per second.

Scrapy currently handles this in the RetryMiddleware
(http://doc.scrapy.org/en/0.12/topics/downloader-middleware.html#module-
scrapy.contrib.downloadermiddleware.retry), which re-queues the request but
doesn't impose any time delay before retrying.

It'd be nice to have a custom RetryMiddleware that did per-domain backoff. (Note
that we're sort of abusing the Scrapy architecture; we're supposed to have one
"spider" class per domain, but instead we only have one.)

One way to do this is to provide a custom subclass of
scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and then override the
_retry method.

That should let us more reliably crawl some of the sites that are quite finicky.
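As a rough sketch of the per-domain backoff state such a middleware would need: the `DomainBackoff` class below is hypothetical (not part of Scrapy), and the `_retry` override would consult it to delay the re-queued request (e.g. via the downloader slot's delay rather than a blocking sleep, since Scrapy runs on Twisted). Only the bookkeeping is shown here; wiring it into `scrapy.contrib.downloadermiddleware.retry.RetryMiddleware` is left as the integration step the comment describes.

```python
# Hypothetical per-domain exponential backoff tracker. A custom
# RetryMiddleware subclass could call record_failure() from its
# _retry() override and use delay_for() to schedule the retried
# request after a per-domain delay.
from urllib.parse import urlparse


class DomainBackoff:
    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._failures = {}  # domain -> consecutive failure count

    def record_failure(self, url):
        domain = urlparse(url).netloc
        self._failures[domain] = self._failures.get(domain, 0) + 1

    def record_success(self, url):
        # A successful fetch resets the backoff for that domain.
        self._failures.pop(urlparse(url).netloc, None)

    def delay_for(self, url):
        # Exponential backoff: base * 2^(failures - 1), capped at max_delay.
        failures = self._failures.get(urlparse(url).netloc, 0)
        if failures == 0:
            return 0.0
        return min(self.base_delay * (2 ** (failures - 1)), self.max_delay)
```

Because all state is keyed by domain, a single spider crawling many trackers (as we do) backs off only the domain that returned the 504, not the whole crawl.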


Status: unread
Nosy List: paulproteus
Priority: wish
Imported from roundup ID: 793
Last modified: 2012-11-20.16:04:43
