Retry Improvements + Rate Limit Support #758

Open
@ikreymer

Following up on #132 (and also #392 and #360), we need a more sophisticated retry strategy, which should also consider what to do with rate-limiting status codes.
We already have --failOnInvalidStatus, --maxPageRetries, --failOnFailedSeed and --failOnFailedLimit, and probably need to add a few more flags.

This is getting slightly messy, but hopefully there's a clear path to figure this out.

There are a few options to consider:

  • Which status codes should be counted as page failures, for purposes of ending the crawl?
  • Which status codes should result in retrying the page?
  • Should capture of pages with invalid status codes be skipped when they will be retried?
  • Which status codes should result in slowing down the crawl, i.e. adding a delay before loading those pages again when retrying?

It's probably useful to list the various use cases:

  • The crawler should treat 4xx and 5xx responses as failed, possibly with a way to customize which status codes are included.
  • The crawler should fail the crawl if a certain number of pages have failed or if any of the seeds have failed.
  • The crawler should retry failed pages a certain number of times, possibly customizing which status codes are eligible for retries.
  • The crawler should not write any data for pages that are being retried, until the final retry.
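
To make these use cases concrete, here is a minimal sketch of how a single page load could be classified. It is not taken from the crawler's code: `decidePage`, `PageDecision` and all parameter names are hypothetical, and it assumes that every 4xx/5xx status is "invalid" and that a page only counts as failed once no retries remain.

```typescript
// Hypothetical sketch; none of these names exist in the crawler's codebase.

type PageDecision = {
  failed: boolean; // counts toward a failure limit such as --failOnFailedLimit
  retry: boolean; // the page should be queued for another attempt
  writeData: boolean; // the capture for this attempt should be written
};

function decidePage(
  status: number,
  attempt: number, // 0 for the first load, 1 for the first retry, ...
  maxRetries: number, // e.g. the value of --maxPageRetries
  retryStatusCodes: Set<number>, // e.g. parsed from a --retryStatusCodes flag
): PageDecision {
  // Assumption: all 4xx and 5xx responses are treated as invalid.
  const invalid = status >= 400 && status <= 599;
  if (!invalid) {
    return { failed: false, retry: false, writeData: true };
  }

  // Retry only eligible status codes, and only while retries remain.
  const retry = retryStatusCodes.has(status) && attempt < maxRetries;

  // While a retry is pending, nothing is written and the page is not yet
  // counted as failed. On the final attempt the data is written and the
  // failure is counted.
  return { failed: !retry, retry, writeData: !retry };
}
```

For example, with `maxRetries = 2` and `retryStatusCodes = {429, 503}`, a 503 on the first load is retried without writing data, a 503 on the third load is written and counted as failed, and a 404 is written and counted as failed immediately.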

With this in mind, we should probably add at least:

  • A --retryStatusCodes flag, which indicates which status codes will be retried.
  • Is there a need to also specify --invalidStatusCodes, separate from --retryStatusCodes? Leaning against it.
  • Is there a need to also specify whether failed pages that are being retried should be captured to WARC? Sort of leaning against it as well, since retries are part of the capture process.
  • How to handle rate limiting, e.g. add exponential backoff via pageExtraDelay for certain status codes, like 429, 503, maybe 403, possibly using Retry-After if available (from "Slow down + retry on HTTP 429 errors", #392).
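
As a rough illustration of the rate-limiting idea, the sketch below computes a delay before retrying a page: it prefers a Retry-After header when one is present and parseable, and otherwise falls back to exponential backoff. The function names, the default base and maximum delays, and the choice of 429 and 503 as rate-limit codes are all assumptions made for this example; how the result would feed into pageExtraDelay is left open.

```typescript
// Hypothetical sketch; names and defaults are illustrative only.

// Assumption: only 429 and 503 are treated as rate-limit signals here.
const RATE_LIMIT_STATUS_CODES = new Set<number>([429, 503]);

// Retry-After is either a number of seconds or an HTTP date.
// Returns the delay in seconds, or null if the header is missing or unparseable.
function parseRetryAfter(value: string | null, nowMs: number): number | null {
  if (value === null) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return null;
  // A date in the past means no further waiting is needed.
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}

function retryDelaySeconds(
  status: number,
  attempt: number, // 0 for the first retry, 1 for the second, ...
  retryAfter: string | null, // raw Retry-After header value, if any
  baseDelay = 1, // seconds; assumed default
  maxDelay = 300, // seconds; assumed cap
  nowMs: number = Date.now(),
): number {
  if (!RATE_LIMIT_STATUS_CODES.has(status)) return 0;

  // Prefer the server's own hint, capped at maxDelay.
  const fromHeader = parseRetryAfter(retryAfter, nowMs);
  if (fromHeader !== null) return Math.min(fromHeader, maxDelay);

  // Otherwise double the delay on each attempt: base, 2*base, 4*base, ...
  return Math.min(baseDelay * 2 ** attempt, maxDelay);
}
```

With the assumed defaults, a 429 without a Retry-After header waits 1, 2, 4, 8, ... seconds on successive retries, capped at 300, while a 503 with `Retry-After: 120` waits 120 seconds regardless of the attempt number.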
