Scraper: increase HTTP timeout and set a User-Agent header #3513

@cuihtlauac

Summary

The planet scraper (via the river library) has two issues causing avoidable feed fetch failures:

1. HTTP timeout is too short (3 seconds)

The timeout in river's lib/http.ml is hardcoded to 3 seconds. Several feeds that are perfectly reachable time out in CI, some consistently and some intermittently:

  • https://ocamlpro.com/blog/feed — works (200) but slow
  • https://mirage.io/feed.xml — works (200) but slow
  • https://hannes.robur.coop/atom — intermittent
  • https://blog.robur.coop/feed.xml — intermittent
  • https://jon.recoil.org/atom.xml — intermittent

Proposed fix: increase the timeout to 10 seconds.
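
A minimal sketch of what a configurable timeout could look like, assuming the fetch is an Lwt promise. This is not river's actual code: `timeout_seconds` and `with_fetch_timeout` are hypothetical names, and the sketch relies on `Lwt_unix.with_timeout` from the lwt.unix library.

```ocaml
(* Requires the lwt and lwt.unix libraries. *)

(* Hypothetical constant: the issue proposes 10 seconds instead of 3. *)
let timeout_seconds = 10.0

(* Run [fetch ()] and return [Error `Timeout] if it does not finish within
   [seconds]. Any other exception is passed through unchanged. *)
let with_fetch_timeout ?(seconds = timeout_seconds) fetch =
  Lwt.catch
    (fun () -> Lwt.map (fun v -> Ok v) (Lwt_unix.with_timeout seconds fetch))
    (function
      | Lwt_unix.Timeout -> Lwt.return (Error `Timeout)
      | exn -> Lwt.fail exn)
```

Returning a result instead of raising would let the scraper log a slow feed and move on to the next one, though whether that fits river's error handling is an open question for the PR.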

2. No User-Agent header

The scraper sends HTTP requests without a User-Agent header (cohttp default). Some sites behind Cloudflare or similar CDNs reject requests without a recognized user agent:

  • https://priver.dev/tags/ocaml/index.xml — returns 403 Forbidden from CI, but 200 with a browser user agent

Proposed fix: set a common browser User-Agent header on each request. Cohttp_lwt_unix.Client.get accepts an optional ~headers parameter.
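
As a rough illustration of the ~headers approach, assuming the cohttp-lwt-unix library: `user_agent`, `default_headers`, and `get_with_user_agent` are hypothetical names, and the browser string is only a placeholder, since the issue does not say which value to send.

```ocaml
(* Requires the cohttp-lwt-unix library. *)

(* Placeholder value: the exact browser string is an assumption. *)
let user_agent =
  "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

(* Header set holding only the User-Agent field. *)
let default_headers = Cohttp.Header.init_with "User-Agent" user_agent

(* Fetch [url], passing the headers via the optional ~headers argument. *)
let get_with_user_agent url =
  Cohttp_lwt_unix.Client.get ~headers:default_headers (Uri.of_string url)
```

The call returns a promise of a response and body pair, so it could be combined with whatever timeout handling the fetch already uses.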

How river is managed

river is pinned to a specific commit via ocamlorg.opam.template:

pin-depends: [
  ["river.dev" "git+https://github.com/aantron/river#476dc945a908a69548bddd267f143a3e5d9c8a1a"]
]

This is a fork of kayceesrk/river. To apply the fixes:

  1. Submit a PR to aantron/river (or kayceesrk/river) with the timeout and User-Agent changes
  2. Update the pin hash in ocamlorg.opam.template to the new commit
