Summary
The planet scraper (via the river library) has two issues causing avoidable feed fetch failures:
1. HTTP timeout is too short (3 seconds)
The timeout in river's lib/http.ml is hardcoded to 3 seconds. Several feeds that are reachable still time out in CI, some on every run and others intermittently:
https://ocamlpro.com/blog/feed — works (200) but slow
https://mirage.io/feed.xml — works (200) but slow
https://hannes.robur.coop/atom — intermittent
https://blog.robur.coop/feed.xml — intermittent
https://jon.recoil.org/atom.xml — intermittent
Proposed fix: increase the timeout to 10 seconds.
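As a rough sketch of what a 10 second limit could look like, the snippet below races a request against a timer using Lwt. It is not river's actual lib/http.ml code: the names `timeout_seconds`, `with_timeout` and `fetch` are hypothetical, and it assumes the `lwt.unix` and `cohttp-lwt-unix` libraries are available.

```ocaml
(* Sketch only, not river's real code. Assumes lwt.unix and cohttp-lwt-unix. *)
open Lwt.Syntax

(* Hypothetical constant holding the proposed 10 second limit. *)
let timeout_seconds = 10.0

(* Resolve with [Ok v] if [promise] finishes within [seconds],
   otherwise with [Error `Timeout]. *)
let with_timeout seconds promise =
  let timer =
    let* () = Lwt_unix.sleep seconds in
    Lwt.return (Error `Timeout)
  in
  (* Lwt.pick takes whichever promise resolves first and cancels the other. *)
  Lwt.pick [ Lwt.map (fun v -> Ok v) promise; timer ]

(* Hypothetical fetch helper: download [url] and return its body as a string. *)
let fetch url =
  let request =
    let* (_resp, body) = Cohttp_lwt_unix.Client.get (Uri.of_string url) in
    Cohttp_lwt.Body.to_string body
  in
  with_timeout timeout_seconds request
```

Keeping the limit in a single named value would also make it easy to turn into an optional argument later, if a per-feed timeout is ever wanted.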
2. No User-Agent header
The scraper does not set a User-Agent header on its HTTP requests and relies on the cohttp default. Some sites behind Cloudflare or similar CDNs reject requests that lack a recognized user agent:
https://priver.dev/tags/ocaml/index.xml — returns 403 Forbidden from CI, but 200 with a browser user agent
Proposed fix: set a common browser User-Agent header on each request. Cohttp_lwt_unix.Client.get accepts an optional ~headers parameter.
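A minimal sketch of passing the header through `~headers` might look like the following. The User-Agent string is only a placeholder assumption, and the names `user_agent`, `headers` and `get` are hypothetical rather than taken from river.

```ocaml
(* Sketch only. Assumes the cohttp-lwt-unix library. *)

(* Placeholder value: substitute whichever browser-like string is agreed on. *)
let user_agent =
  "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

(* A header set containing only the User-Agent entry. *)
let headers = Cohttp.Header.init_with "User-Agent" user_agent

(* Hypothetical wrapper: every request made through it carries the header. *)
let get url = Cohttp_lwt_unix.Client.get ~headers (Uri.of_string url)
```

Whether a browser-like string or a descriptive one identifying the planet scraper is preferable is a separate decision; the mechanism is the same either way.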
How river is managed
river is pinned to a specific commit via ocamlorg.opam.template:
pin-depends: [
["river.dev" "git+https://github.com/aantron/river#476dc945a908a69548bddd267f143a3e5d9c8a1a"]
]
This is a fork of kayceesrk/river. To apply the fixes:
- Submit a PR to aantron/river (or kayceesrk/river) with the timeout and User-Agent changes
- Update the pin hash in ocamlorg.opam.template to the new commit
Context