fix(feeds): unblock feed-validation workflow (split SSRF-drift from third-party state) (#3885)
* fix(feeds): unblock feed-validation workflow
The Feed Validation workflow (.github/workflows/feed-validation.yml) has
been failing 100% of runs since it landed in PR #3872 — every push to
main + every scheduled 6h run. Five root causes, all addressed here:
1. fast-xml-parser default entity-expansion limit was tripping on
legitimate large feeds (Guardian, Fox, Axios, CISA, WHO, MIT,
Defense One, Folha, El País, iefimerida, GitHub Trending,
Dev.to, Oryx OSINT, …). We only read date strings from the
parsed doc, so processEntities:false is safe and recovers all
17 false-positive DEAD rows.
2. 10 hosts referenced from src/config/feeds.ts were absent from
the 5-file allowlist mirror set (shared/rss-allowed-domains.{json,cjs},
scripts/shared/rss-allowed-domains.json, api/_rss-allowed-domains.js,
vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). Added: abcnews.go.com +
abcnews.com (feeds.abcnews.com → abcnews.go.com → abcnews.com
two-hop chain), www.corriere.it, www.rt.com, www.alarabiya.net,
tuoitrenews.vn, www.yonhapnewstv.co.kr, www.chosun.com,
rss.libsyn.com, feeds.megaphone.fm, rss.art19.com. The same
allowlist gates the prod Edge rss-proxy, so this also silently
restores access to these feeds for live users.
3. BBC Persian was declared as plaintext http://, rejected by the
--ci https-only guard. Updated to the canonical
https://feeds.bbci.co.uk/persian/rss.xml (server-side mirror
already had this).
4. Tom's Hardware /feeds/all redirects to http://… on the same
host, tripping the per-hop https guard. The canonical https
path is /feeds.xml — switched both client and server mirrors.
5. Validator was hard-failing on any STALE-or-DEAD row, which made
the workflow noise floor unbearable: 8 stale + 32 dead = 40
reasons to be red, of which only 10 were actionable. Split the
exit policy: HARD-FAIL on config/SSRF-guard drift (allowlist
miss, plaintext URL, redirect loop) so future drift is loud,
SOFT-FAIL (exit 0 with warn) on third-party 4xx/timeouts/STALE
so a feed disappearing upstream doesn't page anyone. Promoting
third-party failures to hard-fail can wait for a registry
grooming PR.
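The exit-policy split in item 5 can be sketched roughly as below. This is
an illustration, not the code in validate-rss-feeds.mjs: the row shape
({ name, status, configDrift }) and the name decideExit are assumptions
made for the example.

```javascript
// Hypothetical row shape: { name, status, configDrift }.
// status is one of 'OK' | 'STALE' | 'DEAD' | 'EMPTY'; configDrift is true
// when the failure is something the repo can fix (allowlist miss,
// plaintext URL, redirect loop).
function decideExit(rows) {
  // Config/SSRF-guard drift is actionable in-repo, so it fails the run.
  const hard = rows.filter((row) => row.configDrift);
  // Third-party rot (4xx, timeouts, stale feeds) is only reported.
  const soft = rows.filter(
    (row) => !row.configDrift && (row.status === 'STALE' || row.status === 'DEAD'),
  );
  return { exitCode: hard.length > 0 ? 1 : 0, hard, soft };
}

// Usage sketch: warn on soft rows, let only hard rows decide the exit code.
const sampleRows = [
  { name: 'upstream 404', status: 'DEAD', configDrift: false },
  { name: 'quiet feed', status: 'STALE', configDrift: false },
  { name: 'healthy feed', status: 'OK', configDrift: false },
];
const result = decideExit(sampleRows);
for (const row of result.soft) console.warn(`WARN ${row.status}: ${row.name}`);
```

With this shape, a feed disappearing upstream only adds a warning, while a
single config-drift row is enough to turn the run red.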
Also reduces the scheduled cadence from every 6h to daily at 00:00 UTC.
Running 4× as often added no value: feed outages rarely change faster
than once a day, and 542 feeds × 4 runs/day wasted runner minutes and
third-party fetch volume.
Local validator result (after the fix):
Summary: 512 OK, 10 stale, 6 dead, 13 empty, 1 skipped
Exit: 0 (no config drift). 6 remaining DEAD are all genuine
third-party state (Brasil Paralelo 404, EIA Reports 404 [duplicate
entry], News24 403, Tuoi Tre + Al Arabiya unreachable from this
network) — candidates for a future registry-cleanup PR.
Test coverage: tests/feeds-client-server-parity.test.mjs,
tests/feed-resolution.test.mts, tests/feeds-time-gate.test.mts —
all green. Full test:data suite — green.
* fix(feeds): address Greptile P1 + P2 on validate-rss-feeds.mjs
P1: FAIL message claimed "4 allowlist mirrors" but the codebase enforces 5
(scripts/shared/rss-allowed-domains.json was missing from the list).
A developer following the message would skip that mirror and hit a
puzzling secondary failure from the `test:data` scripts/shared parity
check. Listed all 5 mirrors.
P2: The isConfigDrift predicate was prefix-matching against literal
copies of the error message strings thrown by assertCiAllowed and
fetchFeed. A future reword at either throwing site (e.g. dropping the
"(--ci)" annotation, rewording the redirect error) would silently
reclassify config drift as third-party rot, demoting a hard fail to a
soft warning. Centralised the sentinel strings as a frozen
CONFIG_DRIFT_REASONS object that both the throwing sites and the
classifier consume — rename a reason and BOTH consumers move in
lockstep. INVALID_URL is also now properly classified as config drift
(was previously falling through to soft-fail despite being an actor-
fixable bug in feeds.ts).
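A minimal sketch of the shared-sentinel pattern described in P2, assuming
hypothetical reason strings and key names. Only CONFIG_DRIFT_REASONS,
INVALID_URL and isConfigDrift are names taken from this commit;
assertCiAllowedSketch is a stand-in for the real throwing sites, whose
signatures are not shown here.

```javascript
// Hypothetical wording; the real strings live in validate-rss-feeds.mjs.
const CONFIG_DRIFT_REASONS = Object.freeze({
  NOT_ALLOWLISTED: 'host not in allowlist',
  PLAINTEXT_URL: 'non-https URL',
  REDIRECT_LOOP: 'too many redirects',
  INVALID_URL: 'invalid URL',
});

// Stand-in throwing site: every message starts with a shared sentinel.
function assertCiAllowedSketch(rawUrl, allowlist) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    throw new Error(`${CONFIG_DRIFT_REASONS.INVALID_URL}: ${rawUrl}`);
  }
  if (url.protocol !== 'https:') {
    throw new Error(`${CONFIG_DRIFT_REASONS.PLAINTEXT_URL}: ${rawUrl}`);
  }
  if (!allowlist.has(url.hostname)) {
    throw new Error(`${CONFIG_DRIFT_REASONS.NOT_ALLOWLISTED}: ${url.hostname}`);
  }
}

// Classifier: reads the same object, so rewording a reason in one place
// updates the throwing site and the classifier together.
const DRIFT_PREFIXES = Object.values(CONFIG_DRIFT_REASONS);
function isConfigDrift(reason) {
  return typeof reason === 'string'
    && DRIFT_PREFIXES.some((prefix) => reason.startsWith(prefix));
}
```

Because neither side holds its own literal copy of a message, a reword
cannot silently demote a hard fail to a soft warning.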
Tested:
- End-to-end run: 512 OK / 9 stale / 7 dead / 13 empty, EXIT=0
- Classifier unit test: all 8 representative cases correct
(4 config-drift reasons → true, 4 third-party reasons → false)
* fix(feeds): allowlist idp.nature.com (Nature SSO redirect hop)
Nature publishes a session/IP-conditional redirect chain on
feeds.nature.com — on some networks the request lands directly at
www.nature.com/nature.rss (both already in the allowlist), but on
others (apparently GitHub Actions runner IPs) Nature inserts an
idp.nature.com SSO/identity-provider hop:
feeds.nature.com → idp.nature.com → www.nature.com/nature.rss
The validator's per-hop allowlist re-check fails on idp.nature.com.
Adding it to all 4 hand-maintained mirrors (+ the .cjs that auto-syncs
via require) closes the gap.
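The per-hop re-check can be pictured with the sketch below. It assumes the
redirect chain has already been collected as a list of URLs; the helper
name firstDisallowedHop and the paths on the first two hops are
illustrative, not taken from the validator.

```javascript
// Returns the first hostname in the redirect chain that is missing from
// the allowlist, or null when every hop is allowed.
function firstDisallowedHop(hopUrls, allowlist) {
  for (const hop of hopUrls) {
    const host = new URL(hop).hostname;
    if (!allowlist.has(host)) return host;
  }
  return null;
}

// The chain seen from some network egress paths (paths on the first two
// hops are placeholders).
const natureChain = [
  'https://feeds.nature.com/',
  'https://idp.nature.com/',
  'https://www.nature.com/nature.rss',
];

const beforeFix = new Set(['feeds.nature.com', 'www.nature.com']);
const afterFix = new Set([...beforeFix, 'idp.nature.com']);

const blockedBefore = firstDisallowedHop(natureChain, beforeFix);
const blockedAfter = firstDisallowedHop(natureChain, afterFix);
```

Checking every hop rather than only the first and last URL is what
surfaces an intermediate host such as idp.nature.com.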
Same shape as the abcnews.go.com fix on the original PR — the lesson is
that allowlist audits done from a developer laptop can miss intermediate
redirect hops that only appear under different network egress paths.
Documented in worldmonitor-architecture-gotchas/.../multi-hop-redirect-
chain-needs-every-host-in-allowlist.md (skill added in PR #3885 first
round).
Also addresses the reviewer's second P1 finding (Invalid URL being
soft-fail instead of hard-fail): already fixed in 8a34058 — the
reviewer's audit was against the pre-Greptile commit cbb80e1.
INVALID_URL is now in CONFIG_DRIFT_REASONS and isConfigDrift hard-fails
malformed registry entries.