fix(feeds): unblock feed-validation workflow (split SSRF-drift from third-party state)#3885
Conversation
The Feed Validation workflow (.github/workflows/feed-validation.yml) has been failing 100% of runs since it landed in PR #3872 — every push to main + every scheduled 6h run. Five root causes, all addressed here: 1. fast-xml-parser default entity-expansion limit was tripping on legitimate large feeds (Guardian, Fox, Axios, CISA, WHO, MIT, Defense One, Folha, El País, iefimerida, GitHub Trending, Dev.to, Oryx OSINT, …). We only read date strings from the parsed doc, so processEntities:false is safe and recovers all 17 false-positive DEAD rows. 2. 10 hosts referenced from src/config/feeds.ts were absent from the 5-file allowlist mirror set (shared/rss-allowed-domains.{json,cjs}, scripts/shared/rss-allowed-domains.json, api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). Added: abcnews.go.com + abcnews.com (feeds.abcnews.com → abcnews.go.com → abcnews.com two-hop chain), www.corriere.it, www.rt.com, www.alarabiya.net, tuoitrenews.vn, www.yonhapnewstv.co.kr, www.chosun.com, rss.libsyn.com, feeds.megaphone.fm, rss.art19.com. The same allowlist gates the prod Edge rss-proxy, so this also silently restores access to these feeds for live users. 3. BBC Persian was declared as plaintext http://, rejected by the --ci https-only guard. Updated to the canonical https://feeds.bbci.co.uk/persian/rss.xml (server-side mirror already had this). 4. Tom's Hardware /feeds/all redirects to http://… on the same host, tripping the per-hop https guard. The canonical https path is /feeds.xml — switched both client and server mirrors. 5. Validator was hard-failing on any STALE-or-DEAD row, which made the workflow noise floor unbearable: 8 stale + 32 dead = 40 reasons to be red, of which only 10 were actionable. Split the exit policy: HARD-FAIL on config/SSRF-guard drift (allowlist miss, plaintext URL, redirect loop) so future drift is loud, SOFT-FAIL (exit 0 with warn) on third-party 4xx/timeouts/STALE so a feed disappearing upstream doesn't page anyone. Promoting third-party failures to hard-fail can wait for a registry grooming PR. Also bumps the scheduled cadence from every-6h to daily-00:00-UTC. 4× the discovery rate added zero value — feed outages don't change faster than once-a-day, and 542 feeds × 4 runs/day was wasted runner-minutes and third-party fetch volume. Local validator result (after the fix): Summary: 512 OK, 10 stale, 6 dead, 13 empty, 1 skipped Exit: 0 (no config drift). 6 remaining DEAD are all genuine third-party state (Brasil Paralelo 404, EIA Reports 404 [duplicate entry], News24 403, Tuoi Tre + Al Arabiya unreachable from this network) — candidates for a future registry-cleanup PR. Test coverage: tests/feeds-client-server-parity.test.mjs, tests/feed-resolution.test.mts, tests/feeds-time-gate.test.mts — all green. Full test:data suite — green.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR fixes five independent root causes that were causing the Feed Validation workflow to fail 100% since its launch: disabling entity expansion in the XML parser (recovers 17 false-positive DEAD rows), adding 11 missing redirect-chain domains to all allowlist mirrors, correcting two feed URLs (BBC Persian http→https, Tom's Hardware canonical path), and splitting the exit policy so only SSRF/config-guard violations are hard failures while third-party rot is downgraded to a warning.
Confidence Score: 4/5Safe to merge; the allowlist and URL fixes are correct and the exit-policy logic is sound. One error message omits a required mirror from its fix instructions, which would leave a developer chasing a secondary test failure. The core logic changes are well-reasoned and the five root-cause fixes are each independently verifiable. The only real defect is in the FAIL message: it tells developers to update four mirrors but the codebase enforces five. A developer acting on that message would leave the scripts mirror stale and hit a confusing secondary failure. scripts/validate-rss-feeds.mjs — the isConfigDrift predicate and the FAIL error message both deserve a second look. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[validate-rss-feeds.mjs --ci] --> B[extractFeeds from feeds.ts]
B --> C[runBatch: fetchFeed x 542 feeds]
C --> D{CI mode per-hop redirect check}
D -->|assertCiAllowed pass| E[fetch with redirect:manual]
D -->|Non-https scheme| F[DEAD: Non-https scheme rejected]
D -->|Host not in allowlist| G[DEAD: Host not in allowlist]
E -->|3xx| H{hop < MAX_HOPS=3?}
H -->|yes| D
H -->|no| I[DEAD: Too many redirects]
E -->|200 OK| J[parseNewestDate processEntities:false]
E -->|4xx/5xx/timeout| K[DEAD: HTTP N / Timeout]
J -->|no dates| L[EMPTY]
J -->|age > 30d| M[STALE]
J -->|age le 30d| N[OK]
F & G & I --> O{isConfigDrift?}
K --> O
O -->|yes| P[configDrift list]
O -->|no| Q[thirdPartyDead list]
P --> R{configDrift.length > 0?}
R -->|yes| S[console.error + EXIT 1 HARD FAIL]
R -->|no| T{stale/thirdPartyDead/empty?}
Q & M & L --> T
T -->|yes| U[console.warn + EXIT 0 SOFT FAIL]
T -->|no| V[EXIT 0 all green]
Reviews (1): Last reviewed commit: "fix(feeds): unblock feed-validation work..." | Re-trigger Greptile |
| `(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 4 ` + | ||
| `allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` + | ||
| `api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).` |
There was a problem hiding this comment.
The FAIL message tells developers to update "4 allowlist mirrors" but omits
scripts/shared/rss-allowed-domains.json. A developer following these instructions would skip that mirror, causing the test:data scripts/shared parity check to fail with a separate, puzzling error unrelated to the config-drift message they just read.
| `(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 4 ` + | |
| `allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` + | |
| `api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).` | |
| `(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 5 ` + | |
| `allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` + | |
| `scripts/shared/rss-allowed-domains.json, ` + | |
| `api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).` |
There was a problem hiding this comment.
Fixed in 8a34058 — listed all 5 mirrors in the FAIL message (added the missing scripts/shared/rss-allowed-domains.json).
| const isConfigDrift = (r) => | ||
| typeof r.detail === 'string' && ( | ||
| r.detail.startsWith('Host not in allowlist') || | ||
| r.detail.startsWith('Non-https scheme rejected') || | ||
| r.detail === 'Too many redirects' | ||
| ); |
There was a problem hiding this comment.
isConfigDrift coupling to error message strings
isConfigDrift matches on the literal text of err.message strings thrown by assertCiAllowed and fetchFeed. If either message is ever edited (e.g., to drop the (--ci) annotation or reword the redirect error), the predicate silently stops classifying those cases as config drift, and what should be a hard fail becomes a soft-fail warning instead. Consider centralising these sentinel strings as exported constants shared between the throwing and the matching sites.
There was a problem hiding this comment.
Fixed in 8a34058 — centralised the SSRF-guard sentinel strings as a frozen CONFIG_DRIFT_REASONS object at the top of the file. Both the throwing sites (assertCiAllowed, fetchFeed) and the isConfigDrift classifier now consume the same constants, so a future reword moves both in lockstep. Also picked up Invalid URL as a 4th config-drift reason (was previously slipping through to soft-fail despite being an actor-fixable bug in feeds.ts).
P1: FAIL message claimed "4 allowlist mirrors" but the codebase enforces 5
(scripts/shared/rss-allowed-domains.json was missing from the list).
A developer following the message would skip that mirror and hit a
puzzling secondary failure from the `test:data` scripts/shared parity
check. Listed all 5 mirrors.
P2: The isConfigDrift predicate was prefix-matching against literal
copies of the error message strings thrown by assertCiAllowed and
fetchFeed. A future reword at either throwing site (e.g. dropping the
"(--ci)" annotation, rewording the redirect error) would silently
reclassify config drift as third-party rot, demoting a hard fail to a
soft warning. Centralised the sentinel strings as a frozen
CONFIG_DRIFT_REASONS object that both the throwing sites and the
classifier consume — rename a reason and BOTH consumers move in
lockstep. INVALID_URL is also now properly classified as config drift
(was previously falling through to soft-fail despite being an actor-
fixable bug in feeds.ts).
Tested:
- End-to-end run: 512 OK / 9 stale / 7 dead / 13 empty, EXIT=0
- Classifier unit test: all 8 representative cases correct
(4 config-drift reasons → true, 4 third-party reasons → false)
Nature publishes a session/IP-conditional redirect chain on feeds.nature.com — on some networks the request lands directly at www.nature.com/nature.rss (both already in the allowlist), but on others (apparently GitHub Actions runner IPs) Nature inserts an idp.nature.com SSO/identity-provider hop: feeds.nature.com → idp.nature.com → www.nature.com/nature.rss The validator's per-hop allowlist re-check fails on idp.nature.com. Adding it to all 4 hand-maintained mirrors (+ the .cjs that auto-syncs via require) closes the gap. Same shape as the abcnews.go.com fix on the original PR — the lesson is that allowlist audits done from a developer laptop can miss intermediate redirect hops that only appear under different network egress paths. Documented in worldmonitor-architecture-gotchas/.../multi-hop-redirect- chain-needs-every-host-in-allowlist.md (skill added in PR #3885 first round). Also addresses the reviewer's second P1 finding (Invalid URL being soft-fail instead of hard-fail): already fixed in 8a34058 — the reviewer's audit was against the pre-Greptile commit cbb80e1. INVALID_URL is now in CONFIG_DRIFT_REASONS and isConfigDrift hard-fails malformed registry entries.
Summary
Feed Validation workflow has been failing 100% since launch (PR #3872) — every push to main + every scheduled run. Root-cause review turned up five independent issues; this PR fixes all five and is enough to flip the workflow green.
Local validator result after the fix:
(Was:
489 OK, 8 stale, 32 dead, 12 empty, 1 skipped, EXIT=1)Changes
processEntities: falseon the XML parser —fast-xml-parserv5's default entity-expansion limit was tripping on legit large feeds (Guardian, Fox, Axios, CISA, WHO, MIT, Defense One, Folha, El País, iefimerida, GitHub Trending, Dev.to, Oryx OSINT). We only extract date strings, never decode entity-bearing content. Recovers 17 false-positive DEAD rows.10 hosts added to the allowlist mirror set —
abcnews.go.com+abcnews.com(covers thefeeds.abcnews.comtwo-hop redirect chain),www.corriere.it,www.rt.com,www.alarabiya.net,tuoitrenews.vn,www.yonhapnewstv.co.kr,www.chosun.com,rss.libsyn.com,feeds.megaphone.fm,rss.art19.com. Synced across all 5 mirrors (shared/rss-allowed-domains.{json,cjs},scripts/shared/rss-allowed-domains.json,api/_rss-allowed-domains.js,vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). The same allowlist gates the prod Edge rss-proxy — also silently restores these feeds for live users.BBC Persian — declared as plaintext
http://feeds.bbci.co.uk/persian/tv-and-radio-37434376/rss.xml, rejected by the--cihttps-only guard. Switched to canonicalhttps://feeds.bbci.co.uk/persian/rss.xml. Server-side mirror already had this URL — fixing client-side drift.Tom's Hardware —
/feeds/all301s tohttp://on the same host, tripping the per-hop https guard. Canonical https path is/feeds.xml. Updated client + server mirrors.Exit-policy split — was: hard-fail on any STALE or DEAD row. Now:
Host not in allowlist,Non-https scheme rejected,Too many redirects) — these are bugs the maintainer can fix; staying loud catches future drift.Cron 6h → daily — 542 feeds × 4 runs/day was 4× the necessary discovery rate; feed outages don't change that fast and the prior workflow proved no one acts on the noise.
Remaining warnings (informational, not failing)
After the fix, these are still flagged but don't block:
EIA ReportsandEIA Press Room), News24 403, Tuoi Tre + Al Arabiya unreachable from this network.parseNewestDate()doesn't walk every date shape.These warrant a follow-up registry-grooming PR but don't gate this fix.
Test plan
npm run typecheck— greennpm run test:data— green (covers feeds-client-server-parity, feed-resolution, feeds-time-gate, 5-mirror allowlist parity)node scripts/validate-rss-feeds.mjs --cilocally — EXIT=0 with the new policy