
fix(feeds): unblock feed-validation workflow (split SSRF-drift from third-party state)#3885

Merged
koala73 merged 3 commits into
mainfrom
fix/feed-validation-unblock
May 24, 2026

Owner

@koala73 koala73 commented May 24, 2026

Summary

The Feed Validation workflow has failed on 100% of runs since it launched in PR #3872: every push to main and every scheduled run. A root-cause review turned up five independent issues; this PR fixes all five (and slows the schedule), which is enough to turn the workflow green.

Local validator result after the fix:

Summary: 512 OK, 10 stale, 6 dead, 13 empty, 1 skipped
EXIT=0

(Was: 489 OK, 8 stale, 32 dead, 12 empty, 1 skipped, EXIT=1)

Changes

  1. processEntities: false on the XML parser. fast-xml-parser v5's default entity-expansion limit was tripping on legitimate large feeds (Guardian, Fox, Axios, CISA, WHO, MIT, Defense One, Folha, El País, iefimerida, GitHub Trending, Dev.to, Oryx OSINT). We only extract date strings and never decode entity-bearing content, so disabling entity processing is safe. Recovers 17 false-positive DEAD rows.

  2. 11 hosts added to the allowlist mirror set: abcnews.go.com + abcnews.com (covers the feeds.abcnews.com two-hop redirect chain), www.corriere.it, www.rt.com, www.alarabiya.net, tuoitrenews.vn, www.yonhapnewstv.co.kr, www.chosun.com, rss.libsyn.com, feeds.megaphone.fm, rss.art19.com. Synced across all 5 mirrors (shared/rss-allowed-domains.{json,cjs}, scripts/shared/rss-allowed-domains.json, api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). The same allowlist gates the prod Edge rss-proxy, so this also silently restores these feeds for live users.

  3. BBC Persian — declared as plaintext http://feeds.bbci.co.uk/persian/tv-and-radio-37434376/rss.xml, rejected by the --ci https-only guard. Switched to canonical https://feeds.bbci.co.uk/persian/rss.xml. Server-side mirror already had this URL — fixing client-side drift.

  4. Tom's Hardware: /feeds/all 301s to http:// on the same host, tripping the per-hop https guard. The canonical https path is /feeds.xml. Updated client + server mirrors.

  5. Exit-policy split — was: hard-fail on any STALE or DEAD row. Now:

    • HARD-FAIL on config/SSRF-guard drift (Host not in allowlist, Non-https scheme rejected, Too many redirects) — these are bugs the maintainer can fix; staying loud catches future drift.
    • SOFT-FAIL (exit 0 with warning) on third-party 4xx/timeouts/STALE/EMPTY — these rot naturally and a failing build produces no signal because no one acts on it.
  6. Cron 6h → daily — 542 feeds × 4 runs/day was 4× the necessary discovery rate; feed outages don't change that fast and the prior workflow proved no one acts on the noise.
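
The exit-policy split in item 5 can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/validate-rss-feeds.mjs: the result shape (`{ name, status, detail }`) and the helper name `exitCodeFor` are assumptions, and only the three drift messages are taken from the description above.

```javascript
// Hypothetical sketch of the hard-fail / soft-fail split.
// The three messages come from the PR description; everything else
// (result shape, helper names) is assumed for illustration.
const DRIFT_PREFIXES = [
  'Host not in allowlist',
  'Non-https scheme rejected',
  'Too many redirects',
];

// A DEAD row counts as config drift only when its detail starts with
// one of the SSRF-guard messages.
function isConfigDrift(result) {
  return (
    result.status === 'DEAD' &&
    typeof result.detail === 'string' &&
    DRIFT_PREFIXES.some((prefix) => result.detail.startsWith(prefix))
  );
}

// Exit 1 only on config drift; every other non-OK row is a warning.
function exitCodeFor(results) {
  const configDrift = results.filter(isConfigDrift);
  const warnings = results.filter(
    (r) => r.status !== 'OK' && r.status !== 'SKIPPED' && !isConfigDrift(r)
  );
  return { code: configDrift.length > 0 ? 1 : 0, configDrift, warnings };
}

const sample = [
  { name: 'Feed A', status: 'OK' },
  { name: 'Feed B', status: 'STALE', detail: 'newest item older than 30 days' },
  { name: 'Feed C', status: 'DEAD', detail: 'HTTP 404' },
];
console.log(exitCodeFor(sample).code); // 0: third-party rot only warns
```

Under these assumptions, adding a row such as `{ status: 'DEAD', detail: 'Host not in allowlist: example.org' }` would flip the exit code to 1 while the stale and 404 rows stay warnings.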

Remaining warnings (informational, not failing)

After the fix, these are still flagged but don't block:

  • 6 DEAD: Brasil Paralelo 404, EIA Reports 404 (duplicate entry — same URL listed twice as EIA Reports and EIA Press Room), News24 403, Tuoi Tre + Al Arabiya unreachable from this network.
  • 10 STALE (>30 days): Ynetnews (2024-03!), Corriere (2024-05), Oryx, Jerusalem Post (2025-06), FAS, Pentagon, DigiChina, Y Combinator Blog, RAND.
  • 13 EMPTY: ArXiv AI/ML, IAEA, CrisisWatch, Bild, Asharq Business, France 24 [ar], Greater Good Berkeley, Primicias, GitHub Trending, DL News, Wu Blockchain, Trade & Tariffs — most parse fine but parseNewestDate() doesn't walk every date shape.

These warrant a follow-up registry-grooming PR but don't gate this fix.
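
For readers unfamiliar with the OK / STALE / EMPTY buckets, the classification can be approximated as below. This is a sketch under assumptions: `classifyFreshness` is a hypothetical name, it takes date strings that have already been extracted from a feed, and it is not the validator's parseNewestDate(). Only the 30-day threshold comes from this PR.

```javascript
// Hypothetical freshness classifier. The 30-day threshold is from the PR;
// the function name and input shape are assumptions.
const STALE_AFTER_MS = 30 * 24 * 60 * 60 * 1000;

function classifyFreshness(dateStrings, now = Date.now()) {
  // Keep only strings that Date.parse understands.
  const timestamps = dateStrings
    .map((s) => Date.parse(s))
    .filter((t) => !Number.isNaN(t));

  // No parseable date at all: the feed shows up as EMPTY, even if the
  // document itself parsed fine.
  if (timestamps.length === 0) return 'EMPTY';

  const newest = Math.max(...timestamps);
  return now - newest > STALE_AFTER_MS ? 'STALE' : 'OK';
}
```

This also shows why a feed can land in EMPTY without being broken: if the date extractor does not recognise the feed's date fields, the list of candidates is empty and the age check never runs.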

Test plan

  • npm run typecheck — green
  • npm run test:data — green (covers feeds-client-server-parity, feed-resolution, feeds-time-gate, 5-mirror allowlist parity)
  • node scripts/validate-rss-feeds.mjs --ci locally — EXIT=0 with the new policy
  • Watch the post-merge Feed Validation workflow run on main turn green for the first time
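
The 5-mirror allowlist parity that test:data covers can be pictured with a sketch like the one below. It is an illustration only: `findMirrorDrift` is a hypothetical helper, the mirrors are passed in as plain arrays rather than loaded from the real files, and the actual test may work differently.

```javascript
// Hypothetical parity check: report, per mirror, the hosts that appear in
// some other mirror but are missing from this one.
function findMirrorDrift(mirrors) {
  const union = new Set(Object.values(mirrors).flat());
  const drift = {};
  for (const [name, hosts] of Object.entries(mirrors)) {
    const have = new Set(hosts);
    const missing = [...union].filter((host) => !have.has(host)).sort();
    if (missing.length > 0) drift[name] = missing;
  }
  return drift;
}

// Example input: mirror names are from the PR, contents are made up.
const exampleMirrors = {
  'shared/rss-allowed-domains.json': ['abcnews.com', 'abcnews.go.com'],
  'scripts/shared/rss-allowed-domains.json': ['abcnews.com'],
  'api/_rss-allowed-domains.js': ['abcnews.com', 'abcnews.go.com'],
};
console.log(findMirrorDrift(exampleMirrors));
```

An empty object means every mirror lists the same hosts; any key in the result names a mirror that needs updating.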

The Feed Validation workflow (.github/workflows/feed-validation.yml) has
been failing 100% of runs since it landed in PR #3872 — every push to
main + every scheduled 6h run. Five root causes, all addressed here:

1. fast-xml-parser default entity-expansion limit was tripping on
   legitimate large feeds (Guardian, Fox, Axios, CISA, WHO, MIT,
   Defense One, Folha, El País, iefimerida, GitHub Trending,
   Dev.to, Oryx OSINT, …). We only read date strings from the
   parsed doc, so processEntities:false is safe and recovers all
   17 false-positive DEAD rows.

2. 11 hosts referenced from src/config/feeds.ts were absent from
   the 5-file allowlist mirror set (shared/rss-allowed-domains.{json,cjs},
   scripts/shared/rss-allowed-domains.json, api/_rss-allowed-domains.js,
   vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). Added: abcnews.go.com +
   abcnews.com (feeds.abcnews.com → abcnews.go.com → abcnews.com
   two-hop chain), www.corriere.it, www.rt.com, www.alarabiya.net,
   tuoitrenews.vn, www.yonhapnewstv.co.kr, www.chosun.com,
   rss.libsyn.com, feeds.megaphone.fm, rss.art19.com. The same
   allowlist gates the prod Edge rss-proxy, so this also silently
   restores access to these feeds for live users.

3. BBC Persian was declared as plaintext http://, rejected by the
   --ci https-only guard. Updated to the canonical
   https://feeds.bbci.co.uk/persian/rss.xml (server-side mirror
   already had this).

4. Tom's Hardware /feeds/all redirects to http://… on the same
   host, tripping the per-hop https guard. The canonical https
   path is /feeds.xml — switched both client and server mirrors.

5. Validator was hard-failing on any STALE-or-DEAD row, which made
   the workflow noise floor unbearable: 8 stale + 32 dead = 40
   reasons to be red, of which only 10 were actionable. Split the
   exit policy: HARD-FAIL on config/SSRF-guard drift (allowlist
   miss, plaintext URL, redirect loop) so future drift is loud,
   SOFT-FAIL (exit 0 with warn) on third-party 4xx/timeouts/STALE
   so a feed disappearing upstream doesn't page anyone. Promoting
   third-party failures to hard-fail can wait for a registry
   grooming PR.

Also bumps the scheduled cadence from every-6h to daily-00:00-UTC.
4× the discovery rate added zero value — feed outages don't change
faster than once-a-day, and 542 feeds × 4 runs/day was wasted
runner-minutes and third-party fetch volume.

Local validator result (after the fix):
  Summary: 512 OK, 10 stale, 6 dead, 13 empty, 1 skipped
  Exit: 0 (no config drift). 6 remaining DEAD are all genuine
  third-party state (Brasil Paralelo 404, EIA Reports 404 [duplicate
  entry], News24 403, Tuoi Tre + Al Arabiya unreachable from this
  network) — candidates for a future registry-cleanup PR.

Test coverage: tests/feeds-client-server-parity.test.mjs,
tests/feed-resolution.test.mts, tests/feeds-time-gate.test.mts —
all green. Full test:data suite — green.

vercel Bot commented May 24, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| worldmonitor | Ready | Preview, Comment | May 24, 2026 2:32pm |

Contributor

greptile-apps Bot commented May 24, 2026

Greptile Summary

This PR fixes five independent root causes that were causing the Feed Validation workflow to fail 100% since its launch: disabling entity expansion in the XML parser (recovers 17 false-positive DEAD rows), adding 11 missing redirect-chain domains to all allowlist mirrors, correcting two feed URLs (BBC Persian http→https, Tom's Hardware canonical path), and splitting the exit policy so only SSRF/config-guard violations are hard failures while third-party rot is downgraded to a warning.

  • Exit-policy split (scripts/validate-rss-feeds.mjs): DEAD results are now partitioned into configDrift (hard-fail) vs. thirdPartyDead (soft-fail). isConfigDrift relies on prefix-matching against literal error message strings; any future rewording silently reclassifies hard failures as warnings.
  • Allowlist mirrors: all five data stores updated consistently; shared/rss-allowed-domains.cjs is a thin require() wrapper so it tracks automatically.
  • Workflow cadence: schedule reduced from every-6h to daily, cutting 542-feed runs from 4x/day to 1x/day.

Confidence Score: 4/5

Safe to merge; the allowlist and URL fixes are correct and the exit-policy logic is sound. One error message omits a required mirror from its fix instructions, which would leave a developer chasing a secondary test failure.

The core logic changes are well-reasoned and the five root-cause fixes are each independently verifiable. The only real defect is in the FAIL message: it tells developers to update four mirrors but the codebase enforces five. A developer acting on that message would leave the scripts mirror stale and hit a confusing secondary failure.

scripts/validate-rss-feeds.mjs — the isConfigDrift predicate and the FAIL error message both deserve a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/validate-rss-feeds.mjs | Exit-policy split (hard vs soft fail), processEntities fix, and isConfigDrift predicate. The FAIL message omits scripts/shared/rss-allowed-domains.json from its mirror checklist, misleading developers who try to fix config drift. |
| .github/workflows/feed-validation.yml | Cron reduced from every-6h to daily (00:00 UTC). Straightforward, well-justified change. |
| shared/rss-allowed-domains.json | 11 new domains added for ABC News redirect chain and podcast CDNs. Matches scripts/shared/ and api/ mirrors. |
| api/_rss-allowed-domains.js | Same 11 domains appended to the Edge-function mirror. In sync with shared/rss-allowed-domains.json. |
| vite.config.ts | 11 new domains inserted into RSS_PROXY_ALLOWED_DOMAINS Set, matching the other mirrors. |
| src/config/feeds.ts | BBC Persian switched from http to canonical https URL; Tom's Hardware updated to /feeds.xml. |
| server/worldmonitor/news/v1/_feeds.ts | Tom's Hardware server-side mirror updated to /feeds.xml, keeping client-server parity. |
| scripts/shared/rss-allowed-domains.json | Identical change to shared/rss-allowed-domains.json, required for the scripts/ mirror parity verified by test:data. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[validate-rss-feeds.mjs --ci] --> B[extractFeeds from feeds.ts]
    B --> C[runBatch: fetchFeed x 542 feeds]
    C --> D{CI mode per-hop redirect check}
    D -->|assertCiAllowed pass| E[fetch with redirect:manual]
    D -->|Non-https scheme| F[DEAD: Non-https scheme rejected]
    D -->|Host not in allowlist| G[DEAD: Host not in allowlist]
    E -->|3xx| H{hop < MAX_HOPS=3?}
    H -->|yes| D
    H -->|no| I[DEAD: Too many redirects]
    E -->|200 OK| J[parseNewestDate processEntities:false]
    E -->|4xx/5xx/timeout| K[DEAD: HTTP N / Timeout]
    J -->|no dates| L[EMPTY]
    J -->|age > 30d| M[STALE]
    J -->|age le 30d| N[OK]
    F & G & I --> O{isConfigDrift?}
    K --> O
    O -->|yes| P[configDrift list]
    O -->|no| Q[thirdPartyDead list]
    P --> R{configDrift.length > 0?}
    R -->|yes| S[console.error + EXIT 1 HARD FAIL]
    R -->|no| T{stale/thirdPartyDead/empty?}
    Q & M & L --> T
    T -->|yes| U[console.warn + EXIT 0 SOFT FAIL]
    T -->|no| V[EXIT 0 all green]

Reviews (1): Last reviewed commit: "fix(feeds): unblock feed-validation work..."
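
The per-hop check in the flowchart above can be sketched without any network access by walking a known redirect chain. This is an illustration under assumptions: `assertHopAllowed` and `checkRedirectChain` are hypothetical names (the real script uses assertCiAllowed and fetchFeed), the exact hop-counting rule is a guess at what MAX_HOPS=3 means, and the `/feed` paths are made up. The hostnames and error messages come from this PR.

```javascript
// Hypothetical per-hop guard: every URL in a redirect chain must be https
// and must have its host in the allowlist.
const MAX_HOPS = 3; // value from the flowchart; counting rule is assumed

function assertHopAllowed(rawUrl, allowlist) {
  const url = new URL(rawUrl); // throws on a malformed URL
  if (url.protocol !== 'https:') {
    throw new Error(`Non-https scheme rejected: ${rawUrl}`);
  }
  if (!allowlist.has(url.hostname)) {
    throw new Error(`Host not in allowlist: ${url.hostname}`);
  }
}

// chain[0] is the declared feed URL; each later entry is a redirect target.
function checkRedirectChain(chain, allowlist) {
  for (let hop = 0; hop < chain.length; hop++) {
    if (hop > MAX_HOPS) throw new Error('Too many redirects');
    assertHopAllowed(chain[hop], allowlist);
  }
  return chain[chain.length - 1];
}

const abcChain = [
  'https://feeds.abcnews.com/feed',
  'https://abcnews.go.com/feed',
  'https://abcnews.com/feed',
];
const fullAllowlist = new Set(['feeds.abcnews.com', 'abcnews.go.com', 'abcnews.com']);
console.log(checkRedirectChain(abcChain, fullAllowlist));
```

With `abcnews.go.com` removed from the set, the same chain fails on the middle hop, which is the shape of failure the allowlist additions in this PR address.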

Comment thread scripts/validate-rss-feeds.mjs Outdated
Comment on lines +321 to +323
`(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 4 ` +
`allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` +
`api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).`
Contributor

P1 The FAIL message tells developers to update "4 allowlist mirrors" but omits scripts/shared/rss-allowed-domains.json. A developer following these instructions would skip that mirror, causing the test:data scripts/shared parity check to fail with a separate, puzzling error unrelated to the config-drift message they just read.

Suggested change
- `(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 4 ` +
- `allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` +
- `api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).`
+ `(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 5 ` +
+ `allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` +
+ `scripts/shared/rss-allowed-domains.json, ` +
+ `api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).`

Owner Author

Fixed in 8a34058 — listed all 5 mirrors in the FAIL message (added the missing scripts/shared/rss-allowed-domains.json).

Comment thread scripts/validate-rss-feeds.mjs Outdated
Comment on lines +309 to +314
const isConfigDrift = (r) =>
  typeof r.detail === 'string' && (
    r.detail.startsWith('Host not in allowlist') ||
    r.detail.startsWith('Non-https scheme rejected') ||
    r.detail === 'Too many redirects'
  );
Contributor

P2 isConfigDrift coupling to error message strings

isConfigDrift matches on the literal text of err.message strings thrown by assertCiAllowed and fetchFeed. If either message is ever edited (e.g., to drop the (--ci) annotation or reword the redirect error), the predicate silently stops classifying those cases as config drift, and what should be a hard fail becomes a soft-fail warning instead. Consider centralising these sentinel strings as exported constants shared between the throwing and the matching sites.

Owner Author

Fixed in 8a34058 — centralised the SSRF-guard sentinel strings as a frozen CONFIG_DRIFT_REASONS object at the top of the file. Both the throwing sites (assertCiAllowed, fetchFeed) and the isConfigDrift classifier now consume the same constants, so a future reword moves both in lockstep. Also picked up Invalid URL as a 4th config-drift reason (was previously slipping through to soft-fail despite being an actor-fixable bug in feeds.ts).
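
The centralisation described in this reply could look roughly like the sketch below. It is not the code from 8a34058: the key names other than INVALID_URL, the `rejectHost` helper, and the `isConfigDriftDetail` name are assumptions, while the four message strings are the ones quoted in this thread.

```javascript
// Hypothetical shape of the shared sentinel object. Freezing it prevents
// accidental mutation at runtime.
const CONFIG_DRIFT_REASONS = Object.freeze({
  HOST_NOT_ALLOWED: 'Host not in allowlist',
  NON_HTTPS: 'Non-https scheme rejected',
  TOO_MANY_REDIRECTS: 'Too many redirects',
  INVALID_URL: 'Invalid URL',
});

// Throwing site: builds its message from the constant, never a literal.
function rejectHost(hostname) {
  throw new Error(`${CONFIG_DRIFT_REASONS.HOST_NOT_ALLOWED}: ${hostname}`);
}

// Classifier: matches against the same constants, so rewording a reason
// changes both sides at once.
const DRIFT_REASON_VALUES = Object.values(CONFIG_DRIFT_REASONS);
const isConfigDriftDetail = (detail) =>
  typeof detail === 'string' &&
  DRIFT_REASON_VALUES.some((reason) => detail.startsWith(reason));
```

The design point is that no message literal appears twice: the classifier cannot drift out of sync with the throwing sites because neither one owns the wording.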

koala73 added 2 commits May 24, 2026 18:15
P1: FAIL message claimed "4 allowlist mirrors" but the codebase enforces 5
(scripts/shared/rss-allowed-domains.json was missing from the list).
A developer following the message would skip that mirror and hit a
puzzling secondary failure from the `test:data` scripts/shared parity
check. Listed all 5 mirrors.

P2: The isConfigDrift predicate was prefix-matching against literal
copies of the error message strings thrown by assertCiAllowed and
fetchFeed. A future reword at either throwing site (e.g. dropping the
"(--ci)" annotation, rewording the redirect error) would silently
reclassify config drift as third-party rot, demoting a hard fail to a
soft warning. Centralised the sentinel strings as a frozen
CONFIG_DRIFT_REASONS object that both the throwing sites and the
classifier consume — rename a reason and BOTH consumers move in
lockstep. INVALID_URL is also now properly classified as config drift
(was previously falling through to soft-fail despite being an actor-
fixable bug in feeds.ts).

Tested:
  - End-to-end run: 512 OK / 9 stale / 7 dead / 13 empty, EXIT=0
  - Classifier unit test: all 8 representative cases correct
    (4 config-drift reasons → true, 4 third-party reasons → false)

Nature publishes a session/IP-conditional redirect chain on
feeds.nature.com — on some networks the request lands directly at
www.nature.com/nature.rss (both already in the allowlist), but on
others (apparently GitHub Actions runner IPs) Nature inserts an
idp.nature.com SSO/identity-provider hop:

  feeds.nature.com → idp.nature.com → www.nature.com/nature.rss

The validator's per-hop allowlist re-check fails on idp.nature.com.
Adding it to all 4 hand-maintained mirrors (+ the .cjs that auto-syncs
via require) closes the gap.

Same shape as the abcnews.go.com fix on the original PR — the lesson is
that allowlist audits done from a developer laptop can miss intermediate
redirect hops that only appear under different network egress paths.
Documented in worldmonitor-architecture-gotchas/.../multi-hop-redirect-
chain-needs-every-host-in-allowlist.md (skill added in PR #3885 first
round).

Also addresses the reviewer's second P1 finding (Invalid URL being
soft-fail instead of hard-fail): already fixed in 8a34058 — the
reviewer's audit was against the pre-Greptile commit cbb80e1.
INVALID_URL is now in CONFIG_DRIFT_REASONS and isConfigDrift hard-fails
malformed registry entries.
@koala73 koala73 merged commit c2c280c into main May 24, 2026
12 checks passed
@koala73 koala73 deleted the fix/feed-validation-unblock branch May 24, 2026 14:44