Skip to content

Commit c2c280c

Browse files
authored
fix(feeds): unblock feed-validation workflow (split SSRF-drift from third-party state) (#3885)
* fix(feeds): unblock feed-validation workflow The Feed Validation workflow (.github/workflows/feed-validation.yml) has been failing 100% of runs since it landed in PR #3872 — every push to main + every scheduled 6h run. Five root causes, all addressed here: 1. fast-xml-parser default entity-expansion limit was tripping on legitimate large feeds (Guardian, Fox, Axios, CISA, WHO, MIT, Defense One, Folha, El País, iefimerida, GitHub Trending, Dev.to, Oryx OSINT, …). We only read date strings from the parsed doc, so processEntities:false is safe and recovers all 17 false-positive DEAD rows. 2. 10 hosts referenced from src/config/feeds.ts were absent from the 5-file allowlist mirror set (shared/rss-allowed-domains.{json,cjs}, scripts/shared/rss-allowed-domains.json, api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS). Added: abcnews.go.com + abcnews.com (feeds.abcnews.com → abcnews.go.com → abcnews.com two-hop chain), www.corriere.it, www.rt.com, www.alarabiya.net, tuoitrenews.vn, www.yonhapnewstv.co.kr, www.chosun.com, rss.libsyn.com, feeds.megaphone.fm, rss.art19.com. The same allowlist gates the prod Edge rss-proxy, so this also silently restores access to these feeds for live users. 3. BBC Persian was declared as plaintext http://, rejected by the --ci https-only guard. Updated to the canonical https://feeds.bbci.co.uk/persian/rss.xml (server-side mirror already had this). 4. Tom's Hardware /feeds/all redirects to http://… on the same host, tripping the per-hop https guard. The canonical https path is /feeds.xml — switched both client and server mirrors. 5. Validator was hard-failing on any STALE-or-DEAD row, which made the workflow noise floor unbearable: 8 stale + 32 dead = 40 reasons to be red, of which only 10 were actionable. Split the exit policy: HARD-FAIL on config/SSRF-guard drift (allowlist miss, plaintext URL, redirect loop) so future drift is loud, SOFT-FAIL (exit 0 with warn) on third-party 4xx/timeouts/STALE so a feed disappearing upstream doesn't page anyone. Promoting third-party failures to hard-fail can wait for a registry grooming PR. Also bumps the scheduled cadence from every-6h to daily-00:00-UTC. 4× the discovery rate added zero value — feed outages don't change faster than once-a-day, and 542 feeds × 4 runs/day was wasted runner-minutes and third-party fetch volume. Local validator result (after the fix): Summary: 512 OK, 10 stale, 6 dead, 13 empty, 1 skipped Exit: 0 (no config drift). 6 remaining DEAD are all genuine third-party state (Brasil Paralelo 404, EIA Reports 404 [duplicate entry], News24 403, Tuoi Tre + Al Arabiya unreachable from this network) — candidates for a future registry-cleanup PR. Test coverage: tests/feeds-client-server-parity.test.mjs, tests/feed-resolution.test.mts, tests/feeds-time-gate.test.mts — all green. Full test:data suite — green. * fix(feeds): address Greptile P1 + P2 on validate-rss-feeds.mjs P1: FAIL message claimed "4 allowlist mirrors" but the codebase enforces 5 (scripts/shared/rss-allowed-domains.json was missing from the list). A developer following the message would skip that mirror and hit a puzzling secondary failure from the `test:data` scripts/shared parity check. Listed all 5 mirrors. P2: The isConfigDrift predicate was prefix-matching against literal copies of the error message strings thrown by assertCiAllowed and fetchFeed. A future reword at either throwing site (e.g. dropping the "(--ci)" annotation, rewording the redirect error) would silently reclassify config drift as third-party rot, demoting a hard fail to a soft warning. Centralised the sentinel strings as a frozen CONFIG_DRIFT_REASONS object that both the throwing sites and the classifier consume — rename a reason and BOTH consumers move in lockstep. INVALID_URL is also now properly classified as config drift (was previously falling through to soft-fail despite being an actor- fixable bug in feeds.ts). Tested: - End-to-end run: 512 OK / 9 stale / 7 dead / 13 empty, EXIT=0 - Classifier unit test: all 8 representative cases correct (4 config-drift reasons → true, 4 third-party reasons → false) * fix(feeds): allowlist idp.nature.com (Nature SSO redirect hop) Nature publishes a session/IP-conditional redirect chain on feeds.nature.com — on some networks the request lands directly at www.nature.com/nature.rss (both already in the allowlist), but on others (apparently GitHub Actions runner IPs) Nature inserts an idp.nature.com SSO/identity-provider hop: feeds.nature.com → idp.nature.com → www.nature.com/nature.rss The validator's per-hop allowlist re-check fails on idp.nature.com. Adding it to all 4 hand-maintained mirrors (+ the .cjs that auto-syncs via require) closes the gap. Same shape as the abcnews.go.com fix on the original PR — the lesson is that allowlist audits done from a developer laptop can miss intermediate redirect hops that only appear under different network egress paths. Documented in worldmonitor-architecture-gotchas/.../multi-hop-redirect- chain-needs-every-host-in-allowlist.md (skill added in PR #3885 first round). Also addresses the reviewer's second P1 finding (Invalid URL being soft-fail instead of hard-fail): already fixed in 8a34058 — the reviewer's audit was against the pre-Greptile commit cbb80e1. INVALID_URL is now in CONFIG_DRIFT_REASONS and isConfigDrift hard-fails malformed registry entries.
1 parent 3735d77 commit c2c280c

8 files changed

Lines changed: 106 additions & 15 deletions

File tree

.github/workflows/feed-validation.yml

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,11 @@ name: Feed Validation
77
#
88
# Triggers:
99
# - push to main: catches drift introduced by merged registry edits
10-
# - schedule (every 6h): catches third-party feed outages on a cadence
11-
# operators can act on without staring at PR checks
10+
# - schedule (daily 00:00 UTC): catches third-party feed outages on a cadence
11+
# operators can act on without staring at PR checks. Earlier 6h cadence
12+
# was 4× the necessary discovery rate — feed outages don't change that
13+
# fast and 542 feeds × 4 runs/day was wasted runner-minutes + third-
14+
# party-fetch volume that no one acted on.
1215
# - workflow_dispatch: manual re-runs from the Actions UI
1316
#
1417
# The --ci flag enforces three guardrails inside scripts/validate-rss-feeds.mjs:
@@ -27,7 +30,7 @@ on:
2730
- 'shared/rss-allowed-domains.json'
2831
- '.github/workflows/feed-validation.yml'
2932
schedule:
30-
- cron: '0 */6 * * *'
33+
- cron: '0 0 * * *'
3134
workflow_dispatch:
3235

3336
permissions:

api/_rss-allowed-domains.js

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -311,5 +311,17 @@ export default [
311311
"hirado.hu",
312312
"portfolio.hu",
313313
"www.portfolio.hu",
314-
"www.atv.hu"
314+
"www.atv.hu",
315+
"abcnews.go.com",
316+
"abcnews.com",
317+
"www.corriere.it",
318+
"www.rt.com",
319+
"www.alarabiya.net",
320+
"tuoitrenews.vn",
321+
"www.yonhapnewstv.co.kr",
322+
"www.chosun.com",
323+
"rss.libsyn.com",
324+
"feeds.megaphone.fm",
325+
"rss.art19.com",
326+
"idp.nature.com"
315327
];

scripts/shared/rss-allowed-domains.json

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -308,5 +308,17 @@
308308
"hirado.hu",
309309
"portfolio.hu",
310310
"www.portfolio.hu",
311-
"www.atv.hu"
311+
"www.atv.hu",
312+
"abcnews.go.com",
313+
"abcnews.com",
314+
"www.corriere.it",
315+
"www.rt.com",
316+
"www.alarabiya.net",
317+
"tuoitrenews.vn",
318+
"www.yonhapnewstv.co.kr",
319+
"www.chosun.com",
320+
"rss.libsyn.com",
321+
"feeds.megaphone.fm",
322+
"rss.art19.com",
323+
"idp.nature.com"
312324
]

scripts/validate-rss-feeds.mjs

Lines changed: 54 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,18 @@ const FETCH_TIMEOUT = 15_000;
1414
const CONCURRENCY = 10;
1515
const STALE_DAYS = 30;
1616

17+
// Sentinel error-message prefixes for the SSRF/config guardrails. Centralised so
18+
// the throwing sites (assertCiAllowed, fetchFeed) and the isConfigDrift
19+
// classifier can never drift apart — rename a reason, BOTH consumers update in
20+
// lockstep. Without this, an innocuous reword (e.g. dropping `(--ci)`) would
21+
// silently reclassify hard failures as soft warnings.
22+
const CONFIG_DRIFT_REASONS = Object.freeze({
23+
INVALID_URL: 'Invalid URL',
24+
NON_HTTPS: 'Non-https scheme rejected in --ci mode:',
25+
HOST_NOT_ALLOWED: 'Host not in allowlist (--ci):',
26+
TOO_MANY_REDIRECTS: 'Too many redirects',
27+
});
28+
1729
// --ci flag hardens the validator for trusted-context CI runs (push-to-main
1830
// + schedule workflow). NOT enabled in PR CI — PR CI never runs this script
1931
// because PR contributors can rewrite feeds.ts to make GitHub runners hit
@@ -95,13 +107,13 @@ function assertCiAllowed(rawUrl) {
95107
try {
96108
parsed = new URL(rawUrl);
97109
} catch {
98-
throw new Error('Invalid URL');
110+
throw new Error(CONFIG_DRIFT_REASONS.INVALID_URL);
99111
}
100112
if (parsed.protocol !== 'https:') {
101-
throw new Error(`Non-https scheme rejected in --ci mode: ${parsed.protocol}`);
113+
throw new Error(`${CONFIG_DRIFT_REASONS.NON_HTTPS} ${parsed.protocol}`);
102114
}
103115
if (!isAllowedDomain(parsed.hostname)) {
104-
throw new Error(`Host not in allowlist (--ci): ${parsed.hostname}`);
116+
throw new Error(`${CONFIG_DRIFT_REASONS.HOST_NOT_ALLOWED} ${parsed.hostname}`);
105117
}
106118
return parsed;
107119
}
@@ -144,7 +156,7 @@ async function fetchFeed(url) {
144156
// the headers handshake is what we wanted bounded per hop.
145157
return await resp.text();
146158
}
147-
throw new Error('Too many redirects');
159+
throw new Error(CONFIG_DRIFT_REASONS.TOO_MANY_REDIRECTS);
148160
}
149161
const controller = new AbortController();
150162
const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT);
@@ -162,7 +174,10 @@ async function fetchFeed(url) {
162174
}
163175

164176
function parseNewestDate(xml) {
165-
const parser = new XMLParser({ ignoreAttributes: false });
177+
// processEntities:false — we only read date strings, never decode entity-bearing content.
178+
// fast-xml-parser v5's default entity-expansion threshold trips on legit large feeds
179+
// (Guardian, Fox, Axios, CISA, WHO, MIT, …) and produces false-positive DEAD rows.
180+
const parser = new XMLParser({ ignoreAttributes: false, processEntities: false });
166181
const doc = parser.parse(xml);
167182

168183
const dates = [];
@@ -295,7 +310,40 @@ async function main() {
295310
console.log(`Summary: ${ok.length} OK, ${stale.length} stale, ${dead.length} dead, ${empty.length} empty` +
296311
(skipped.length ? `, ${skipped.length} skipped` : ''));
297312

298-
if (stale.length || dead.length) process.exit(1);
313+
// Exit policy:
314+
// HARD-FAIL on config/SSRF-guard drift — these are bugs the maintainer can fix.
315+
// Reasons enumerated in CONFIG_DRIFT_REASONS (top of file). Both the throwing
316+
// sites and this classifier consume the same constants so a future reword
317+
// can't silently demote a hard fail to a warning.
318+
// SOFT-FAIL (exit 0 with warning) on third-party state — third-party 4xx/timeouts,
319+
// STALE feeds, EMPTY feeds. These rot naturally; failing the build on them
320+
// produces 100% CI noise and the prior workflow proved no one acts on it.
321+
// Promoting third-party failures to hard-fail requires a registry-cleanup PR
322+
// first; revisit once the long tail is groomed.
323+
const isConfigDrift = (r) =>
324+
typeof r.detail === 'string' &&
325+
Object.values(CONFIG_DRIFT_REASONS).some(prefix => r.detail.startsWith(prefix));
326+
const configDrift = dead.filter(isConfigDrift);
327+
const thirdPartyDead = dead.filter(r => !isConfigDrift(r));
328+
329+
if (configDrift.length) {
330+
console.error(
331+
`\nFAIL: ${configDrift.length} feed(s) violate the CI guardrails ` +
332+
`(allowlist drift or plaintext URL). Fix src/config/feeds.ts and/or the 5 ` +
333+
`allowlist mirrors (shared/rss-allowed-domains.json, .cjs, ` +
334+
`scripts/shared/rss-allowed-domains.json, ` +
335+
`api/_rss-allowed-domains.js, vite.config.ts:RSS_PROXY_ALLOWED_DOMAINS).`
336+
);
337+
process.exit(1);
338+
}
339+
340+
if (stale.length || thirdPartyDead.length || empty.length) {
341+
console.warn(
342+
`\nWARN: ${thirdPartyDead.length} third-party dead, ${stale.length} stale, ` +
343+
`${empty.length} empty. Third-party state — not a build failure. ` +
344+
`Groom src/config/feeds.ts when the count crosses a threshold worth a PR.`
345+
);
346+
}
299347
}
300348

301349
main().catch(err => {

server/worldmonitor/news/v1/_feeds.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -246,7 +246,7 @@ export const VARIANT_FEEDS: Record<string, Record<string, ServerFeed[]>> = {
246246
{ name: 'Product Hunt', url: 'https://www.producthunt.com/feed' },
247247
],
248248
hardware: [
249-
{ name: "Tom's Hardware", url: 'https://www.tomshardware.com/feeds/all' },
249+
{ name: "Tom's Hardware", url: 'https://www.tomshardware.com/feeds.xml' },
250250
{ name: 'SemiAnalysis', url: 'https://www.semianalysis.com/feed' },
251251
{ name: 'Semiconductor News', url: gn('semiconductor OR chip OR TSMC OR NVIDIA OR Intel when:3d') },
252252
],

shared/rss-allowed-domains.json

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -308,5 +308,17 @@
308308
"hirado.hu",
309309
"portfolio.hu",
310310
"www.portfolio.hu",
311-
"www.atv.hu"
311+
"www.atv.hu",
312+
"abcnews.go.com",
313+
"abcnews.com",
314+
"www.corriere.it",
315+
"www.rt.com",
316+
"www.alarabiya.net",
317+
"tuoitrenews.vn",
318+
"www.yonhapnewstv.co.kr",
319+
"www.chosun.com",
320+
"rss.libsyn.com",
321+
"feeds.megaphone.fm",
322+
"rss.art19.com",
323+
"idp.nature.com"
312324
]

src/config/feeds.ts

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -295,7 +295,7 @@ const FULL_FEEDS: Record<string, Feed[]> = {
295295
{ name: 'Al Arabiya', url: { en: rss('https://news.google.com/rss/search?q=site:english.alarabiya.net+when:2d&hl=en-US&gl=US&ceid=US:en'), ar: rss('https://www.alarabiya.net/tools/mrss/?cat=main') } },
296296
// Arab News and Times of Israel removed — 403 from cloud IPs
297297
{ name: 'Guardian ME', url: rss('https://www.theguardian.com/world/middleeast/rss') },
298-
{ name: 'BBC Persian', url: rss('http://feeds.bbci.co.uk/persian/tv-and-radio-37434376/rss.xml') },
298+
{ name: 'BBC Persian', url: rss('https://feeds.bbci.co.uk/persian/rss.xml') },
299299
{ name: 'Iran International', url: rss('https://news.google.com/rss/search?q=site:iranintl.com+when:2d&hl=en-US&gl=US&ceid=US:en') },
300300
{ name: 'Fars News', url: rss('https://news.google.com/rss/search?q=site:farsnews.ir+when:2d&hl=en-US&gl=US&ceid=US:en') },
301301
{ name: 'IRNA', url: rss('https://en.irna.ir/rss') },
@@ -623,7 +623,7 @@ const TECH_FEEDS: Record<string, Feed[]> = {
623623
{ name: 'Seeking Alpha Tech', url: rss('https://seekingalpha.com/market_currents.xml') },
624624
],
625625
hardware: [
626-
{ name: "Tom's Hardware", url: rss('https://www.tomshardware.com/feeds/all') },
626+
{ name: "Tom's Hardware", url: rss('https://www.tomshardware.com/feeds.xml') },
627627
{ name: 'SemiAnalysis', url: rss('https://news.google.com/rss/search?q=site:semianalysis.com+when:7d&hl=en-US&gl=US&ceid=US:en') },
628628
{ name: 'Semiconductor News', url: rss('https://news.google.com/rss/search?q=semiconductor+OR+chip+OR+TSMC+OR+NVIDIA+OR+Intel+when:3d&hl=en-US&gl=US&ceid=US:en') },
629629
],

vite.config.ts

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -573,6 +573,10 @@ const RSS_PROXY_ALLOWED_DOMAINS = new Set([
573573
'www.goodnewsnetwork.org', 'www.positive.news', 'reasonstobecheerful.world',
574574
'www.optimistdaily.com', 'www.sunnyskyz.com', 'www.huffpost.com',
575575
'www.sciencedaily.com', 'feeds.nature.com', 'www.livescience.com', 'www.newscientist.com',
576+
// Feed-registry coverage (PR fix/feed-validation-unblock — kept sync with shared/rss-allowed-domains.json)
577+
'abcnews.go.com', 'abcnews.com', 'www.corriere.it', 'www.rt.com', 'www.alarabiya.net', 'tuoitrenews.vn',
578+
'www.yonhapnewstv.co.kr', 'www.chosun.com', 'rss.libsyn.com', 'feeds.megaphone.fm', 'rss.art19.com',
579+
'idp.nature.com',
576580
]);
577581

578582
function rssProxyPlugin(): Plugin {

0 commit comments

Comments
 (0)