RyanAlberts · May 1, 2026
diff --git a/BACKLOG.md b/BACKLOG.md
@@ -21,7 +21,7 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
 - [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3
 - [B006] Track schema-validation failure rate during enrichment as a tracked metric. The W26 smoke run had 1/5 (20%) parse failures (`velum-labs` — likely rationale exceeded the 400 char limit). Measure this across the full batch and tune prompt or schema if rate exceeds ~5%. — surfaced in: PR #2 smoke — proposed: PR #3
 - [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke. Confirmed in PR #3 full run: 45 of 83 high-confidence rows have OSS posture `unknown`, 52 have tech_stack `unknown`. — proposed: PR after v0.1
-- [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — proposed: PR #4 (CLI polish)
+- [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — **resolved in PR #4** (rate now 0%; root cause was over-strict 400-char cap on `rationale`, fixed by truncate-not-reject; lenient parsing extended to `ai_capability` and `tech_stack`).
 
 ## Done
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,5 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - W26 enrichment smoke run (5 companies via subscription, 39s, ~free): 4 high / 1 low confidence. Identified `gru.space` as `no-ai` correctly. Schema-validation failure on `velum-labs` correctly fell through to the sentinel — no fabricated analysis served.
 - Phase 1 PR #3: enriched dashboard. AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown, and confidence breakdown — all with row-level drill-downs. Cited-URL link-verify hard gate before any artifact ships (override via `--allow-dead-links` writes a `BROKEN_LINKS.md` sidecar and shows a warning banner). Lenient parsing for `industry_secondary` so the model can emit reasonable categories without tanking the row.
 - W26 full-batch enrichment via subscription (124 companies, ~6 min, ~free): 83 high / 41 low confidence. Top finding: **65% of high-confidence W26 companies (54 of 83) build agents**. 8 companies correctly classified as `no-ai` (the trust signal). 3 cited URLs caught dead at publish time and surfaced via the publish gate.
+- Phase 1 PR #4: resilience + parser tightening. Lenient parsing extended to `ai_capability` (drop unknowns, fall back to `unclear`) and `tech_stack` (drop unknowns). `rationale` and `tagline_rewrite` truncate at the schema cap rather than fail the whole row. Raw failure capture (`raw_failures.jsonl`) for audit. Incremental writes to `analyses.jsonl` so partial state survives a crash. New `ycai resume <run-dir>` command resumes interrupted enrichment. New `ycai dashboard <run-dir>` re-renders the dashboard from existing artifacts at zero LLM cost. Live progress shows high/medium/low counts during enrichment.
+- W26 full-batch re-run after PR #4: schema-validation failure rate dropped from 23% to **0%**. High-confidence rate went from 67% to **95%** (118 of 124). The "W26 is the agentic batch" finding holds on the larger cohort: 68 of 118 high-confidence companies (58%) build agents, up from 54 in absolute count, though down from 65% of 83 in share. Quality writeup updated at `docs/QUALITY_REPORT_W26.md`.
 
 [Unreleased]: https://github.com/RyanAlberts/yc-ai-pulse/compare/main...HEAD
diff --git a/docs/QUALITY_REPORT_W26.md b/docs/QUALITY_REPORT_W26.md
@@ -182,3 +182,62 @@ Each is named in [`examples/output/BROKEN_LINKS-w26-2026-05-01.md`](../examples/
 1. **Schema-validation failure rate (23%) is too high for a v0.1 release.** Tracked as B006. Most likely cause is the model emitting enum values outside our closed sets for `ai_capability` or `tech_stack` (we patched `industry_secondary` for this in PR #3 but the other two stayed strict). Fix in a follow-up PR.
 2. **W26 is an agents batch.** This is now defensible — 54 of 83 high-confidence rows, with row-level drill-down showing exactly which companies and what their YC descriptions said.
 3. **The 67% high-confidence rate against 63.3% upstream coverage means the actual analyzable share of W26 is ~42% (83/196).** The headline metric on the dashboard now shows this honestly.
+
+---
+
+## PR #4 — schema-failure rate dropped to 0% (2026-05-01)
+
+A fresh full-batch run after PR #4 (resilience + parser tightening) eliminated schema-validation failures and raised the high-confidence count from 83 to 118:
+
+| metric | PR #3 | PR #4 | change |
+|---|---:|---:|---:|
+| Total analyzed | 124 | 124 | – |
+| **High confidence** | 83 (67%) | **118 (95%)** | **+35** |
+| Schema-validation failures | 29 (23%) | **0 (0%)** | **-29** |
+| Genuinely-uncertain model lows | 12 | 6 | -6 |
+| Hallucinated source URLs | 0 | 0 | – |
+
+**Root cause:** The 23% schema-validation failure rate in PR #3 was caused mainly by the model emitting `rationale` values longer than the 400-character cap (`Field(max_length=400)`). The model was being thorough; our schema was unnecessarily strict on a non-load-bearing field. PR #4 changed the parser to truncate over-long `rationale` and `tagline_rewrite` values rather than reject the row. Strict enforcement remains for load-bearing fields (`industry_primary`, `oss_posture`, `confidence`, `sources`).
+
+The lenient extension to `ai_capability` and `tech_stack` (unknown values are dropped; `ai_capability` falls back to `unclear` if nothing remains) contributed only a modest improvement on its own (~3-4%). The rationale truncation was the bigger win.
+
+### W26 findings, recomputed on the n=118 high-confidence cohort
+
+Capability distribution (a company can carry several capabilities, so the counts sum to more than 118):
+
+| Capability | n | share of n=118 |
+|---|---:|---:|
+| **agents** | 68 | **58%** |
+| nlp-classic | 38 | 32% |
+| rag | 34 | 29% |
+| data-pipeline | 33 | 28% |
+| multimodal | 17 | 14% |
+| vision | 17 | 14% |
+| inference-infra | 11 | 9% |
+| evals-observability | 11 | 9% |
+| training-infra | 10 | 8% |
+
+The "W26 is the agentic batch" finding holds on the larger cohort: 68 of 118 high-confidence companies (58%) build agents. The share is down from 65% (54 of 83) in PR #3, but the absolute count rose from 54 to 68 and the cohort is broader.
+
+Industry mix (top of n=118):
+
+| Industry | n |
+|---|---:|
+| B2B SaaS | 28 |
+| AI Infrastructure | 14 |
+| Developer Tools | 12 |
+| Fintech | 12 |
+| Healthcare | 6 |
+| Robotics | 6 |
+| Consumer | 6 |
+| Legal | 5 |
+
+OSS posture on the n=118 cohort: unknown (65), closed (48), api-only (3), source-available (1), fully-open (1). `unknown` remains the majority (65 of 118, 55%) because the model still lacks website-level evidence; `B007` (depth=1 crawl) is the next lever.
+
+### Coverage of YC official, updated
+
+- Upstream: 132 of 196 (67.3%) — unchanged
+- Tier A+B: 124 of 132 — unchanged
+- Tier A+B with high-confidence LLM analysis: **118 of 196 (60.2%)** — the most honest "what we actually know about W26" number.
+
+The headline coverage figure on the dashboard is unchanged at 63.3% (124 of 196), because it measures data coverage rather than analysis confidence. For the deck or memo, cite 60.2% as "the share of W26 we can substantively classify."
diff --git a/examples/README.md b/examples/README.md
@@ -4,12 +4,14 @@ Sanitized sample artifacts. Every commit goes through `make publish-check` so PI
 
 | File | What |
 |---|---|
-| [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | **PR #3 full-batch dashboard.** Headline: 63.3% coverage of W26, with LLM-derived charts: AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown. Dead-link banner at top because 3 cited URLs returned 4xx/5xx at publish time. |
-| [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). Useful comparison for what shifts when --enrich is added. |
+| [`output/dashboard-w26-pr4-2026-05-01.html`](output/dashboard-w26-pr4-2026-05-01.html) | **PR #4 dashboard — current best.** 118 of 124 high-confidence (95%). Schema-failure rate now 0%. Top finding: 58% of n=118 build agents. |
+| [`output/analyses-w26-pr4-2026-05-01.json`](output/analyses-w26-pr4-2026-05-01.json) | **PR #4 enrichment.** 124 companies, 0 schema failures, 6 genuine model lows, 0 hallucinated source URLs. |
+| [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | PR #3 dashboard, kept as before/after comparison (67% high-confidence, 23% schema failures). |
+| [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). |
 | [`output/coverage-w26-2026-05-01.json`](output/coverage-w26-2026-05-01.json) | Machine-readable coverage report — what feeds the dashboard. |
-| [`output/analyses-w26-full-2026-05-01.json`](output/analyses-w26-full-2026-05-01.json) | **PR #3 full-batch enrichment.** 124 companies × Sonnet 4.6, ~6 min on subscription. 83 high-confidence rows feed the charts; 41 low-confidence rows surface honestly in the methodology footer. |
+| [`output/analyses-w26-full-2026-05-01.json`](output/analyses-w26-full-2026-05-01.json) | PR #3 full-batch enrichment. Kept for comparison. |
 | [`output/analyses-w26-smoke-2026-05-01.json`](output/analyses-w26-smoke-2026-05-01.json) | PR #2 smoke run: 5 companies, the original proof of life. |
-| [`output/BROKEN_LINKS-w26-2026-05-01.md`](output/BROKEN_LINKS-w26-2026-05-01.md) | Sidecar from the full run. Names the 3 cited URLs that returned 4xx/5xx and the slugs that cited them. |
+| [`output/BROKEN_LINKS-w26-2026-05-01.md`](output/BROKEN_LINKS-w26-2026-05-01.md) | Sidecar from the PR #3 full run. Names dead cited URLs and the slugs that cited them. |
 
 The full quality writeup for W26 is in [`docs/QUALITY_REPORT_W26.md`](../docs/QUALITY_REPORT_W26.md).
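
The truncate-not-reject and lenient-enum behavior that PR #4 describes in `docs/QUALITY_REPORT_W26.md` can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the project's actual parser: the function names, the `tagline_rewrite` cap of 120, and the `tech_stack` values are assumptions made for the example. Only the 400-character `rationale` cap, the capability names, and the `unclear` fallback come from the diff above.

```python
from typing import Any, Iterable, Optional

# The 400-char rationale cap is from the report; the tagline cap is a hypothetical value.
TEXT_CAPS = {"rationale": 400, "tagline_rewrite": 120}

# Capability names taken from the report's table, plus the `unclear` fallback.
ALLOWED_AI_CAPABILITY = {
    "agents", "nlp-classic", "rag", "data-pipeline", "multimodal", "vision",
    "inference-infra", "evals-observability", "training-infra", "unclear",
}

# Hypothetical closed set; the real tech_stack enum is not shown in the diff.
ALLOWED_TECH_STACK = {"openai", "anthropic", "pytorch"}


def filter_enum(
    values: Iterable[str], allowed: set, fallback: Optional[str] = None
) -> list:
    """Drop values outside the closed set; use the fallback if nothing survives."""
    kept = [v for v in values if v in allowed]
    if not kept and fallback is not None:
        return [fallback]
    return kept


def lenient_parse(row: dict) -> dict:
    """Return a copy of the row with soft fields repaired instead of rejected."""
    out: dict = dict(row)
    # Over-long free-text fields are truncated at the cap rather than failing the row.
    for field, cap in TEXT_CAPS.items():
        if isinstance(out.get(field), str):
            out[field] = out[field][:cap]
    # Unknown capabilities are dropped; an empty result falls back to `unclear`.
    out["ai_capability"] = filter_enum(
        row.get("ai_capability", []), ALLOWED_AI_CAPABILITY, fallback="unclear"
    )
    # Unknown tech-stack values are dropped with no fallback.
    out["tech_stack"] = filter_enum(row.get("tech_stack", []), ALLOWED_TECH_STACK)
    return out
```

Load-bearing fields such as `industry_primary`, `oss_posture`, `confidence`, and `sources` are deliberately left untouched here, so a strict validator placed after this step would still reject rows where those are malformed.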