2 changes: 1 addition & 1 deletion BACKLOG.md
@@ -21,7 +21,7 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
- [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3
- [B006] Track schema-validation failure rate during enrichment as a tracked metric. The W26 smoke run had 1/5 (20%) parse failures (`velum-labs` — likely rationale exceeded the 400 char limit). Measure this across the full batch and tune prompt or schema if rate exceeds ~5%. — surfaced in: PR #2 smoke — proposed: PR #3
- [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke. Confirmed in PR #3 full run: 45 of 83 high-confidence rows have OSS posture `unknown`, 52 have tech_stack `unknown`. — proposed: PR after v0.1
- [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — proposed: PR #4 (CLI polish)
- [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — **resolved in PR #4** (rate now 0%; root cause was over-strict 400-char cap on `rationale`, fixed by truncate-not-reject; lenient parsing extended to `ai_capability` and `tech_stack`).

## Done

2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -16,5 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- W26 enrichment smoke run (5 companies via subscription, 39s, ~free): 4 high / 1 low confidence. Identified `gru.space` as `no-ai` correctly. Schema-validation failure on `velum-labs` correctly fell through to the sentinel — no fabricated analysis served.
- Phase 1 PR #3: enriched dashboard. AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown, and confidence breakdown — all with row-level drill-downs. Cited-URL link-verify hard gate before any artifact ships (override via `--allow-dead-links` writes a `BROKEN_LINKS.md` sidecar and shows a warning banner). Lenient parsing for `industry_secondary` so the model can emit reasonable categories without tanking the row.
- W26 full-batch enrichment via subscription (124 companies, ~6 min, ~free): 83 high / 41 low confidence. Top finding: **65% of high-confidence W26 companies (54 of 83) build agents**. 8 companies correctly classified as `no-ai` (the trust signal). 3 cited URLs caught dead at publish time and surfaced via the publish gate.
- Phase 1 PR #4: resilience + parser tightening. Lenient parsing extended to `ai_capability` (drop unknowns, fall back to `unclear`) and `tech_stack` (drop unknowns). `rationale` and `tagline_rewrite` truncate at the schema cap rather than fail the whole row. Raw failure capture (`raw_failures.jsonl`) for audit. Incremental writes to `analyses.jsonl` so partial state survives a crash. New `ycai resume <run-dir>` command resumes interrupted enrichment. New `ycai dashboard <run-dir>` re-renders the dashboard from existing artifacts at zero LLM cost. Live progress shows high/medium/low counts during enrichment.
- W26 full-batch re-run after PR #4: schema-validation failure rate dropped from 23% to **0%**. High-confidence rate went from 67% to **95%** (118 of 124). The "W26 is the agentic batch" finding now rests on a larger cohort: 68 of 118 high-confidence companies (58%) build agents, versus 54 of 83 (65%) in PR #3. Quality writeup updated at `docs/QUALITY_REPORT_W26.md`.

[Unreleased]: https://github.com/RyanAlberts/yc-ai-pulse/compare/main...HEAD
59 changes: 59 additions & 0 deletions docs/QUALITY_REPORT_W26.md
@@ -182,3 +182,62 @@ Each is named in [`examples/output/BROKEN_LINKS-w26-2026-05-01.md`](../examples/
1. **Schema-validation failure rate (23%) is too high for a v0.1 release.** Tracked as B006. Most likely cause is the model emitting enum values outside our closed sets for `ai_capability` or `tech_stack` (we patched `industry_secondary` for this in PR #3 but the other two stayed strict). Fix in a follow-up PR.
2. **W26 is an agents batch.** This is now defensible — 54 of 83 high-confidence rows, with row-level drill-down showing exactly which companies and what their YC descriptions said.
3. **The 67% high-confidence rate against 63.3% upstream coverage means the actual analyzable share of W26 is ~42% (83/196).** The headline metric on the dashboard now shows this honestly.

---

## PR #4 — schema-failure rate dropped to 0% (2026-05-01)

After PR #4 (resilience + parser tightening), full-batch enrichment metrics improved meaningfully on a fresh run:

| metric | PR #3 | PR #4 | change |
|---|---:|---:|---:|
| Total analyzed | 124 | 124 | – |
| **High confidence** | 83 (67%) | **118 (95%)** | **+35** |
| Schema-validation failures | 29 (23%) | **0 (0%)** | **-29** |
| Genuinely-uncertain model lows | 12 | 6 | -6 |
| Hallucinated source URLs | 0 | 0 | – |

**Root cause:** The 23% schema-validation failure rate in PR #3 was caused by the model emitting `rationale` fields longer than our 400-char `Field(max_length=400)` constraint. The model was being thorough; our schema was being unnecessarily strict on a non-load-bearing field. PR #4 changed the parser to truncate over-long `rationale` and `tagline_rewrite` rather than reject the row. Strict enforcement remains for load-bearing fields (`industry_primary`, `oss_posture`, `confidence`, `sources`).

The lenient extension to `ai_capability` and `tech_stack` (filter unknown values, fall back to `unclear` if all dropped) contributed only modest improvement on its own (~3-4%). The rationale truncation was the bigger win.
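A minimal sketch of the truncate-not-reject behaviour described above, assuming the raw model response arrives as a plain dict. The function name `lenient_parse` and the enum members are hypothetical stand-ins, not the project's actual parser or closed sets; only the 400-char `rationale` cap and the list of strict fields come from this report. `tagline_rewrite` is left out because its cap is not stated here.

```python
from typing import Any

# Hypothetical closed enums for illustration; the real sets are not listed in this report.
AI_CAPABILITIES = {"agents", "rag", "nlp-classic", "vision", "unclear"}
TECH_STACK = {"openai", "anthropic", "pytorch"}

RATIONALE_MAX = 400  # the Field(max_length=400) cap named in the root-cause note
STRICT_FIELDS = ("industry_primary", "oss_posture", "confidence", "sources")


def lenient_parse(raw: dict[str, Any]) -> dict[str, Any]:
    """Normalise one raw response so non-load-bearing fields cannot sink the row."""
    row = dict(raw)

    # Load-bearing fields stay strict: a missing one still rejects the row.
    missing = [field for field in STRICT_FIELDS if field not in row]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # Truncate over-long free text instead of failing validation.
    row["rationale"] = row.get("rationale", "")[:RATIONALE_MAX]

    # Drop values outside the closed set; fall back to "unclear" if none survive.
    capabilities = [c for c in row.get("ai_capability", []) if c in AI_CAPABILITIES]
    row["ai_capability"] = capabilities or ["unclear"]

    # tech_stack: drop unknown values, with no fallback value.
    row["tech_stack"] = [t for t in row.get("tech_stack", []) if t in TECH_STACK]
    return row
```

In this sketch a 500-character `rationale` comes back cut to 400 characters and an out-of-enum capability becomes `["unclear"]`, while a response missing `sources` still raises.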

### W26 findings, recomputed on the n=118 high-confidence cohort

Capability distribution:

| Capability | n | share of n=118 |
|---|---:|---:|
| **agents** | 68 | **58%** |
| nlp-classic | 38 | 32% |
| rag | 34 | 29% |
| data-pipeline | 33 | 28% |
| multimodal | 17 | 14% |
| vision | 17 | 14% |
| inference-infra | 11 | 9% |
| evals-observability | 11 | 9% |
| training-infra | 10 | 8% |

The "W26 is the agentic batch" finding holds on the larger cohort: 68 of 118 high-confidence companies (58%) build agents. The share is down from 65% (54 of 83) in PR #3, but the absolute count rose by 14 and the cohort is broader, so the estimate rests on more data.
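The shares in the capability table can be rechecked from the raw counts. The sketch below uses only the numbers printed in this report; the variable and function names are illustrative, not taken from the codebase.

```python
N_HIGH_CONFIDENCE = 118  # PR #4 high-confidence cohort

# Counts copied from the capability table above.
capability_counts = {
    "agents": 68,
    "nlp-classic": 38,
    "rag": 34,
    "data-pipeline": 33,
    "multimodal": 17,
    "vision": 17,
    "inference-infra": 11,
    "evals-observability": 11,
    "training-infra": 10,
}


def share_pct(n: int, total: int) -> int:
    """Share of the cohort as a whole-number percentage."""
    return round(100 * n / total)


shares = {cap: share_pct(n, N_HIGH_CONFIDENCE) for cap, n in capability_counts.items()}
pr3_agents_share = share_pct(54, 83)  # PR #3: 54 of 83 high-confidence rows

# The labels sum to more than the cohort size, so a company can carry
# several capabilities and the shares are not expected to add up to 100%.
total_labels = sum(capability_counts.values())

print(shares["agents"], pr3_agents_share, total_labels)
```

The agents share comes out at 58% against 65% in PR #3, and the label total (239) exceeds 118, which is why the column does not sum to 100%.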

Industry mix (top of n=118):

| Industry | n |
|---|---:|
| B2B SaaS | 28 |
| AI Infrastructure | 14 |
| Developer Tools | 12 |
| Fintech | 12 |
| Healthcare | 6 |
| Robotics | 6 |
| Consumer | 6 |
| Legal | 5 |

OSS posture on the n=118 cohort: unknown (65), closed (48), api-only (3), source-available (1), fully-open (1). `unknown` is still the majority (55%) because the model lacks website-level evidence; `B007` (depth=1 crawl) is the next lever.

### Coverage of YC official, updated

- Upstream: 132 of 196 (67.3%) — unchanged
- Tier A+B: 124 of 132 — unchanged
- Tier A+B with high-confidence LLM analysis: **118 of 196 (60.2%)** — the most honest "what we actually know about W26" number.

The headline coverage figure on the dashboard is unchanged at 63.3% (124 of 196), because it uses the data-quality denominator rather than LLM confidence. For the deck/memo, cite 60.2% as "the share of W26 we can substantively classify."
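The coverage percentages quoted in this report all share the 196-company denominator. A short sketch that reproduces them from the counts above; the stage names are labels chosen for illustration, not identifiers from the project.

```python
W26_TOTAL = 196  # full W26 batch size used as the denominator

# Stage counts as reported in this document.
funnel = {
    "upstream": 132,
    "tier_a_b": 124,
    "high_confidence_pr3": 83,
    "high_confidence_pr4": 118,
}


def pct_of_batch(n: int) -> float:
    """Percentage of the full W26 batch, to one decimal place."""
    return round(100 * n / W26_TOTAL, 1)


coverage = {stage: pct_of_batch(n) for stage, n in funnel.items()}

for stage, pct in coverage.items():
    print(f"{stage}: {funnel[stage]} of {W26_TOTAL} ({pct}%)")
```

Read this way, the dashboard headline (63.3%) is the `tier_a_b` stage and the deck/memo figure (60.2%) is the `high_confidence_pr4` stage, up from roughly 42% in PR #3.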
10 changes: 6 additions & 4 deletions examples/README.md
@@ -4,12 +4,14 @@ Sanitized sample artifacts. Every commit goes through `make publish-check` so PI

| File | What |
|---|---|
| [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | **PR #3 full-batch dashboard.** Headline: 63.3% coverage of W26, with LLM-derived charts: AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown. Dead-link banner at top because 3 cited URLs returned 4xx/5xx at publish time. |
| [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). Useful comparison for what shifts when --enrich is added. |
| [`output/dashboard-w26-pr4-2026-05-01.html`](output/dashboard-w26-pr4-2026-05-01.html) | **PR #4 dashboard — current best.** 118 of 124 high-confidence (95%). Schema-failure rate now 0%. Top finding: 58% of n=118 build agents. |
| [`output/analyses-w26-pr4-2026-05-01.json`](output/analyses-w26-pr4-2026-05-01.json) | **PR #4 enrichment.** 124 companies, 0 schema failures, 6 genuine model lows, 0 hallucinated source URLs. |
| [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | PR #3 dashboard, kept as before/after comparison (67% high-confidence, 23% schema failures). |
| [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). |
| [`output/coverage-w26-2026-05-01.json`](output/coverage-w26-2026-05-01.json) | Machine-readable coverage report — what feeds the dashboard. |
| [`output/analyses-w26-full-2026-05-01.json`](output/analyses-w26-full-2026-05-01.json) | **PR #3 full-batch enrichment.** 124 companies × Sonnet 4.6, ~6 min on subscription. 83 high-confidence rows feed the charts; 41 low-confidence rows surface honestly in the methodology footer. |
| [`output/analyses-w26-full-2026-05-01.json`](output/analyses-w26-full-2026-05-01.json) | PR #3 full-batch enrichment. Kept for comparison. |
| [`output/analyses-w26-smoke-2026-05-01.json`](output/analyses-w26-smoke-2026-05-01.json) | PR #2 smoke run: 5 companies, the original proof of life. |
| [`output/BROKEN_LINKS-w26-2026-05-01.md`](output/BROKEN_LINKS-w26-2026-05-01.md) | Sidecar from the full run. Names the 3 cited URLs that returned 4xx/5xx and the slugs that cited them. |
| [`output/BROKEN_LINKS-w26-2026-05-01.md`](output/BROKEN_LINKS-w26-2026-05-01.md) | Sidecar from the PR #3 full run. Names dead cited URLs and the slugs that cited them. |

The full quality writeup for W26 is in [`docs/QUALITY_REPORT_W26.md`](../docs/QUALITY_REPORT_W26.md).
