feat(phase-1): enriched dashboard with cited-URL publish gate (#8)
Closes #3.
What ships
- src/ycai/dashboard.py: rewritten to take optional analyses=. With
  analyses, renders these LLM-derived charts in addition to the YC
  baseline:
  - confidence breakdown (high/medium/low stacked bar)
  - LLM industry distribution (excludes low-confidence rows)
  - AI capability x industry heatmap
  - tech stack signals
  - OSS posture stacked bar with green-to-red color mapping
  Each chart has a row-level drill-down via <details>.
- src/ycai/dashboard.py:collect_cited_urls + write_broken_links_report:
  the publish-gate plumbing.
- src/ycai/cli.py: after enrichment, every URL cited in any analysis
  is HEAD/GET-verified. If any returns 4xx/5xx, the dashboard is not
  written and exit code 4 is returned. --allow-dead-links overrides
  to write the dashboard with a loud banner plus a sidecar
  BROKEN_LINKS.md naming each dead URL and the slugs that cited it.
- src/ycai/researcher.py:_drop_unknown_industries: lenient parsing
  for industry_secondary only; primary stays strict. Models
  occasionally emit reasonable-but-out-of-set categories like
  'Productivity'; we drop those rather than failing the whole row.
- tests/test_dashboard.py: 14 new tests covering both modes,
  coverage banner, dropped register, drill-downs, every OSS posture
  value, and the publish-gate flow.
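
The publish gate described for src/ycai/cli.py can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `http_status`, `find_dead_urls`, and `publish_gate` are hypothetical names, the HEAD-to-GET fallback on 405/501 and the 599 placeholder for unreachable hosts are assumptions, and so is returning exit code 0 when --allow-dead-links is set (the message only specifies exit code 4 for the blocked case).

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

EXIT_DEAD_LINKS = 4  # exit code named in the commit message


def http_status(url: str, timeout: float = 10.0) -> int:
    """Try HEAD first; fall back to GET if the server rejects HEAD (assumed policy)."""
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            if method == "HEAD" and err.code in (405, 501):
                continue  # HEAD not supported, retry with GET
            return err.code
        except OSError:
            return 599  # assumption: unreachable hosts count as dead
    return 599


def find_dead_urls(urls: Iterable[str], get_status: Callable[[str], int]) -> list[str]:
    """Return the distinct URLs whose status is 4xx/5xx, sorted for stable output."""
    return sorted(u for u in set(urls) if get_status(u) >= 400)


def publish_gate(urls, get_status, allow_dead_links=False) -> tuple[int, bool, list[str]]:
    """Return (exit_code, write_dashboard, dead_urls)."""
    dead = find_dead_urls(urls, get_status)
    if dead and not allow_dead_links:
        return EXIT_DEAD_LINKS, False, dead
    return 0, True, dead  # exit code 0 on override is an assumption


# Offline demo with a stubbed status lookup (no network access needed).
statuses = {"https://example.com/ok": 200, "https://example.com/gone": 404}
blocked = publish_gate(statuses, statuses.__getitem__)
overridden = publish_gate(statuses, statuses.__getitem__, allow_dead_links=True)
print(blocked)     # (4, False, ['https://example.com/gone'])
print(overridden)  # (0, True, ['https://example.com/gone'])
```

Passing the status lookup as a parameter keeps the gate testable without network access; a real run would pass `http_status` instead of the stub.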
Live full-batch run on W26 (124 companies, ~6 min on subscription)
- 83 high / 41 low confidence (67%/33%)
- of low: 29 schema-validation failures, 12 honest model lows, 0
  hallucinated source URLs (the source-URL guard had nothing to
  catch on this run, and the correctness invariant still holds)
- top capability is agents at 54 (65% of high-confidence rows)
- 8 companies correctly classified no-ai despite being in YC
- OSS posture mostly 'unknown' (45): the model admits the gap
  rather than guessing. Predicted by B007 in BACKLOG.
- 3 cited URLs were dead at publish time; surfaced in
  BROKEN_LINKS-w26-2026-05-01.md, dashboard rendered with banner
  via --allow-dead-links
Sample artifacts checked in:
- examples/output/dashboard-w26-enriched-2026-05-01.html
- examples/output/analyses-w26-full-2026-05-01.json
- examples/output/BROKEN_LINKS-w26-2026-05-01.md
Quality findings written up in docs/QUALITY_REPORT_W26.md.
B008 added to BACKLOG: reduce the schema-validation failure rate from
23% via either lenient ai_capability/tech_stack parsing or tool_use
schema enforcement on the API backend.
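
The strict-primary, lenient-secondary split that B008 proposes extending can be illustrated as below. This is a hedged sketch only: the allowed set is a small stand-in built from labels that appear in the W26 report, `industry_primary` is an assumed field name (the message only names `industry_secondary`), and treating the secondary field as a list is also an assumption.

```python
# Stand-in closed set; the real enum in the repo is not shown in this commit page.
ALLOWED_INDUSTRIES = frozenset(
    {"B2B SaaS", "Fintech", "Developer Tools", "AI Infrastructure"}
)


def drop_unknown_secondary(row: dict, allowed: frozenset = ALLOWED_INDUSTRIES) -> dict:
    """Keep the primary industry strict; silently drop out-of-set secondary values."""
    primary = row.get("industry_primary")
    if primary not in allowed:
        # Strict path: an unknown primary still fails the whole row.
        raise ValueError(f"unknown primary industry: {primary!r}")
    cleaned = dict(row)  # shallow copy so the caller's row is not mutated
    cleaned["industry_secondary"] = [
        value for value in row.get("industry_secondary", []) if value in allowed
    ]
    return cleaned


row = {"industry_primary": "Fintech", "industry_secondary": ["Productivity", "B2B SaaS"]}
print(drop_unknown_secondary(row))
# {'industry_primary': 'Fintech', 'industry_secondary': ['B2B SaaS']}
```

Extending the same filter to `ai_capability` or `tech_stack` would trade some strictness for fewer sentinel rows, which is the choice B008 leaves open.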
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BACKLOG.md (+2 -1)
@@ -20,7 +20,8 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
  - [B004] Tune `MIN_DESCRIPTION_CHARS` (currently 80). The W26 probe surfaced one borderline drop (`moda`, 57 chars). A small calibration study against borderline rows would let us pick a defensible threshold. — surfaced in: W26 quality probe — proposed: PR #2
  - [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3
  - [B006] Track schema-validation failure rate during enrichment as a tracked metric. The W26 smoke run had 1/5 (20%) parse failures (`velum-labs` — likely rationale exceeded the 400 char limit). Measure this across the full batch and tune prompt or schema if rate exceeds ~5%. — surfaced in: PR #2 smoke — proposed: PR #3
- - [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke — proposed: PR #3
+ - [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke. Confirmed in PR #3 full run: 45 of 83 high-confidence rows have OSS posture `unknown`, 52 have tech_stack `unknown`. — proposed: PR after v0.1
+ - [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — proposed: PR #4 (CLI polish)
CHANGELOG.md (+2 -0)
@@ -14,5 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  - First end-to-end probe on YC W26: 63.3% coverage of the official 196-company batch. Findings in `docs/QUALITY_REPORT_W26.md`.
  - Phase 1 PR #2: LLM-based enrichment with anti-hallucination Layer 1 — pydantic-enforced output schema, source-URL guard against fabricated citations, two-pass cross-check on uncertain rows, sentinel low-confidence row on any failure. Three backends: `AgentSDKBackend` (subscription-default), `AnthropicAPIBackend` (`--api-key`), `MockBackend` (tests). 10 hallucination-trap fixtures locked in as regression tests.
  - W26 enrichment smoke run (5 companies via subscription, 39s, ~free): 4 high / 1 low confidence. Identified `gru.space` as `no-ai` correctly. Schema-validation failure on `velum-labs` correctly fell through to the sentinel — no fabricated analysis served.
+ - Phase 1 PR #3: enriched dashboard. AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown, and confidence breakdown — all with row-level drill-downs. Cited-URL link-verify hard gate before any artifact ships (override via `--allow-dead-links` writes a `BROKEN_LINKS.md` sidecar and shows a warning banner). Lenient parsing for `industry_secondary` so the model can emit reasonable categories without tanking the row.
+ - W26 full-batch enrichment via subscription (124 companies, ~6 min, ~free): 83 high / 41 low confidence. Top finding: **65% of high-confidence W26 companies (54 of 83) build agents**. 8 companies correctly classified as `no-ai` (the trust signal). 3 cited URLs caught dead at publish time and surfaced via the publish gate.
- [B004] Tune `MIN_DESCRIPTION_CHARS`. 80 is a guess; a small calibration study against the 8 borderline companies would let us pick a defensible value.
- [B005] Add a "what's missing" section to the dashboard that compares yc-oss slugs to a slug list discovered from the YC `/companies/<slug>` profile pages, so we can name the 64 missing W26 companies, not just count them.
After PR #3 (enriched dashboard), the full 124-company enrichment ran end-to-end via Claude Max subscription. Took ~6 minutes.

### Confidence

- **83 high (67%)** + **0 medium** + **41 low (33%)**.
- Of the 41 low-confidence rows: **29** were schema-validation failures (model emitted output that didn't validate after lenient pass), **12** were genuinely uncertain outputs the model itself flagged as low.
- **0 hallucinated source URLs** detected. The source-URL guard caught zero cases on this run; every cited URL traced back to either the company website or its YC profile page.
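
The confidence figures above are internally consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives the percentages from the counts quoted in this section; the dictionary keys are illustrative names, not identifiers from the codebase.

```python
from collections import Counter

# Counts quoted in the Confidence section.
confidence = Counter(high=83, medium=0, low=41)
total = sum(confidence.values())

# Breakdown of the low-confidence rows (key names are illustrative).
low_causes = {"schema_validation_failure": 29, "model_flagged_low": 12}
assert sum(low_causes.values()) == confidence["low"]


def pct(part: int, whole: int) -> int:
    """Whole-number percentage, rounded to nearest."""
    return round(100 * part / whole)


shares = {level: pct(n, total) for level, n in confidence.items()}
schema_failure_rate = pct(low_causes["schema_validation_failure"], total)

print(total, shares, schema_failure_rate)
# 124 {'high': 67, 'medium': 0, 'low': 33} 23
```

The 23% schema-validation failure rate is computed against all 124 enriched rows, not against the 41 low-confidence rows.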

### Industry distribution (Tier A high+medium, n=83)

| Industry | n |
|---|---:|
| B2B SaaS | 16 |
| Fintech | 10 |
| Developer Tools | 7 |
| AI Infrastructure | 7 |
| Legal | 5 |
| Healthcare | 5 |
| Biotech | 4 |
| Security | 4 |

The B2B-heavy mix lines up with the [VCCorner W26 demo-day breakdown](https://www.thevccorner.com/p/yc-w26-demo-day-2026-complete-breakdown). The visible Legal cluster (5) is a smaller but real cohort the article didn't separately call out.

### AI capability distribution (n=83)

| Capability | n |
|---|---:|
| **agents** | 54 |
| nlp-classic | 30 |
| rag | 26 |
| data-pipeline | 19 |
| vision | 14 |
| multimodal | 10 |
| evals-observability | 9 |
| **no-ai** | 8 |

**Top finding**: 65% (54 of 83) of high-confidence W26 companies build agents. This is the dominant story of the batch.

**Honesty check**: 8 companies were correctly classified as `no-ai` despite being in the YC batch. The LLM is willing to say "the YC profile suggests AI but the description doesn't actually substantiate it." This is exactly the behavior the anti-hallucination contract is meant to produce.
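
The capability counts above sum to more than n=83, so a row evidently can carry more than one capability label. A small sketch of how the per-capability shares relate to the row count, using only the numbers in the table (and assuming the table counts label occurrences, which the report does not state explicitly):

```python
# Counts copied from the AI capability table; n_rows is the stated n.
capability_counts = {
    "agents": 54,
    "nlp-classic": 30,
    "rag": 26,
    "data-pipeline": 19,
    "vision": 14,
    "multimodal": 10,
    "evals-observability": 9,
    "no-ai": 8,
}
n_rows = 83

total_labels = sum(capability_counts.values())
labels_per_row = total_labels / n_rows  # above 1.0 implies multi-label rows

# Share of high-confidence rows carrying each label (these do not sum to 100%).
shares = {cap: round(100 * n / n_rows) for cap, n in capability_counts.items()}

print(total_labels, round(labels_per_row, 2), shares["agents"])
# 170 2.05 65
```

Because the labels overlap, the 65% agents share is a share of rows, not a share of all labels assigned.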

### OSS posture (n=83)

| Posture | n |
|---|---:|
| unknown | 45 |
| closed | 36 |
| api-only | 1 |
| source-available | 1 |
| fully-open | 0 |

**The `unknown` majority (45 of 83) is the main signal**, and it's structural. The model has access only to the YC `long_description`; OSS posture is rarely stated there. **B007** in the backlog (depth=1 website crawl) would shift these `unknown` rows to `closed` / `api-only` / `weights-only` based on actual evidence (license files, GitHub presence, pricing pages).

Until then, do not over-interpret the `unknown` count: it's a measurement gap, not a finding.

### Tech stack

Dominated by `unknown` (52) and `custom-model` (13). The structural reason is the same: descriptions don't usually name the model provider. `custom-model` is signal-bearing: 13 companies advertise their own models / fine-tunes, which is a meaningful slice of W26.

### Cited-URL link verification (the publish gate)

Of all source URLs cited across 83 high-confidence rows, **3** returned 4xx/5xx at publish time:

Each is named in [`examples/output/BROKEN_LINKS-w26-2026-05-01.md`](../examples/output/BROKEN_LINKS-w26-2026-05-01.md) with the company that cited it. The dashboard was rendered with `--allow-dead-links` for this example, with a warning banner at the top. In production runs (no `--allow-dead-links`), the pipeline would have refused to write the dashboard and exited non-zero. That's the publish gate.
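
A sidecar like the one referenced above needs a mapping from each dead URL back to the companies that cited it. The following is a minimal sketch under stated assumptions: `urls_to_slugs` and `render_broken_links_md` are hypothetical helpers (not the repo's `collect_cited_urls` or `write_broken_links_report`), the `slug` and `source_urls` keys are assumed row fields, and the Markdown layout is invented for illustration.

```python
from collections import defaultdict


def urls_to_slugs(analyses: list[dict]) -> dict[str, list[str]]:
    """Map each cited URL to the sorted slugs of the rows that cited it."""
    cited = defaultdict(set)
    for row in analyses:
        for url in row.get("source_urls", []):
            cited[url].add(row["slug"])
    return {url: sorted(slugs) for url, slugs in cited.items()}


def render_broken_links_md(dead_urls: list[str], cited_by: dict[str, list[str]]) -> str:
    """Render one bullet per dead URL, naming the slugs that cited it."""
    lines = ["# Broken links", ""]
    for url in sorted(dead_urls):
        slugs = ", ".join(cited_by.get(url, []))
        lines.append(f"- {url} (cited by: {slugs})")
    return "\n".join(lines) + "\n"


# Made-up rows for illustration; these are not companies from the W26 run.
analyses = [
    {"slug": "acme", "source_urls": ["https://example.com/a", "https://example.com/dead"]},
    {"slug": "globex", "source_urls": ["https://example.com/dead"]},
]
report = render_broken_links_md(["https://example.com/dead"], urls_to_slugs(analyses))
print(report)
```

Keying the report by URL rather than by company means a link cited by several rows appears once, with every citing slug listed beside it.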

### Implications

1. **Schema-validation failure rate (23%) is too high for a v0.1 release.** Tracked as B006, with the fix scoped in B008. Most likely cause is the model emitting enum values outside our closed sets for `ai_capability` or `tech_stack` (we patched `industry_secondary` for this in PR #3 but the other two stayed strict). Fix in a follow-up PR.
2. **W26 is an agents batch.** This is now defensible: 54 of 83 high-confidence rows, with row-level drill-down showing exactly which companies and what their YC descriptions said.
3. **The 67% high-confidence rate against 63.3% upstream coverage means the actual analyzable share of W26 is ~42% (83/196).** The headline metric on the dashboard now shows this honestly.
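
The ~42% figure in the last point is the product of the two rates, which reduces to 83/196. A short worked check, assuming the 124 enriched companies are exactly the set behind the 63.3% upstream coverage number (124 of 196):

```python
from fractions import Fraction

batch_size = 196       # official W26 batch size
enriched = 124         # companies covered upstream and enriched
high_confidence = 83   # rows that feed the charts

coverage = Fraction(enriched, batch_size)
high_rate = Fraction(high_confidence, enriched)
analyzable = coverage * high_rate

# The intermediate count cancels, leaving high-confidence rows over the batch.
assert analyzable == Fraction(high_confidence, batch_size)


def as_pct(x: Fraction) -> float:
    """Percentage rounded to one decimal place."""
    return round(float(x) * 100, 1)


print(as_pct(coverage), as_pct(high_rate), as_pct(analyzable))
# 63.3 66.9 42.3
```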
examples/README.md (+5 -2)
@@ -4,9 +4,12 @@ Sanitized sample artifacts. Every commit goes through `make publish-check` so PI

  | File | What |
  |---|---|
- | [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | Phase 1 dashboard for YC W26. Headline: 63.3% coverage of the 196-company batch, with the dropped register naming every excluded company. |
+ | [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | **PR #3 full-batch dashboard.** Headline: 63.3% coverage of W26, with LLM-derived charts: AI capability x industry heatmap, tech-stack distribution, OSS-posture breakdown. Dead-link banner at top because 3 cited URLs returned 4xx/5xx at publish time. |
+ | [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). Useful comparison for what shifts when `--enrich` is added. |
  | [`output/coverage-w26-2026-05-01.json`](output/coverage-w26-2026-05-01.json) | Machine-readable coverage report — what feeds the dashboard. |
- | [`output/analyses-w26-smoke-2026-05-01.json`](output/analyses-w26-smoke-2026-05-01.json) | PR #2 smoke run: 5-company LLM enrichment via Sonnet 4.6 on subscription. Captures the schema-enforced output and demonstrates source-URL grounding (every cited URL is from `website` or YC profile). |
+ | [`output/analyses-w26-full-2026-05-01.json`](output/analyses-w26-full-2026-05-01.json) | **PR #3 full-batch enrichment.** 124 companies × Sonnet 4.6, ~6 min on subscription. 83 high-confidence rows feed the charts; 41 low-confidence rows surface honestly in the methodology footer. |
+ | [`output/analyses-w26-smoke-2026-05-01.json`](output/analyses-w26-smoke-2026-05-01.json) | PR #2 smoke run: 5 companies, the original proof of life. |
+ | [`output/BROKEN_LINKS-w26-2026-05-01.md`](output/BROKEN_LINKS-w26-2026-05-01.md) | Sidecar from the full run. Names the 3 cited URLs that returned 4xx/5xx and the slugs that cited them. |

  The full quality writeup for W26 is in [`docs/QUALITY_REPORT_W26.md`](../docs/QUALITY_REPORT_W26.md).