Commit ceed52e

RyanAlberts and claude authored
feat(phase-1): scraper + sanitizer + coverage probe with W26 quality report (#6)
* feat(phase-1): scraper + sanitizer + coverage probe with W26 quality report

Phase 1 PR #1: ships the data-quality floor before any LLM cost is incurred.

What this lands

- src/ycai/schemas.py: pydantic models — RawCompany, CoverageRecord, BatchCoverage, CoverageTier, DropReason. Single source of truth for what a company looks like at every pipeline stage.
- src/ycai/scraper.py: yc-oss/api as the only sanctioned source per ADR 0001. Hard-fails on unreachable upstream — no fallback to the robots.txt-disallowed `ycombinator.com/companies?batch=...` URL.
- src/ycai/sanitizer.py: defensive PII strip (email, phone, address, API keys) before any data hits disk or the LLM.
- src/ycai/coverage.py: tier classifier (A/B/C) + dropped register. Coverage = (Tier A + Tier B) / total.
- src/ycai/verifier.py: async link-checker, HEAD with GET fallback.
- src/ycai/dashboard.py: single-file HTML output. Headline metric is coverage; the dropped register is rendered before any chart so quality issues are unmissable. No CDN, opens offline.
- src/ycai/cli.py: `ycai run-coverage` wires it together.

Quality probe — the user's feature request

The coverage probe acknowledges every dropped company and the specific reason (no quiet drops). Two coverage % numbers: vs. upstream, and vs. known YC-official count. The latter is the headline.

W26 first run: 63.3% coverage of the 196-company batch. 64 companies missing from yc-oss/api due to upstream staleness (last refreshed 2026-02-08); 8 dropped for missing fields (named in the register); 4 dead websites (kept as Tier B with a flag). Findings in docs/QUALITY_REPORT_W26.md and the sanitized example dashboard at examples/output/dashboard-w26-2026-05-01.html.

Hygiene

- 41 tests pass (sanitizer, scraper, coverage, smoke).
- Pre-commit + publish-check green.
- Test fixtures with intentional fake API keys gated by inline pragma + script exclusions so we keep credential blocking strict for everything else.
- Two new BACKLOG entries: B004 (description threshold tuning), B005 (name the missing-from-upstream companies).

Closes #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(phase-1): satisfy mypy --strict + UP038 isinstance modernization

Mypy --strict in CI flagged 9 errors not visible without strict locally. Fixes:

- scraper.py: type-narrow dict.get() results before int/str/parse_iso
- dashboard.py: explicit Counter[str] annotations
- cli.py: import RawCompany for _write_csv concrete signature
- sanitizer.py: drop unused type:ignore mypy 1.20 rejects
- isinstance(x, (int, str)) -> isinstance(x, int | str) (UP038)

41 tests green, ruff clean, mypy --strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 59f937d commit ceed52e

21 files changed

Lines changed: 3035 additions & 26 deletions

.secrets.baseline

Lines changed: 1 addition & 1 deletion
@@ -127,5 +127,5 @@
 }
 ],
 "results": {},
-"generated_at": "2026-05-01T17:09:18Z"
+"generated_at": "2026-05-01T19:00:38Z"
 }

BACKLOG.md

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
 - [B001] yc-oss/api is now sole source for batch listing — the previously planned `ycombinator.com/companies?batch=...` fallback is disallowed by robots.txt. PR #1 must implement a hard-fail path when yc-oss is unreachable, plus an upstream-staleness CI cron. — surfaced in: phase 0 verification — proposed: PR #1
 - [B002] Confirm Cloudflare or upstream caching on `yc-oss.github.io/api/*` for our use case (rate limit headroom on full-batch sweeps). — surfaced in: phase 0 — proposed: PR #1
 - [B003] CI annotations report Node 20 actions deprecated (forced to Node 24 from 2026-06-02). Refresh `actions/checkout`, `actions/setup-python`, `gitleaks/gitleaks-action` to Node-24-compatible majors before that date. — surfaced in: phase 0 CI run — proposed: ad-hoc PR before 2026-06-02
+- [B004] Tune `MIN_DESCRIPTION_CHARS` (currently 80). The W26 probe surfaced one borderline drop (`moda`, 57 chars). A small calibration study against borderline rows would let us pick a defensible threshold. — surfaced in: W26 quality probe — proposed: PR #2
+- [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3

 ## Done

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -9,5 +9,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added
 - Phase 0 bootstrap: MIT license, repo scaffolding, pre-commit + secret-scan, CI workflow, BACKLOG discipline, first ADR.
+- Phase 1 PR #1: yc-oss/api scraper, PII sanitizer, link verifier, coverage probe, single-file dashboard, Typer CLI (`ycai run-coverage`).
+- Coverage metric is the dashboard headline. The dropped register acknowledges every excluded company and the specific reason — no quiet drops.
+- First end-to-end probe on YC W26: 63.3% coverage of the official 196-company batch. Findings in `docs/QUALITY_REPORT_W26.md`.

 [Unreleased]: https://github.com/RyanAlberts/yc-ai-pulse/compare/main...HEAD

docs/QUALITY_REPORT_W26.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+# W26 quality probe — 2026-05-01
+
+First end-to-end run of the Phase 1 quality probe. No LLM calls; this is the
+data-quality floor against which classification + report generation will run
+in subsequent PRs.
+
+## Headline
+
+**63.3% of YC W26 analyzed** — 124 of 196 companies pass the data-quality bar.
+
+## Coverage breakdown
+
+| Source | Count | Notes |
+|---|---:|---|
+| YC W26 official (Demo Day, 2026-03-24) | 196 | Per the [VC Corner W26 breakdown](https://www.thevccorner.com/p/yc-w26-demo-day-2026-complete-breakdown). |
+| yc-oss/api fixture (last refreshed 2026-02-08) | 132 | 64 companies missing — upstream is stale by ~3 months. |
+| Tier A (full classification) | 120 | All required fields + website returned 2xx/3xx. |
+| Tier B (partial — website unreachable) | 4 | Required fields present; website 4xx/5xx. Kept in charts with a flag. |
+| Tier C (excluded) | 8 | Acknowledged in the dropped register below. |
+| **Analyzable (A + B)** | **124** | Feeds every chart in the dashboard. |
+
+**Coverage of upstream:** 93.9% (124 / 132).
+**Coverage of YC official:** 63.3% (124 / 196). ← **headline metric**
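The two percentages above share one numerator (Tier A + Tier B) and differ only in the denominator. A minimal sketch of that arithmetic, assuming a hypothetical helper named `coverage_pct` (not taken from `src/ycai/coverage.py`) and using the counts from the table:

```python
def coverage_pct(tier_a: int, tier_b: int, total: int) -> float:
    """Coverage = (Tier A + Tier B) / total, as a percentage with one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * (tier_a + tier_b) / total, 1)


# Counts from the coverage table: 120 Tier A, 4 Tier B.
vs_upstream = coverage_pct(120, 4, 132)  # denominator: rows present in yc-oss/api
vs_official = coverage_pct(120, 4, 196)  # denominator: official YC W26 count
print(f"upstream: {vs_upstream}%  official: {vs_official}%")
```

Keeping the denominator an explicit argument makes it harder to quote the upstream figure where the official one is meant.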
+
+## Why the gap
+
+### 1. Upstream staleness (the bigger problem — 64 companies)
+
+`yc-oss/api`'s `meta.json` reports `last_updated: 2026-02-08T01:49:11Z`. W26 Demo Day was **2026-03-24**, so the upstream was last refreshed ~6 weeks before the batch closed. The Demo Day–era cohort (~64 companies) is missing from the feed entirely.
+
+This is not a bug in `yc-ai-pulse`; `yc-oss/api` is community-maintained. Mitigations:
+
+1. **Already in place:** the dashboard surfaces this gap upfront ("Upstream gap" alert banner).
+2. **B001 (open in BACKLOG):** add a CI cron that warns if the upstream is >48h stale. The W26 case would have tripped it ~3 months ago.
+3. **Future:** consider a direct YC profile-page enrichment (allowed under robots.txt for `/companies/<slug>`) for slug lists discovered from elsewhere. Not in v0.1 scope.
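The staleness warning in the second mitigation could look roughly like the sketch below. It assumes the cron job has already read the `last_updated` string from `meta.json`; the function names and the `MAX_STALENESS` constant are hypothetical, and only the 48-hour limit and the timestamp come from this report.

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=48)  # limit named in the mitigation list


def parse_iso_utc(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def staleness(last_updated: str, now: datetime) -> timedelta:
    """Age of the upstream feed at time `now` (both timezone-aware)."""
    return now - parse_iso_utc(last_updated)


def is_stale(last_updated: str, now: datetime,
             limit: timedelta = MAX_STALENESS) -> bool:
    return staleness(last_updated, now) > limit


# The W26 case: feed timestamp from meta.json versus Demo Day.
demo_day = datetime(2026, 3, 24, tzinfo=timezone.utc)
age = staleness("2026-02-08T01:49:11Z", demo_day)
print(age.days, is_stale("2026-02-08T01:49:11Z", demo_day))
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the check deterministic under test.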
+
+### 2. Per-company drops (8 companies)
+
+Eight companies in the upstream feed were excluded from charts because they're missing fields the analysis layer requires. They are listed by name:
+
+| Slug | Name | Reason |
+|---|---|---|
+| `protent` | Protent | `long_description` empty |
+| `byteport` | Byteport | `long_description` empty |
+| `zerosettle` | ZeroSettle | `long_description` empty |
+| `traverse` | Traverse | `long_description` empty |
+| `grade` | Grade | `long_description` empty |
+| `zymbly` | Zymbly | `long_description` empty |
+| `moda` | Moda | `long_description` 57 chars (below 80-char threshold) |
+| `condor-energy` | Condor Energy | `website` field empty |
+
+Auditable threshold: `MIN_DESCRIPTION_CHARS = 80` ([src/ycai/coverage.py](../src/ycai/coverage.py)). Lowering it to 50 would bring `moda` back; raising it to 120 would drop ~6 more borderline rows. The current threshold balances inclusion with the requirement that classification be evidence-backed.
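To make the threshold rule concrete, here is a minimal sketch of a tier classifier. It is not the contents of `src/ycai/coverage.py`: the `Company` dataclass, the `website_ok` flag and the `classify` function are hypothetical, and the order of the checks is an assumption. Only `MIN_DESCRIPTION_CHARS = 80`, the field names `long_description` and `website`, and the tier meanings come from this report.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_DESCRIPTION_CHARS = 80  # value quoted in the report


@dataclass
class Company:  # hypothetical stand-in for the project's RawCompany model
    slug: str
    long_description: str
    website: str
    website_ok: Optional[bool] = None  # verifier verdict; None means not checked


def classify(company: Company,
             min_chars: int = MIN_DESCRIPTION_CHARS) -> Tuple[str, Optional[str]]:
    """Return (tier, reason). Tier C rows always carry a reason, so no drop is silent."""
    description = company.long_description.strip()
    if not description:
        return "C", "long_description empty"
    if len(description) < min_chars:
        return "C", f"long_description {len(description)} chars (below {min_chars}-char threshold)"
    if not company.website.strip():
        return "C", "website field empty"
    if company.website_ok:
        return "A", None
    return "B", "website unreachable"


# A 57-character description, like the `moda` row, under two thresholds.
short = Company("moda", "x" * 57, "https://example.com", website_ok=True)
print(classify(short))                # dropped at the default threshold of 80
print(classify(short, min_chars=50))  # kept if the threshold were lowered to 50
```

Returning the reason alongside the tier is what lets a dropped register name every excluded row.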
+
+### 3. Dead websites (4 companies — kept as Tier B)
+
+Tier B keeps these companies in the analysis but flags them in the dashboard:
+
+- `maywood` — Maywood
+- `caretta` — Caretta
+- `arzule` — Arzule
+- `servo7` — Servo7
+
+These had 4xx/5xx responses at probe time. Could be transient. The verifier reruns at report build time (PR #3 acceptance gate).
+
+## What we already know about the analyzable 124
+
+Industry distribution (from the YC-supplied `industry` field, no LLM yet):
+
+| Industry | Count |
+|---|---:|
+| B2B | 80 |
+| Industrials | 18 |
+| Healthcare | 9 |
+| Fintech | 8 |
+| Consumer | 6 |
+| Real Estate and Construction | 3 |
+
+The B2B-heavy distribution lines up with the [thevccorner.com breakdown](https://www.thevccorner.com/p/yc-w26-demo-day-2026-complete-breakdown) (64% B2B for W26). Internal consistency check passes.
+
+## Verifier results
+
+- `ok` (2xx/3xx): **127** websites
+- `dead` (4xx/5xx): **4** websites
+- `slow` (>5s): 0
+- `redirect` (>3 hops): 0
+- `error` (network): 0
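The `ok`, `dead` and `error` buckets above, together with the "HEAD with GET fallback" behaviour named in the commit message, can be sketched as follows. This is an illustration under assumptions, not the code in `src/ycai/verifier.py`: the HTTP call is injected as a hypothetical `fetch(method, url)` coroutine so the example runs without network access, the rule "fall back to GET on any 4xx/5xx HEAD response" is a guess at the policy, and the `slow` and `redirect` buckets are left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Fetch = Callable[[str, str], Awaitable[int]]  # (method, url) -> HTTP status code


def bucket(status: int) -> str:
    """Map a status code onto the report's buckets: 2xx/3xx ok, 4xx/5xx dead."""
    return "ok" if 200 <= status < 400 else "dead"


async def check(url: str, fetch: Fetch) -> str:
    try:
        status = await fetch("HEAD", url)
        if status >= 400:
            # Some servers reject HEAD but serve GET, so retry before calling it dead.
            status = await fetch("GET", url)
    except OSError:  # includes ConnectionError and socket timeouts
        return "error"
    return bucket(status)


async def check_all(urls: List[str], fetch: Fetch) -> Dict[str, str]:
    verdicts = await asyncio.gather(*(check(url, fetch) for url in urls))
    return dict(zip(urls, verdicts))


async def fake_fetch(method: str, url: str) -> int:
    """Stand-in for a real HTTP client, for demonstration only."""
    if "head-blocked" in url:
        return 405 if method == "HEAD" else 200
    if "gone" in url:
        return 404
    if "down" in url:
        raise ConnectionError("unreachable")
    return 200


results = asyncio.run(check_all(
    ["https://ok.example", "https://head-blocked.example",
     "https://gone.example", "https://down.example"],
    fake_fetch,
))
print(results)
```

In a real run the injected `fetch` would wrap an async HTTP client; the classification logic stays the same.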
+
+## Reproducing this run
+
+```bash
+PYTHONPATH=src python3 -m ycai.cli run-coverage \
+--batch winter-2026 \
+--yc-official-count 196
+```
+
+Output: `runs/2026-05-01-185520/{dashboard.html, coverage.json, companies.csv}`.
+
+## Implications for downstream PRs
+
+- **PR #2 (researcher + classifier):** must consume `coverage.json` directly so its denominator agrees with the dashboard. The LLM never sees Tier C rows.
+- **PR #3 (deck/memo):** the methodology slide must show the same 63.3% headline, same upstream-gap callout, same dropped-register table. CI should fail if the deck cites a different denominator.
+- **PR #5 (release):** consider adding a "data freshness" indicator to the README badge so users know if the latest cached run is from a stale upstream.
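The "CI should fail if the deck cites a different denominator" requirement could be enforced with a check along these lines. The schema of `coverage.json` is not shown in this commit, so the keys `yc_official_count`, `tier_a` and `tier_b` are hypothetical placeholders, as is the function name.

```python
def check_cited_headline(coverage: dict, cited_total: int, cited_pct: float) -> None:
    """Raise ValueError if a deck or memo cites numbers that disagree with coverage data."""
    total = coverage["yc_official_count"]                 # hypothetical key
    analyzable = coverage["tier_a"] + coverage["tier_b"]  # hypothetical keys
    expected_pct = round(100 * analyzable / total, 1)
    if cited_total != total:
        raise ValueError(
            f"denominator mismatch: cited {cited_total}, coverage data has {total}"
        )
    if abs(cited_pct - expected_pct) > 0.05:
        raise ValueError(
            f"headline mismatch: cited {cited_pct}%, coverage data gives {expected_pct}%"
        )


# Values from this report; a real gate would load them with json.load().
coverage = {"yc_official_count": 196, "tier_a": 120, "tier_b": 4}
check_cited_headline(coverage, cited_total=196, cited_pct=63.3)  # passes silently

try:
    check_cited_headline(coverage, cited_total=132, cited_pct=93.9)
except ValueError as exc:
    print(f"CI would fail: {exc}")
```

Raising an exception rather than returning a boolean means an unhandled mismatch fails the CI step by default.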
+
+## Open follow-ups (added to BACKLOG)
+
+- [B004] Tune `MIN_DESCRIPTION_CHARS`. 80 is a guess; a small calibration study against borderline rows (W26 surfaced one: `moda`, 57 chars) would let us pick a defensible value.
+- [B005] Add a "what's missing" section to the dashboard that compares yc-oss slugs to a slug list discovered from the YC `/companies/<slug>` profile pages, so we can name the 64 missing W26 companies, not just count them.
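The comparison proposed in B005 is a set difference over slugs. A minimal sketch, assuming both slug lists are already in hand; the function name and the example slugs are made up for illustration, and the label format follows the wording of the BACKLOG entry:

```python
from typing import Iterable, List


def missing_from_upstream(official_slugs: Iterable[str],
                          upstream_slugs: Iterable[str],
                          batch: str = "W26") -> List[str]:
    """Name every slug in the official list that is absent from the yc-oss/api feed."""
    missing = sorted(set(official_slugs) - set(upstream_slugs))
    return [f"{slug} (in YC {batch} but not in yc-oss/api)" for slug in missing]


# Placeholder slugs, not real W26 companies.
official = ["acme", "globex", "initech"]
upstream = ["globex"]
for line in missing_from_upstream(official, upstream):
    print(line)
```

Sorting the result keeps the register stable between runs, so a diff of two reports shows only real changes.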

examples/README.md

Lines changed: 7 additions & 6 deletions
@@ -1,11 +1,12 @@
 # Example outputs

-Sanitized sample artifacts will be checked in here as Phases 1 and 2 land:
+Sanitized sample artifacts. Every commit goes through `make publish-check` so PII can't slip in.

-- `output/dashboard.html` — full interactive dashboard (Phase 1)
-- `output/deck.pptx` — VC-style slide deck (Phase 2)
-- `output/report.docx` — narrative memo (Phase 2)
+| File | What |
+|---|---|
+| [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | Phase 1 dashboard for YC W26. Headline: 63.3% coverage of the 196-company batch, with the dropped register naming every excluded company. |
+| [`output/coverage-w26-2026-05-01.json`](output/coverage-w26-2026-05-01.json) | Machine-readable coverage report — what feeds the dashboard. |

-These examples never contain real founder PII, real API keys, or any data not already public on yc.com.
+The full quality writeup for W26 is in [`docs/QUALITY_REPORT_W26.md`](../docs/QUALITY_REPORT_W26.md).

-`make publish-check` blocks any commit that would put PII into this directory.
+Phase 2 will add `deck.pptx` and `report.docx` examples here.
