RyanAlberts
diff --git a/‎BACKLOG.md‎
Lines changed: 1 addition & 1 deletion b/‎BACKLOG.md‎
diff --git a/‎CHANGELOG.md‎
Lines changed: 4 additions & 1 deletion b/‎CHANGELOG.md‎
diff --git a/‎docs/QUALITY_REPORT_W26.md‎
Lines changed: 67 additions & 0 deletions b/‎docs/QUALITY_REPORT_W26.md‎
diff --git a/‎examples/README.md‎
Lines changed: 3 additions & 1 deletion b/‎examples/README.md‎
@@ -20,7 +20,7 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
 - [B004] Tune `MIN_DESCRIPTION_CHARS` (currently 80). The W26 probe surfaced one borderline drop (`moda`, 57 chars). A small calibration study against borderline rows would let us pick a defensible threshold. — surfaced in: W26 quality probe — proposed: PR #2
 - [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3
 - [B006] Track schema-validation failure rate during enrichment as a tracked metric. The W26 smoke run had 1/5 (20%) parse failures (`velum-labs` — likely rationale exceeded the 400 char limit). Measure this across the full batch and tune prompt or schema if rate exceeds ~5%. — surfaced in: PR #2 smoke — proposed: PR #3
-- [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke. Confirmed in PR #3 full run: 45 of 83 high-confidence rows have OSS posture `unknown`, 52 have tech_stack `unknown`. — proposed: PR after v0.1
+- [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke. Confirmed in PR #3 full run. — **resolved in PR #11** (OSS unknown rate 55% → 21%, tech-stack identified mentions 14 → 41 on the same W26 cohort).
 - [B008] Schema-validation failure rate on the full W26 enrichment was **23%** (29 of 124). The lenient parser added in PR #3 only relaxed `industry_secondary`. Most remaining failures likely come from the model emitting `ai_capability` or `tech_stack` values outside our closed enums. Either extend the lenient parser to those fields, capture a sample of raw failed responses to audit, or introduce `tool_use`-style schema enforcement on the API backend so the model is constrained at decode time. — surfaced in: PR #3 full run — **resolved in PR #4** (rate now 0%; root cause was over-strict 400-char cap on `rationale`, fixed by truncate-not-reject; lenient parsing extended to `ai_capability` and `tech_stack`).
 
 ## Done
 
@@ -7,7 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
-_(no changes since 0.1.0)_
+### Added
+- **PR #11 — depth=1 website crawl (B007 resolved)**: new `src/ycai/crawler.py` module. Polite, robots-aware, max 5 pages per company, 30 KB per page, 4-second timeout. Pages ranked by signal-path priority (`/pricing`, `/security`, `/about`, `/docs`, `/open-source`, …). HTML stripped and PII-sanitized before any LLM call. Crawled URLs are also accepted by the source-URL guard so the LLM can cite specific pages as evidence. New `--no-crawl` flag opts out.
+- W26 with crawler enabled: **OSS posture `unknown` rate dropped 55% → 21%** (target was <30%). Tech-stack identified mentions: 14 → 41. Vision capability: 17 → 26. Multimodal: 17 → 22.
+- 13 new crawler tests (116 total), all network-free via `httpx.MockTransport`. They cover robots-disallow path-level enforcement, content-type filtering (PDF/JSON skipped), the max-pages cap, dedup, fragment stripping, host restriction (no off-site fetches), and a PII redaction round-trip.
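
The path-level robots enforcement mentioned above can be illustrated with the standard library alone. This is a minimal sketch, not the code in `src/ycai/crawler.py`: the user-agent string, the helper names, and the sample `robots.txt` are all hypothetical, and the real module may fetch and cache robots rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string; the real crawler's value is not shown in this PR.
USER_AGENT = "ycai-crawler"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already fetched (no network access here)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


# Illustrative rules: one disallowed path prefix, everything else allowed.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /admin/
"""

sample_parser = build_robots(SAMPLE_ROBOTS)
pricing_ok = is_allowed(sample_parser, "https://example.com/pricing")
admin_ok = is_allowed(sample_parser, "https://example.com/admin/users")
```

Because the rules are parsed from a string, a check like this stays network-free, in the same spirit as the mocked-transport tests.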
 
 ## [0.1.0] — 2026-05-01
 
 
@@ -241,3 +241,70 @@ OSS posture is meaningfully different now: closed (48), unknown (65), api-only (
 - Tier A+B with high-confidence LLM analysis: **118 of 196 (60.2%)** — the most honest "what we actually know about W26" number.
 
 The headline coverage % on the dashboard is unchanged at 63.3% (because that's the data-quality denominator). But for the deck/memo, the 60.2% number is what should be cited as "the share of W26 we can substantively classify."
+
+---
+
+## PR #11 — depth=1 website crawl (B007 resolved)
+
+The v0.1 limitation: the LLM only saw the YC `long_description`, so OSS posture and tech stack came back as `unknown` for most companies. PR #11 adds a polite, robots-aware depth=1 website crawl (max 5 pages per company, 30 KB per page, 4-second timeout, ranked by signal-path priority: `/pricing`, `/security`, `/about`, `/docs`, etc.). Each crawled page is HTML-stripped and PII-sanitized before it ever reaches the LLM.
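
The page-selection rules described above (same-host only, fragments stripped, duplicates dropped, ranked by signal path, capped at 5 pages) could look roughly like the sketch below. It is an illustration under assumptions, not the shipped crawler: the function names, the exact priority order, and the choice to count the homepage toward the 5-page cap are all hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed priority order, taken from the paths named in this report.
SIGNAL_PATHS = ("/pricing", "/security", "/about", "/docs", "/open-source")
MAX_PAGES = 5  # cap stated in this report; homepage counted toward it (assumption)


def _priority(path: str) -> int:
    """Lower is better; paths outside SIGNAL_PATHS rank last."""
    for rank, prefix in enumerate(SIGNAL_PATHS):
        if path == prefix or path.startswith(prefix + "/"):
            return rank
    return len(SIGNAL_PATHS)


def select_pages(homepage: str, hrefs: list[str], max_pages: int = MAX_PAGES) -> list[str]:
    """Pick the homepage plus the best-ranked same-host links found on it."""
    root, _ = urldefrag(homepage)
    host = urlparse(root).netloc
    seen = {root}
    candidates = []
    for href in hrefs:
        # Resolve relative links and drop any #fragment.
        url, _ = urldefrag(urljoin(root, href))
        parsed = urlparse(url)
        # Host restriction: skip off-site links and non-HTTP schemes.
        if parsed.scheme not in ("http", "https") or parsed.netloc != host:
            continue
        if url in seen:  # dedup
            continue
        seen.add(url)
        candidates.append(url)
    # sorted() is stable, so ties keep their original link order.
    ranked = sorted(candidates, key=lambda u: _priority(urlparse(u).path))
    return [root] + ranked[: max_pages - 1]


example = select_pages(
    "https://acme.example/",
    ["/blog", "/docs#intro", "/docs", "https://github.com/acme",
     "/pricing", "mailto:hi@acme.example", "/about", "/careers", "/security"],
)
```

With these inputs, `example` holds the homepage followed by `/pricing`, `/security`, `/about`, and `/docs`; the off-site GitHub link, the `mailto:` link, and the duplicate `/docs#intro` are dropped.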
+
+### Coverage didn't change — quality of classification did
+
+The 124-company cohort carries over, and the high-confidence rate stays close to its PR #9 level (113 of 124, or 91%, vs. 118 of 124, or 95%; both well above the v0.1 target). What changed is the model's ability to ground its answers in actual evidence.
+
+### OSS posture, before and after
+
+| Posture | PR #9 (no crawl) | PR #11 (with crawl) | Δ |
+|---|---:|---:|---:|
+| **unknown** | **65 (55%)** | **24 (21%)** | **−41** |
+| closed | 50 (42%) | 75 (66%) | +25 |
+| api-only | 3 | 8 | +5 |
+| source-available | 1 | 5 | +4 |
+| fully-open | 1 | 1 | – |
+
+**OSS-posture `unknown` rate dropped 55% → 21%** — a 62% relative reduction. PR target was <30%. Hit.
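
For readers who want to check the arithmetic, the short sketch below reproduces the percentages from the table counts. It assumes the denominators are the high-confidence row counts reported elsewhere in this report (118 for PR #9, 113 for PR #11), and that the 62% figure is computed from the rounded percentages; the variable and function names are illustrative.

```python
# Counts from the table above; denominators are an assumption (high-confidence rows).
UNKNOWN_BEFORE, ROWS_BEFORE = 65, 118   # PR #9, no crawl
UNKNOWN_AFTER, ROWS_AFTER = 24, 113     # PR #11, with crawl


def pct(count: int, total: int) -> int:
    """Share of rows as a whole-number percentage."""
    return round(100 * count / total)


before_pct = pct(UNKNOWN_BEFORE, ROWS_BEFORE)   # 55
after_pct = pct(UNKNOWN_AFTER, ROWS_AFTER)      # 21

# Relative reduction, computed from the rounded percentages as quoted in the text.
relative_reduction = round(100 * (before_pct - after_pct) / before_pct)  # 62

# Absolute change in row count, matching the delta column.
delta_rows = UNKNOWN_AFTER - UNKNOWN_BEFORE     # -41
```

Using the unrounded rates instead gives a relative reduction of about 61%, so the quoted figure depends slightly on where rounding happens.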
+
+The net 41 rows that left `unknown` went mostly to `closed` (the model now has evidence: pricing pages, "Request a demo" CTAs, no GitHub link in the footer), some to `api-only` (the model spotted "Get an API key" in /docs), and a handful to `source-available` (the model saw a GitHub footer link with explicit license language).
+
+### Tech-stack mentions went from 14 → 41
+
+The headline number is harder to summarize because most YC startups still don't advertise the model provider on marketing pages. But the *absolute count* of identified tech-stack signals is what matters:
+
+| Stack | PR #9 | PR #11 |
+|---|---:|---:|
+| custom-model | 13 | 24 |
+| anthropic | 1 | 6 |
+| openai | 0 | 3 |
+| huggingface | 0 | 2 |
+| pytorch | 0 | 2 |
+| google-gemini | 0 | 2 |
+| qwen | 0 | 1 |
+| langchain | 0 | 1 |
+| **identified (non-unknown)** | **14** | **41** |
+
+Tech-stack `unknown` rate barely moved (64% → 57%) because the homepage of an "AI for legal teams" startup usually doesn't mention which model it uses. Pushing this further would require following docs/security pages to depth=2, which we deferred for politeness reasons.
+
+### Capability shifts (mostly small, two interesting movers)
+
+| Capability | PR #9 | PR #11 |
+|---|---:|---:|
+| agents | 68 | 69 |
+| nlp-classic | 38 | 38 |
+| rag | 34 | 32 |
+| data-pipeline | 33 | 37 |
+| **vision** | **17** | **26** |
+| **multimodal** | **17** | **22** |
+
+Vision and multimodal both saw real lifts — those are the capabilities the model can spot from product pages with screenshots, GIFs, and demo videos. Marketing surfaces help here.
+
+### Schema failures: 1 (was 0)
+
+Slightly higher prompt size from the crawled context pushed exactly 1 row's rationale over the cap (`synthetic-sciences`). Captured in `raw_failures.jsonl`. Not worth tightening.
+
+### Headline numbers, updated
+
+- **Coverage of YC official: 63.3%** — unchanged (data-quality denominator)
+- **High-confidence enrichment: 113 / 124 (91%)**, down from 118/124 (95%) without crawl; longer prompts make it harder to keep rationales within the cap
+- **Substantively-classified share of YC W26: 113 / 196 = 57.7%** — was 60.2%
+
+Slightly fewer companies make it into the headline cohort, but each of those cohort entries now carries materially more signal — `oss_posture` and (to a lesser extent) `tech_stack` are now real values for the majority of rows, instead of `unknown` masquerading as data.
@@ -4,7 +4,9 @@ Sanitized sample artifacts. Every commit goes through `make publish-check` so PI
 
 | File | What |
 |---|---|
-| [`output/dashboard-w26-pr4-2026-05-01.html`](output/dashboard-w26-pr4-2026-05-01.html) | **PR #4 dashboard — current best.** 118 of 124 high-confidence (95%). Schema-failure rate now 0%. Top finding: 58% of n=118 build agents. |
+| [`output/dashboard-w26-pr11-2026-05-01.html`](output/dashboard-w26-pr11-2026-05-01.html) | **PR #11 dashboard — current best.** Depth=1 crawl enabled. OSS posture `unknown` rate dropped 55%→21% on the same W26 cohort. Tech-stack identified mentions 14→41. |
+| [`output/analyses-w26-pr11-2026-05-01.json`](output/analyses-w26-pr11-2026-05-01.json) | PR #11 enrichment with crawled context. 113/124 high-confidence. |
+| [`output/dashboard-w26-pr4-2026-05-01.html`](output/dashboard-w26-pr4-2026-05-01.html) | PR #4 / v0.1.0 dashboard. Useful baseline to compare against PR #11 (no crawl, 65 OSS-unknown rows). |
 | [`output/analyses-w26-pr4-2026-05-01.json`](output/analyses-w26-pr4-2026-05-01.json) | **PR #4 enrichment.** 124 companies, 0 schema failures, 6 genuine model lows, 0 hallucinated source URLs. |
 | [`output/dashboard-w26-enriched-2026-05-01.html`](output/dashboard-w26-enriched-2026-05-01.html) | PR #3 dashboard, kept as before/after comparison (67% high-confidence, 23% schema failures). |
 | [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | PR #1 baseline (coverage-only mode, no LLM). |