feat(phase-1): scraper + sanitizer + coverage probe with W26 quality report #6
…report

Phase 1 PR #1: ships the data-quality floor before any LLM cost is incurred.

What this lands
- src/ycai/schemas.py: pydantic models (RawCompany, CoverageRecord, BatchCoverage, CoverageTier, DropReason). Single source of truth for what a company looks like at every pipeline stage.
- src/ycai/scraper.py: yc-oss/api as the only sanctioned source per ADR 0001. Hard-fails on unreachable upstream, with no fallback to the robots.txt-disallowed `ycombinator.com/companies?batch=...` URL.
- src/ycai/sanitizer.py: defensive PII strip (email, phone, address, API keys) before any data hits disk or the LLM.
- src/ycai/coverage.py: tier classifier (A/B/C) + dropped register. Coverage = (Tier A + Tier B) / total.
- src/ycai/verifier.py: async link-checker, HEAD with GET fallback.
- src/ycai/dashboard.py: single-file HTML output. Headline metric is coverage; the dropped register is rendered before any chart so quality issues are unmissable. No CDN, opens offline.
- src/ycai/cli.py: `ycai run-coverage` wires it together.

Quality probe (the user's feature request)
The coverage probe acknowledges every dropped company and the specific reason (no quiet drops). Two coverage % numbers: vs. upstream, and vs. known YC-official count. The latter is the headline.

W26 first run: 63.3% coverage of the 196-company batch. 64 companies missing from yc-oss/api due to upstream staleness (last refreshed 2026-02-08); 8 dropped for missing fields (named in the register); 4 dead websites (kept as Tier B with a flag). Findings in docs/QUALITY_REPORT_W26.md and the sanitized example dashboard at examples/output/dashboard-w26-2026-05-01.html.

Hygiene
- 41 tests pass (sanitizer, scraper, coverage, smoke).
- Pre-commit + publish-check green.
- Test fixtures with intentional fake API keys gated by inline pragma + script exclusions so we keep credential blocking strict for everything else.
- Two new BACKLOG entries: B004 (description threshold tuning), B005 (name the missing-from-upstream companies).
Closes #1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
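The "HEAD with GET fallback" behaviour of the link checker can be sketched as follows. This is a minimal illustration, not the code in src/ycai/verifier.py: the `Requester` transport type, the `check_url` and `check_all` names, the choice of 405/501 as fallback triggers, and the concurrency limit are all assumptions made for the example. The transport is injected so the logic can be exercised without a network or an HTTP library.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical transport signature: (method, url) -> HTTP status code.
Requester = Callable[[str, str], Awaitable[int]]


async def check_url(url: str, request: Requester) -> bool:
    """Return True if the URL answers with a non-error status."""
    try:
        status = await request("HEAD", url)
        # Assumption: servers that reject HEAD (405) or do not implement it (501)
        # get a second chance with GET before being marked dead.
        if status in (405, 501):
            status = await request("GET", url)
    except OSError:
        # Connection-level failures count as a dead link.
        return False
    return status < 400


async def check_all(
    urls: list[str], request: Requester, limit: int = 10
) -> dict[str, bool]:
    """Check many URLs concurrently, capped at `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> tuple[str, bool]:
        async with semaphore:
            return url, await check_url(url, request)

    return dict(await asyncio.gather(*(guarded(u) for u in urls)))
```

In a real run the injected `request` would wrap whatever HTTP client the project uses; a dead result would then flag the company rather than drop it, matching the "kept as Tier B with a flag" rule above.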
Code Review
This pull request implements Phase 1 of the project, introducing a scraper for yc-oss/api, a PII sanitizer, an async link verifier, and a data-quality coverage probe. It adds a Typer-based CLI with a run-coverage command that generates machine-readable JSON reports and interactive HTML dashboards. Key logic includes tier-based classification of companies to identify data gaps and a 'dropped register' for transparency. Feedback focuses on improving type safety in the CLI, refining HTML/JSON escaping in the dashboard generator, and ensuring dynamic rendering of company tiers in the output tables.
| industry_rows = "\n".join( | ||
| f"<tr><td><code>{_escape(c.slug)}</code></td>" | ||
| f"<td>{_escape(c.name)}</td>" | ||
| f"<td>{_escape(c.industry)}</td>" | ||
| f'<td><span class="badge tier-A">A</span></td></tr>' | ||
| for c in keepers | ||
| ) |
The industry rows table currently hardcodes the 'Tier A' badge. Since the report includes both Tier A and Tier B companies, the badge should dynamically reflect the actual tier of each company.
| industry_rows = "\n".join( | |
| f"<tr><td><code>{_escape(c.slug)}</code></td>" | |
| f"<td>{_escape(c.name)}</td>" | |
| f"<td>{_escape(c.industry)}</td>" | |
| f'<td><span class="badge tier-A">A</span></td></tr>' | |
| for c in keepers | |
| ) | |
| slug_to_tier = {r.slug: r.tier for r in coverage.records} | |
| industry_rows = "\n".join( | |
| f"<tr><td><code>{_escape(c.slug)}</code></td>" | |
| f"<td>{_escape(c.name)}</td>" | |
| f"<td>{_escape(c.industry)}</td>" | |
| f'<td><span class="badge tier-{slug_to_tier[c.slug]}">{slug_to_tier[c.slug]}</span></td></tr>' | |
| for c in keepers | |
| ) |
| region_chart=_bar_chart(regions, coverage.analyzable_count, top=12), | ||
| dropped_table=_dropped_table(coverage), | ||
| methodology_text=methodology, | ||
| raw_data_json=_escape(json.dumps(raw, default=str)), |
HTML-escaping the JSON string inside a <script type="application/json"> tag breaks the JSON format (e.g., by replacing quotes with &quot;). This will cause client-side parsing to fail. To safely embed JSON in HTML, you should only escape the sequence </ to prevent the browser from prematurely closing the script tag.
| raw_data_json=_escape(json.dumps(raw, default=str)), | |
| raw_data_json=json.dumps(raw, default=str).replace("</", "<\\/"), |
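The difference between the two approaches can be checked with the standard library alone. The payload below is illustrative (the real `raw` dict in dashboard.py is not shown in this review), and the variable names are made up for the example. The point it demonstrates: the contents of a script element are raw text, so entities such as `&quot;` are never decoded back, and the parser on the client side sees them literally.

```python
import html
import json

# Illustrative payload; not the repository's actual data.
raw = {"name": 'Acme "AI"', "note": "</script><b>x</b>"}

dumped = json.dumps(raw, default=str)

# HTML-escaping rewrites every double quote, so the result is no longer JSON.
escaped = html.escape(dumped)
try:
    json.loads(escaped)
    escaped_is_valid = True
except json.JSONDecodeError:
    escaped_is_valid = False

# Escaping only "</" keeps the text valid JSON ("\/" is a legal JSON escape)
# and removes any literal "</script" from the embedded text.
safe = dumped.replace("</", "<\\/")
round_tripped = json.loads(safe)

print(escaped_is_valid)      # whether the HTML-escaped text still parses
print(round_tripped == raw)  # whether the "</"-escaped text round-trips
```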
| import logging | ||
| from datetime import UTC, datetime | ||
| from pathlib import Path | ||
|
|
||
| def app() -> None: | ||
| """Entry shim. Replaced with a Typer app in Phase 1.""" | ||
| print("yc-ai-pulse: Phase 0 scaffold. CLI lands in Phase 1.") | ||
| import typer | ||
| from rich.console import Console | ||
| from rich.table import Table | ||
|
|
||
| from ycai import __version__ | ||
| from ycai.coverage import compute_coverage, coverage_summary_lines | ||
| from ycai.dashboard import render as render_dashboard | ||
| from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours | ||
| from ycai.verifier import check_urls, split_by_status |
The csv module should be imported at the top level. Additionally, importing RawCompany here allows for better type hinting in the _write_csv function.
| import logging | |
| from datetime import UTC, datetime | |
| from pathlib import Path | |
| def app() -> None: | |
| """Entry shim. Replaced with a Typer app in Phase 1.""" | |
| print("yc-ai-pulse: Phase 0 scaffold. CLI lands in Phase 1.") | |
| import typer | |
| from rich.console import Console | |
| from rich.table import Table | |
| from ycai import __version__ | |
| from ycai.coverage import compute_coverage, coverage_summary_lines | |
| from ycai.dashboard import render as render_dashboard | |
| from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours | |
| from ycai.verifier import check_urls, split_by_status | |
| import csv | |
| import logging | |
| from datetime import UTC, datetime | |
| from pathlib import Path | |
| import typer | |
| from rich.console import Console | |
| from rich.table import Table | |
| from ycai import __version__ | |
| from ycai.coverage import compute_coverage, coverage_summary_lines | |
| from ycai.dashboard import render as render_dashboard | |
| from ycai.schemas import RawCompany | |
| from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours | |
| from ycai.verifier import check_urls, split_by_status |
| def _write_csv(companies: list, path: Path) -> None: | ||
| """Tiny CSV writer that doesn't pull in pandas just for serialization.""" | ||
| import csv |
Specifying list[RawCompany] instead of the generic list improves type safety. The local csv import can be removed as it is now handled at the top level.
| def _write_csv(companies: list, path: Path) -> None: | |
| """Tiny CSV writer that doesn't pull in pandas just for serialization.""" | |
| import csv | |
| def _write_csv(companies: list[RawCompany], path: Path) -> None: | |
| """Tiny CSV writer that doesn't pull in pandas just for serialization.""" |
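For reference, a typed CSV writer along these lines could look like the sketch below. The body of the real `_write_csv` is not shown in this diff, so this is an assumption-laden illustration: `Company` is a hypothetical dataclass standing in for `ycai.schemas.RawCompany` (which is a pydantic model with fields not listed here), and `write_csv` is a made-up name.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class Company:  # hypothetical stand-in for ycai.schemas.RawCompany
    slug: str
    name: str
    industry: str


def write_csv(companies: list[Company], path: Path) -> None:
    """Write one row per company, with a header derived from the model fields."""
    columns = [f.name for f in fields(Company)]
    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for company in companies:
            writer.writerow(asdict(company))
```

With the concrete element type in the signature, mypy can flag a caller that passes a list of dicts or of some other model, which the bare `list` annotation would accept silently.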
| import json | ||
| from collections import Counter | ||
| from pathlib import Path | ||
|
|
||
| from ycai.schemas import BatchCoverage, CoverageTier, RawCompany |
Add import html to the top-level imports to support more robust HTML escaping.
| import json | |
| from collections import Counter | |
| from pathlib import Path | |
| from ycai.schemas import BatchCoverage, CoverageTier, RawCompany | |
| import html | |
| import json | |
| from collections import Counter | |
| from pathlib import Path | |
| from ycai.schemas import BatchCoverage, CoverageTier, RawCompany |
| def _escape(text: str) -> str: | ||
| return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;") |
Using html.escape is preferred over a manual implementation as it is more comprehensive and follows standard practices.
| def _escape(text: str) -> str: | |
| return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;") | |
| def _escape(text: str) -> str: | |
| return html.escape(text) |
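One concrete difference is single-quote handling. The sketch below compares a hand-rolled escaper of the kind shown in the diff (named `manual_escape` here purely for illustration) with `html.escape`, whose default `quote=True` also escapes `'`. That matters if escaped text ever lands inside a single-quoted attribute.

```python
import html


def manual_escape(text: str) -> str:
    # Hand-rolled escaper: covers &, <, > and the double quote only.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


sample = "<a href='x' title=\"R&D\">"
print(manual_escape(sample))  # single quotes pass through unchanged
print(html.escape(sample))    # single quotes become &#x27;
```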
Mypy --strict in CI flagged 9 errors not visible without strict locally.

Fixes:
- scraper.py: type-narrow dict.get() results before int/str/parse_iso
- dashboard.py: explicit Counter[str] annotations
- cli.py: import RawCompany for _write_csv concrete signature
- sanitizer.py: drop unused type:ignore mypy 1.20 rejects
- isinstance(x, (int, str)) -> isinstance(x, int | str) (UP038)

41 tests green, ruff clean, mypy --strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First publishable release.

What ships in v0.1.0
- Phase 0 + Phase 1 of the project plan: scraper, sanitizer, link verifier, coverage probe, LLM enrichment with anti-hallucination Layer 1, enriched dashboard, cited-URL publish gate, resume + re-render commands, raw failure capture.
- 103 tests passing, mypy --strict clean, secret-scan clean.
- Real W26 results checked in: 63.3% coverage, 95% high-confidence on the LLM enrichment, 0 schema failures, 0 hallucinated source URLs, top finding 'W26 = the agentic batch' on n=118.

Mechanics
- pyproject.toml: 0.0.1 -> 0.1.0, classifier bumped pre-alpha -> alpha.
- src/ycai/__init__.py: __version__ matches.
- tests/test_smoke.py: version assertion bumped.
- CHANGELOG.md: 0.1.0 release notes synthesizing PR #6-#9.
- README.md: status table updated, quickstart documents the actual v0.1 commands (run-coverage / resume / dashboard).
- .github/workflows/release.yml: build wheel+sdist on tag push, publish to PyPI via Trusted Publishing (id-token), attach artifacts to GitHub release.

Local smoke
- python -m build produces yc_ai_pulse-0.1.0-py3-none-any.whl (38KB) and yc_ai_pulse-0.1.0.tar.gz (134KB).
- pipx install --force <wheel> succeeds; ycai version returns 0.1.0; ycai run-coverage --batch winter-2026 succeeds end-to-end from a clean /tmp directory.

PyPI Trusted Publishing setup (one-time, on PyPI side)
- https://pypi.org/manage/project/yc-ai-pulse/settings/publishing/
- Repo: RyanAlberts/yc-ai-pulse
- Workflow: release.yml
- Environment: pypi
- Until configured, the publish job will fail; the GitHub release job still attaches built wheels.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Phase 1 PR #1: lands the data-quality floor before any LLM cost is incurred. First end-to-end probe on real W26 data revealed only 63.3% of the 196-company batch is analyzable from the upstream feed.
Closes #1.
Why
Per the user's directive: "don't hallucinate, ground each dashboard in data, acknowledge any companies that drop off due to data quality, and report a % of YC batch coverage metric." This PR makes coverage the headline metric and the dropped register the most-visible block on the dashboard — no quiet drops.
How
- src/ycai/schemas.py
- src/ycai/scraper.py
- src/ycai/sanitizer.py
- src/ycai/coverage.py
- src/ycai/verifier.py
- src/ycai/dashboard.py
- src/ycai/cli.py: `ycai run-coverage` wires it all up.

W26 quality probe results
64 companies are missing because yc-oss/api is stale (last refreshed 2026-02-08, ~3 months before W26 closed). 8 are dropped for missing fields and named explicitly. 4 had dead websites.
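The headline figure follows directly from these counts. The sketch below reproduces the arithmetic; `CoverageCounts` is a hypothetical container written for this example, not the repository's `BatchCoverage` model. It assumes the upstream count is the official total minus the 64 missing companies, and that the 4 dead-website companies stay in the numerator because they are kept as Tier B. The vs-upstream percentage it prints is derived from those assumptions rather than quoted from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageCounts:  # hypothetical; not the repo's BatchCoverage model
    official_total: int    # known YC-official batch size
    missing_upstream: int  # absent from yc-oss/api
    dropped: int           # present upstream but missing required fields

    @property
    def upstream_total(self) -> int:
        return self.official_total - self.missing_upstream

    @property
    def analyzable(self) -> int:
        # Tier A + Tier B; dead-website companies remain in Tier B with a flag.
        return self.upstream_total - self.dropped

    def pct_vs_official(self) -> float:
        return round(100 * self.analyzable / self.official_total, 1)

    def pct_vs_upstream(self) -> float:
        return round(100 * self.analyzable / self.upstream_total, 1)


w26 = CoverageCounts(official_total=196, missing_upstream=64, dropped=8)
print(w26.analyzable)         # 124
print(w26.pct_vs_official())  # 63.3, the headline number
print(w26.pct_vs_upstream())  # 93.9, derived here, not stated in the report
```

The gap between the two percentages is the reason the official count is the headline: measured against the stale feed alone, coverage would look far healthier than it is.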
Sanitized example: examples/output/dashboard-w26-2026-05-01.html. Full writeup: docs/QUALITY_REPORT_W26.md.

Test plan
- `make publish-check` green (test fixtures with intentional fake credential patterns gated by inline pragmas + script exclusions).
- dashboard.html, coverage.json, companies.csv. Verified manually.

Anti-hallucination invariants this PR adds
Backlog spawned by this PR
- `MIN_DESCRIPTION_CHARS` based on borderline rows.
- `/companies/<slug>` profile pages).

Acceptance
- `make validate-p0` green locally
- `make publish-check` green
- examples/output/

🤖 Generated with Claude Code