feat(phase-1): scraper + sanitizer + coverage probe with W26 quality report #6

Merged: RyanAlberts merged 2 commits into main from phase-1-pr1-scraper-coverage on May 1, 2026

@RyanAlberts (Owner)

What

Phase 1 PR #1: lands the data-quality floor before any LLM cost is incurred. First end-to-end probe on real W26 data revealed only 63.3% of the 196-company batch is analyzable from the upstream feed.

Closes #1.

Why

Per the user's directive: "don't hallucinate, ground each dashboard in data, acknowledge any companies that drop off due to data quality, and report a % of YC batch coverage metric." This PR makes coverage the headline metric and the dropped register the most-visible block on the dashboard — no quiet drops.

How

| Module | Role |
| --- | --- |
| src/ycai/schemas.py | Pydantic models. Single source of truth. |
| src/ycai/scraper.py | yc-oss/api as the only sanctioned source per ADR 0001. Hard-fail on unreachable. |
| src/ycai/sanitizer.py | Defensive PII strip before disk/LLM. Idempotent. |
| src/ycai/coverage.py | Tier A/B/C classifier + dropped register. |
| src/ycai/verifier.py | Async link-checker (HEAD + GET fallback). |
| src/ycai/dashboard.py | Single-file HTML. Coverage headline + dropped register render before any chart. |
| src/ycai/cli.py | `ycai run-coverage` wires it all up. |

W26 quality probe results

Batch: Winter 2026 (132 in upstream / 196 official)
Tier A (full):     120
Tier B (partial):  4    (website 4xx/5xx — kept with a flag)
Tier C (excluded): 8    (named in the dropped register)
Coverage of upstream:    93.9%
Coverage of YC official: 63.3%  ← headline

64 companies are missing because yc-oss/api is stale (last refreshed 2026-02-08, ~3 months before W26 closed). 8 are dropped for missing fields and named explicitly. 4 had dead websites and are kept as Tier B with a flag.
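
The two percentages follow directly from the tier counts. A minimal sketch of the arithmetic, assuming coverage is (Tier A + Tier B) / total as the commit message states; the function name `coverage_pct` is illustrative and not taken from `src/ycai/coverage.py`.

```python
def coverage_pct(tier_a: int, tier_b: int, total: int) -> float:
    """Coverage = (Tier A + Tier B) / total, as a percentage with one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * (tier_a + tier_b) / total, 1)


# W26 counts reported in the probe results.
TIER_A, TIER_B, TIER_C = 120, 4, 8
UPSTREAM = TIER_A + TIER_B + TIER_C  # 132 rows present in yc-oss/api
OFFICIAL = 196                       # YC-official batch size

vs_upstream = coverage_pct(TIER_A, TIER_B, UPSTREAM)  # 93.9
vs_official = coverage_pct(TIER_A, TIER_B, OFFICIAL)  # 63.3 (the headline)
missing_from_upstream = OFFICIAL - UPSTREAM           # 64
```

Reporting both denominators is what makes upstream staleness visible: the same 124 analyzable companies look healthy against 132 and much weaker against 196.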

Sanitized example: examples/output/dashboard-w26-2026-05-01.html. Full writeup: docs/QUALITY_REPORT_W26.md.

Test plan

  • 41 tests pass (sanitizer, scraper, coverage, smoke). Network-free thanks to httpx MockTransport.
  • make publish-check green (test fixtures with intentional fake credential patterns gated by inline pragmas + script exclusions).
  • Real run on W26: produces dashboard.html, coverage.json, companies.csv. Verified manually.
  • Hard-fail when yc-oss is unreachable: tested via 404 mock.
  • PII strip is idempotent: tested.
  • Sanitizer redacts before data hits disk: tested via cache_dir round-trip.
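
The idempotence check can be pictured as "sanitizing twice equals sanitizing once". The sketch below is an assumption-laden illustration: the two regexes and the `[REDACTED_*]` placeholders are invented here and are not the patterns used in `src/ycai/sanitizer.py`.

```python
import re

# Illustrative patterns only; the real sanitizer's rules are not shown in this PR.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sanitize(text: str) -> str:
    """Replace emails and phone-like digit runs with fixed placeholders."""
    text = _EMAIL.sub("[REDACTED_EMAIL]", text)
    return _PHONE.sub("[REDACTED_PHONE]", text)


sample = "Reach the founder at jane@example.com or +1 (415) 555-0100."
once = sanitize(sample)
twice = sanitize(once)  # a second pass must not change anything
```

Idempotence holds here because the placeholders contain no `@` and no digits, so neither pattern can match its own output.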

Anti-hallucination invariants this PR adds

  • All numbers in the dashboard are computed in Python from validated rows. No LLM yet.
  • Required-fields gate (description ≥80 chars, website starts with http/https, industry non-empty) before any company appears in charts.
  • Every dropped company is named with a specific reason; no quiet exclusions.
  • Coverage % is rendered against both upstream and (when known) YC-official count, so upstream staleness is visible.
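
A rough sketch of how the required-fields gate and the tiering could fit together, based only on the rules stated in this PR. `Tier`, `Company`, `drop_reasons`, `classify`, and the reason strings are hypothetical stand-ins, not the project's `CoverageTier`, `RawCompany`, or `DropReason`.

```python
from dataclasses import dataclass
from enum import Enum

MIN_DESCRIPTION_CHARS = 80  # threshold named in the PR; tuning is tracked as B004


class Tier(Enum):  # hypothetical stand-in for CoverageTier
    A = "A"  # required fields present, website reachable
    B = "B"  # required fields present, website returned 4xx/5xx
    C = "C"  # failed the gate; listed in the dropped register


@dataclass
class Company:  # hypothetical, far smaller than RawCompany
    slug: str
    description: str
    website: str
    industry: str


def drop_reasons(c: Company) -> list[str]:
    """Collect every failed requirement so the register can name each one."""
    reasons = []
    if len(c.description.strip()) < MIN_DESCRIPTION_CHARS:
        reasons.append("description_too_short")
    # Assumption: "starts with http/https" means an explicit scheme prefix.
    if not c.website.startswith(("http://", "https://")):
        reasons.append("website_not_http")
    if not c.industry.strip():
        reasons.append("industry_missing")
    return reasons


def classify(c: Company, website_ok: bool) -> tuple[Tier, list[str]]:
    """Tier C if the gate fails, otherwise A or B depending on the link check."""
    reasons = drop_reasons(c)
    if reasons:
        return Tier.C, reasons
    return (Tier.A if website_ok else Tier.B), []


good = Company("acme", "x" * 80, "https://acme.example", "B2B")
bad = Company("stub", "too short", "acme.example", "")
```

Returning the full list of reasons, rather than stopping at the first failure, is one way to satisfy the "named with a specific reason" invariant.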

Backlog spawned by this PR

  • B004: tune MIN_DESCRIPTION_CHARS based on borderline rows.
  • B005: name the 64 missing-from-upstream W26 companies (compare yc-oss slugs to a slug list discovered from /companies/<slug> profile pages).

Acceptance

🤖 Generated with Claude Code

…report

Phase 1 PR #1: ships the data-quality floor before any LLM cost is incurred.

What this lands
- src/ycai/schemas.py: pydantic models — RawCompany, CoverageRecord,
  BatchCoverage, CoverageTier, DropReason. Single source of truth for
  what a company looks like at every pipeline stage.
- src/ycai/scraper.py: yc-oss/api as the only sanctioned source per ADR
  0001. Hard-fails on unreachable upstream — no fallback to the
  robots.txt-disallowed `ycombinator.com/companies?batch=...` URL.
- src/ycai/sanitizer.py: defensive PII strip (email, phone, address,
  API keys) before any data hits disk or the LLM.
- src/ycai/coverage.py: tier classifier (A/B/C) + dropped register.
  Coverage = (Tier A + Tier B) / total.
- src/ycai/verifier.py: async link-checker, HEAD with GET fallback.
- src/ycai/dashboard.py: single-file HTML output. Headline metric is
  coverage; the dropped register is rendered before any chart so quality
  issues are unmissable. No CDN, opens offline.
- src/ycai/cli.py: `ycai run-coverage` wires it together.
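
For readers unfamiliar with the HEAD-with-GET-fallback pattern used by the link checker, here is a network-free sketch. The HTTP call is abstracted behind a hypothetical `send(method, url)` coroutine, and the set of statuses that trigger the fallback is an assumption; the real `verifier.py` may decide differently.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical transport signature: (method, url) -> HTTP status code.
Send = Callable[[str, str], Awaitable[int]]

# Statuses suggesting the server rejects HEAD rather than the page being dead.
# This set is an assumption made for the sketch.
HEAD_UNSUPPORTED = {403, 405, 501}


async def check_url(url: str, send: Send) -> int:
    """Try a cheap HEAD first; retry with GET only if HEAD looks unsupported."""
    status = await send("HEAD", url)
    if status in HEAD_UNSUPPORTED:
        status = await send("GET", url)
    return status


async def check_all(urls: list[str], send: Send) -> dict[str, int]:
    """Check every URL concurrently and map each URL to its final status."""
    statuses = await asyncio.gather(*(check_url(u, send) for u in urls))
    return dict(zip(urls, statuses))


async def fake_send(method: str, url: str) -> int:
    """Stand-in transport: one host rejects HEAD, one host is dead."""
    if "no-head" in url:
        return 405 if method == "HEAD" else 200
    return 404 if "dead" in url else 200


result = asyncio.run(check_all(
    ["https://ok.example", "https://no-head.example", "https://dead.example"],
    fake_send,
))
```

Injecting the transport keeps the logic testable without network access, similar in spirit to the MockTransport approach mentioned in the test plan.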

Quality probe — the user's feature request
The coverage probe acknowledges every dropped company and the specific
reason (no quiet drops). Two coverage % numbers: vs. upstream, and vs.
known YC-official count. The latter is the headline.

W26 first run: 63.3% coverage of the 196-company batch. 64 companies
missing from yc-oss/api due to upstream staleness (last refreshed
2026-02-08); 8 dropped for missing fields (named in the register); 4
dead websites (kept as Tier B with a flag). Findings in
docs/QUALITY_REPORT_W26.md and the sanitized example dashboard at
examples/output/dashboard-w26-2026-05-01.html.

Hygiene
- 41 tests pass (sanitizer, scraper, coverage, smoke).
- Pre-commit + publish-check green.
- Test fixtures with intentional fake API keys gated by inline
  pragma + script exclusions so we keep credential blocking strict
  for everything else.
- Two new BACKLOG entries: B004 (description threshold tuning), B005
  (name the missing-from-upstream companies).

Closes #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements Phase 1 of the project, introducing a scraper for yc-oss/api, a PII sanitizer, an async link verifier, and a data-quality coverage probe. It adds a Typer-based CLI with a run-coverage command that generates machine-readable JSON reports and interactive HTML dashboards. Key logic includes tier-based classification of companies to identify data gaps and a 'dropped register' for transparency. Feedback focuses on improving type safety in the CLI, refining HTML/JSON escaping in the dashboard generator, and ensuring dynamic rendering of company tiers in the output tables.

Comment thread src/ycai/dashboard.py
Comment on lines +214 to +220
industry_rows = "\n".join(
    f"<tr><td><code>{_escape(c.slug)}</code></td>"
    f"<td>{_escape(c.name)}</td>"
    f"<td>{_escape(c.industry)}</td>"
    f'<td><span class="badge tier-A">A</span></td></tr>'
    for c in keepers
)

high

The industry rows table currently hardcodes the 'Tier A' badge. Since the report includes both Tier A and Tier B companies, the badge should dynamically reflect the actual tier of each company.

Suggested change
-industry_rows = "\n".join(
-    f"<tr><td><code>{_escape(c.slug)}</code></td>"
-    f"<td>{_escape(c.name)}</td>"
-    f"<td>{_escape(c.industry)}</td>"
-    f'<td><span class="badge tier-A">A</span></td></tr>'
-    for c in keepers
-)
+slug_to_tier = {r.slug: r.tier for r in coverage.records}
+industry_rows = "\n".join(
+    f"<tr><td><code>{_escape(c.slug)}</code></td>"
+    f"<td>{_escape(c.name)}</td>"
+    f"<td>{_escape(c.industry)}</td>"
+    f'<td><span class="badge tier-{slug_to_tier[c.slug]}">{slug_to_tier[c.slug]}</span></td></tr>'
+    for c in keepers
+)
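
A standalone sketch of per-row tier badges, for illustration only. `Record` is a hypothetical stand-in for the project's coverage record, and `tier` is modeled as a plain string; if the real field is an enum, interpolating it in an f-string may produce the member name rather than the bare letter, so `.value` would be the safer thing to render.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Record:  # hypothetical stand-in for CoverageRecord
    slug: str
    name: str
    tier: str  # "A" or "B"; a plain string here, the real field may be an enum


def industry_rows(records: list[Record]) -> str:
    """Render one table row per kept company with a badge matching its tier."""
    return "\n".join(
        f"<tr><td><code>{escape(r.slug)}</code></td>"
        f"<td>{escape(r.name)}</td>"
        f'<td><span class="badge tier-{escape(r.tier)}">{escape(r.tier)}</span></td></tr>'
        for r in records
    )


rows = industry_rows([Record("acme", "Acme", "A"), Record("stale", "Stale & Co", "B")])
```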

Comment thread src/ycai/dashboard.py
region_chart=_bar_chart(regions, coverage.analyzable_count, top=12),
dropped_table=_dropped_table(coverage),
methodology_text=methodology,
raw_data_json=_escape(json.dumps(raw, default=str)),

high

HTML-escaping the JSON string inside a <script type="application/json"> tag breaks the JSON format (e.g., by replacing quotes with &quot;). This will cause client-side parsing to fail. To safely embed JSON in HTML, you should only escape the sequence </ to prevent the browser from prematurely closing the script tag.

Suggested change
-raw_data_json=_escape(json.dumps(raw, default=str)),
+raw_data_json=json.dumps(raw, default=str).replace("</", "<\\/"),
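
The difference between the two approaches can be checked without a browser. The sketch below uses a made-up payload and a made-up `id="raw-data"` attribute; only `json.dumps`, `html.escape`, and the `"</"` replacement come from the review comment.

```python
import html
import json

raw = {"name": "Acme </script><b>", "tags": ["a&b"]}
dumped = json.dumps(raw, default=str)

# HTML-escaping turns the JSON quotes into &quot;. Browsers do not decode
# entities inside <script>, so the embedded text is no longer valid JSON.
escaped = html.escape(dumped)

# Escaping only "</" keeps the payload valid JSON ("\/" is a legal JSON escape
# for "/") while preventing an early close of the script element.
safe = dumped.replace("</", "<\\/")

page = f'<script type="application/json" id="raw-data">{safe}</script>'
```

Parsing `safe` with `json.loads` gives back the original dictionary, while parsing `escaped` raises `json.JSONDecodeError`.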

Comment thread src/ycai/cli.py
Comment on lines +8 to +20
+import logging
+from datetime import UTC, datetime
+from pathlib import Path
+
-def app() -> None:
-    """Entry shim. Replaced with a Typer app in Phase 1."""
-    print("yc-ai-pulse: Phase 0 scaffold. CLI lands in Phase 1.")
+import typer
+from rich.console import Console
+from rich.table import Table
+
+from ycai import __version__
+from ycai.coverage import compute_coverage, coverage_summary_lines
+from ycai.dashboard import render as render_dashboard
+from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours
+from ycai.verifier import check_urls, split_by_status

medium

The csv module should be imported at the top level. Additionally, importing RawCompany here allows for better type hinting in the _write_csv function.

Suggested change
-import logging
-from datetime import UTC, datetime
-from pathlib import Path
-def app() -> None:
-    """Entry shim. Replaced with a Typer app in Phase 1."""
-    print("yc-ai-pulse: Phase 0 scaffold. CLI lands in Phase 1.")
-import typer
-from rich.console import Console
-from rich.table import Table
-from ycai import __version__
-from ycai.coverage import compute_coverage, coverage_summary_lines
-from ycai.dashboard import render as render_dashboard
-from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours
-from ycai.verifier import check_urls, split_by_status
+import csv
+import logging
+from datetime import UTC, datetime
+from pathlib import Path
+import typer
+from rich.console import Console
+from rich.table import Table
+from ycai import __version__
+from ycai.coverage import compute_coverage, coverage_summary_lines
+from ycai.dashboard import render as render_dashboard
+from ycai.schemas import RawCompany
+from ycai.scraper import UpstreamError, fetch_batch, upstream_age_hours
+from ycai.verifier import check_urls, split_by_status

Comment thread src/ycai/cli.py Outdated
Comment on lines +146 to +148
def _write_csv(companies: list, path: Path) -> None:
    """Tiny CSV writer that doesn't pull in pandas just for serialization."""
    import csv

medium

Specifying list[RawCompany] instead of the generic list improves type safety. The local csv import can be removed as it is now handled at the top level.

Suggested change
-def _write_csv(companies: list, path: Path) -> None:
-    """Tiny CSV writer that doesn't pull in pandas just for serialization."""
-    import csv
+def _write_csv(companies: list[RawCompany], path: Path) -> None:
+    """Tiny CSV writer that doesn't pull in pandas just for serialization."""

Comment thread src/ycai/dashboard.py
Comment on lines +7 to +11
import json
from collections import Counter
from pathlib import Path

from ycai.schemas import BatchCoverage, CoverageTier, RawCompany

medium

Add import html to the top-level imports to support more robust HTML escaping.

Suggested change
-import json
-from collections import Counter
-from pathlib import Path
-from ycai.schemas import BatchCoverage, CoverageTier, RawCompany
+import html
+import json
+from collections import Counter
+from pathlib import Path
+from ycai.schemas import BatchCoverage, CoverageTier, RawCompany

Comment thread src/ycai/dashboard.py
Comment on lines +148 to +149
def _escape(text: str) -> str:
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")

medium

Using html.escape is preferred over a manual implementation as it is more comprehensive and follows standard practices.

Suggested change
-def _escape(text: str) -> str:
-    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")
+def _escape(text: str) -> str:
+    return html.escape(text)
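
To make "more comprehensive" concrete: with its default `quote=True`, `html.escape` also escapes single quotes, which the hand-rolled version leaves alone. A small comparison, where `manual_escape` reproduces the original `_escape` body and the sample string is invented:

```python
import html


def manual_escape(text: str) -> str:
    """The hand-rolled version from the diff: handles &, <, > and double quotes."""
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


sample = """<a href='x' title="y">Tom & Jerry</a>"""
by_hand = manual_escape(sample)
by_stdlib = html.escape(sample)  # quote=True by default, so ' becomes &#x27;
```

The single-quote difference matters only where escaped text lands inside a single-quoted HTML attribute, but using the standard function removes the need to reason about that case.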

mypy --strict in CI flagged 9 errors that a non-strict local run did
not surface. Fixes:
- scraper.py: type-narrow dict.get() results before int/str/parse_iso
- dashboard.py: explicit Counter[str] annotations
- cli.py: import RawCompany for _write_csv concrete signature
- sanitizer.py: drop an unused type: ignore that mypy 1.20 rejects
- isinstance(x, (int, str)) -> isinstance(x, int | str) (UP038)

41 tests green, ruff clean, mypy --strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RyanAlberts RyanAlberts merged commit ceed52e into main May 1, 2026
3 checks passed
@RyanAlberts RyanAlberts deleted the phase-1-pr1-scraper-coverage branch May 1, 2026 19:09
@RyanAlberts RyanAlberts mentioned this pull request May 1, 2026
8 tasks
RyanAlberts added a commit that referenced this pull request May 1, 2026
First publishable release.

What ships in v0.1.0
- Phase 0 + Phase 1 of the project plan: scraper, sanitizer, link
  verifier, coverage probe, LLM enrichment with anti-hallucination
  Layer 1, enriched dashboard, cited-URL publish gate, resume + re-
  render commands, raw failure capture.
- 103 tests passing, mypy --strict clean, secret-scan clean.
- Real W26 results checked in: 63.3% coverage, 95% high-confidence
  on the LLM enrichment, 0 schema failures, 0 hallucinated source
  URLs, top finding 'W26 = the agentic batch' on n=118.

Mechanics
- pyproject.toml: 0.0.1 -> 0.1.0, classifier bumped pre-alpha -> alpha.
- src/ycai/__init__.py: __version__ matches.
- tests/test_smoke.py: version assertion bumped.
- CHANGELOG.md: 0.1.0 release notes synthesizing PR #6-#9.
- README.md: status table updated, quickstart documents the actual
  v0.1 commands (run-coverage / resume / dashboard).
- .github/workflows/release.yml: build wheel+sdist on tag push,
  publish to PyPI via Trusted Publishing (id-token), attach
  artifacts to GitHub release.

Local smoke
- python -m build produces yc_ai_pulse-0.1.0-py3-none-any.whl (38KB)
  and yc_ai_pulse-0.1.0.tar.gz (134KB).
- pipx install --force <wheel> succeeds; ycai version returns 0.1.0;
  ycai run-coverage --batch winter-2026 succeeds end-to-end from
  a clean /tmp directory.

PyPI Trusted Publishing setup (one-time, on PyPI side)
- https://pypi.org/manage/project/yc-ai-pulse/settings/publishing/
- Repo: RyanAlberts/yc-ai-pulse
- Workflow: release.yml
- Environment: pypi
- Until configured, the publish job will fail; the GitHub release
  job still attaches built wheels.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

PR #1 — yc-oss/api scraper + sanitizer