Commit 47e40be

RyanAlberts and claude authored
feat(phase-1): LLM enrichment with anti-hallucination Layer 1 (#7)
* feat(phase-1): LLM enrichment with anti-hallucination Layer 1

Closes #2.

What ships
- src/ycai/schemas.py: CompanyAnalysis pydantic model. Industry, AICapability, TechStack, OSSPosture closed-set enums. CrossCheckResult for the two-pass logic.
- src/ycai/classifier.py: deterministic prefilling. Maps yc-oss tag soup to the Industry enum without an LLM, reducing the surface area where the model can hallucinate.
- src/ycai/researcher.py: the analyze() pipeline plus three Backend implementations:
  * AgentSDKBackend (default) — claude-agent-sdk against Claude Max subscription, no API cost.
  * AnthropicAPIBackend — pay-per-token via --api-key or ANTHROPIC_API_KEY. Key never logged or written to disk.
  * MockBackend — deterministic test backend.
- src/ycai/cli.py: --enrich flag opts companies through the enrichment pipeline. --enrich-limit caps for smoke runs. Rich progress UI.
- tests/fixtures/hallucination_traps.json: 10 synthetic companies designed to bait the classifier (misleading names, suggestive but vague descriptions, source-URL bait, acronym confusion, etc.).
- tests/test_classifier.py + tests/test_researcher.py: 78 tests total (40 new). Schema enforcement, source-URL guard, two-pass cross-check, trap-resistance, PII redaction-in-prompt verification, API-key no-leakage verification.

Anti-hallucination Layer 1 invariants
1. Pydantic schema rejects empty sources (min_length=1 on the field).
2. Source URLs must originate from the company website or YC profile URL — fabricated citations downgrade the row to low confidence.
3. confidence=medium triggers a second independent pass; disagreement on industry_primary or oss_posture downgrades to low.
4. Any failure returns a sentinel low-confidence row that survives in the CSV but is excluded from charts. No silent drops.
5. PII is stripped from the prompt before the backend sees it (defense-in-depth even though yc-oss/api fields are public).

Live smoke run on 5 W26 companies via subscription
- 4 high / 1 low confidence in 39 seconds, ~free on subscription
- gru.space correctly identified as no-ai (the model is willing to say a YC company is not an AI company)
- velum-labs correctly fell through to low-confidence sentinel when schema validation failed (likely rationale exceeded 400 char limit - tracked as B006)
- All cited sources came from inputs (website + YC profile only)
- Captured at examples/output/analyses-w26-smoke-2026-05-01.json

New backlog
- B006: track schema-validation failure rate, tune prompt if >5%
- B007: add depth=1 website crawl before LLM call to recover tech stack and OSS posture (currently come back as 'unknown' for most companies because the YC long_description doesn't mention them)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(phase-1): pin anthropic + claude-agent-sdk in pyproject deps

CI mypy job didn't have either SDK installed because they were never listed in pyproject.toml deps — they were just available locally. Adding them as runtime deps so CI installs them via pip install -e .[dev]. Both are small wheels and the package needs at least one of them at runtime (anthropic for the API path, claude-agent-sdk for the subscription path), so requiring them is correct, not bloat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ceed52e commit 47e40be
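
Invariants 2 and 3 from the commit message can be illustrated with the standard library alone. The sketch below is not the code in src/ycai/researcher.py, which this page does not show. The names `sources_are_grounded`, `cross_check`, and `Analysis` are hypothetical, and reading "originate from" as "same host, and a path at or below the input URL's path" is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable
from urllib.parse import urlsplit


def _key(url: str) -> tuple[str, str]:
    """Return (host without a leading 'www.', path without trailing '/')."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")


def _originates_from(cited: str, allowed: str) -> bool:
    c_host, c_path = _key(cited)
    a_host, a_path = _key(allowed)
    same_or_below = c_path == a_path or c_path.startswith(a_path + "/")
    return bool(c_host) and c_host == a_host and same_or_below


def sources_are_grounded(sources: list[str], website: str, yc_profile_url: str) -> bool:
    """Invariant 2 (assumed reading): every citation traces back to an input URL."""
    return bool(sources) and all(
        _originates_from(s, website) or _originates_from(s, yc_profile_url)
        for s in sources
    )


@dataclass(frozen=True)
class Analysis:
    industry_primary: str
    oss_posture: str
    confidence: str  # "high", "medium", or "low"


def cross_check(first: Analysis, second_pass: Callable[[], Analysis]) -> Analysis:
    """Invariant 3: only medium-confidence rows trigger the second pass;
    disagreement on industry_primary or oss_posture downgrades to low."""
    if first.confidence != "medium":
        return first
    second = second_pass()
    agree = (
        first.industry_primary == second.industry_primary
        and first.oss_posture == second.oss_posture
    )
    return first if agree else replace(first, confidence="low")
```

Passing the second pass as a callable keeps the extra backend call lazy, so high-confidence rows never pay for it.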

15 files changed

Lines changed: 1404 additions & 8 deletions

.secrets.baseline

Lines changed: 1 addition & 1 deletion
@@ -127,5 +127,5 @@
     }
   ],
   "results": {},
-  "generated_at": "2026-05-01T19:00:38Z"
+  "generated_at": "2026-05-01T19:21:27Z"
 }

BACKLOG.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ Promoted to GitHub issues when an item survives more than one PR. ADRs for non-t
 - [B003] CI annotations report Node 20 actions deprecated (forced to Node 24 from 2026-06-02). Refresh `actions/checkout`, `actions/setup-python`, `gitleaks/gitleaks-action` to Node-24-compatible majors before that date. — surfaced in: phase 0 CI run — proposed: ad-hoc PR before 2026-06-02
 - [B004] Tune `MIN_DESCRIPTION_CHARS` (currently 80). The W26 probe surfaced one borderline drop (`moda`, 57 chars). A small calibration study against borderline rows would let us pick a defensible threshold. — surfaced in: W26 quality probe — proposed: PR #2
 - [B005] Name the missing-from-upstream companies, not just count them. Compare yc-oss slugs to a slug list discovered from `/companies/<slug>` profile pages so the dropped register includes "Acme (in YC W26 but not in yc-oss/api)". — surfaced in: W26 quality probe — proposed: PR #2 or #3
+- [B006] Track schema-validation failure rate during enrichment as a tracked metric. The W26 smoke run had 1/5 (20%) parse failures (`velum-labs` — likely rationale exceeded the 400 char limit). Measure this across the full batch and tune prompt or schema if rate exceeds ~5%. — surfaced in: PR #2 smoke — proposed: PR #3
+- [B007] Tech-stack and OSS-posture nearly always come back as `unknown` because the model only sees the YC `long_description`, not the company website. Adding a depth=1 website crawl before the LLM call would let the model identify e.g. "this product is closed-source SaaS" or "uses OpenAI" — significantly improving Tier A signal density. Cost: ~5-10 KB extra context per company. — surfaced in: PR #2 smoke — proposed: PR #3

 ## Done
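
The B006 metric could be computed directly from an enrichment output file. This is a minimal sketch under two assumptions: that failed rows carry the `schema-validation-failure` marker in `tagline_rewrite` (as the `velum-labs` row in the smoke-run JSON does), and that the ~5% figure is applied as a strict threshold. The function names are hypothetical.

```python
from typing import Any

FAILURE_MARKER = "schema-validation-failure"  # marker seen in the sentinel tagline
MAX_FAILURE_RATE = 0.05                       # the ~5% figure from B006


def validation_failure_rate(rows: list[dict[str, Any]]) -> float:
    """Fraction of rows that fell through to the schema-validation sentinel."""
    if not rows:
        return 0.0
    failures = sum(1 for row in rows if FAILURE_MARKER in row.get("tagline_rewrite", ""))
    return failures / len(rows)


def needs_prompt_tuning(rows: list[dict[str, Any]], threshold: float = MAX_FAILURE_RATE) -> bool:
    """True when the failure rate is above the backlog threshold."""
    return validation_failure_rate(rows) > threshold
```

On the five-row smoke run this would report a rate of 1/5, well above the threshold, although five rows is too small a sample to act on.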

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,5 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Phase 1 PR #1: yc-oss/api scraper, PII sanitizer, link verifier, coverage probe, single-file dashboard, Typer CLI (`ycai run-coverage`).
 - Coverage metric is the dashboard headline. The dropped register acknowledges every excluded company and the specific reason — no quiet drops.
 - First end-to-end probe on YC W26: 63.3% coverage of the official 196-company batch. Findings in `docs/QUALITY_REPORT_W26.md`.
+- Phase 1 PR #2: LLM-based enrichment with anti-hallucination Layer 1 — pydantic-enforced output schema, source-URL guard against fabricated citations, two-pass cross-check on uncertain rows, sentinel low-confidence row on any failure. Three backends: `AgentSDKBackend` (subscription-default), `AnthropicAPIBackend` (`--api-key`), `MockBackend` (tests). 10 hallucination-trap fixtures locked in as regression tests.
+- W26 enrichment smoke run (5 companies via subscription, 39s, ~free): 4 high / 1 low confidence. Identified `gru.space` as `no-ai` correctly. Schema-validation failure on `velum-labs` correctly fell through to the sentinel — no fabricated analysis served.

 [Unreleased]: https://github.com/RyanAlberts/yc-ai-pulse/compare/main...HEAD

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ Sanitized sample artifacts. Every commit goes through `make publish-check` so PI
 |---|---|
 | [`output/dashboard-w26-2026-05-01.html`](output/dashboard-w26-2026-05-01.html) | Phase 1 dashboard for YC W26. Headline: 63.3% coverage of the 196-company batch, with the dropped register naming every excluded company. |
 | [`output/coverage-w26-2026-05-01.json`](output/coverage-w26-2026-05-01.json) | Machine-readable coverage report — what feeds the dashboard. |
+| [`output/analyses-w26-smoke-2026-05-01.json`](output/analyses-w26-smoke-2026-05-01.json) | PR #2 smoke run: 5-company LLM enrichment via Sonnet 4.6 on subscription. Captures the schema-enforced output and demonstrates source-URL grounding (every cited URL is from `website` or YC profile). |

 The full quality writeup for W26 is in [`docs/QUALITY_REPORT_W26.md`](../docs/QUALITY_REPORT_W26.md).

examples/output/analyses-w26-smoke-2026-05-01.json

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+[
+  {
+    "slug": "bidflow",
+    "industry_primary": "Real Estate / Construction",
+    "industry_secondary": [
+      "B2B SaaS"
+    ],
+    "ai_capability": [
+      "rag",
+      "nlp-classic",
+      "agents"
+    ],
+    "tech_stack": [
+      "unknown"
+    ],
+    "oss_posture": "unknown",
+    "oss_evidence_url": null,
+    "tagline_rewrite": "AI copilot that automates electrical contractor RFP estimation, cutting bid prep time dramatically.",
+    "confidence": "high",
+    "sources": [
+      "https://www.ycombinator.com/companies/bidflow",
+      "https://usebidflow.com/"
+    ],
+    "rationale": "The description explicitly states the product helps 'electrical contractors do electrical estimates way faster using AI' and targets 'redundant paperwork' in RFP submission, confirming Real Estate/Construction primary with B2B SaaS delivery; YC tags 'AI Assistant' and 'SaaS' corroborate the classification."
+  },
+  {
+    "slug": "travo",
+    "industry_primary": "Real Estate / Construction",
+    "industry_secondary": [
+      "B2B SaaS",
+      "AI Infrastructure"
+    ],
+    "ai_capability": [
+      "rag",
+      "data-pipeline",
+      "agents"
+    ],
+    "tech_stack": [
+      "unknown"
+    ],
+    "oss_posture": "unknown",
+    "oss_evidence_url": null,
+    "tagline_rewrite": "AI-powered real estate data platform for RV parks and niche asset classes \u2014 comps, ownership, zoning, and financials in one place.",
+    "confidence": "high",
+    "sources": [
+      "https://www.ycombinator.com/companies/travo",
+      "https://www.travoai.com/"
+    ],
+    "rationale": "The description explicitly states they 'use AI to collect and analyze real estate data' and 'build the best informed AI applications for real estate private equity firms, developers, and brokers,' confirming a Real Estate / Construction primary with AI-driven data pipeline and RAG-style retrieval capabilities. No specific model or framework is mentioned, so tech_stack is unknown."
+  },
+  {
+    "slug": "galactic-resource-utilization-space-inc-gru-space",
+    "industry_primary": "Industrials",
+    "industry_secondary": [
+      "Real Estate / Construction",
+      "Consumer",
+      "Government / Defense"
+    ],
+    "ai_capability": [
+      "no-ai"
+    ],
+    "tech_stack": [],
+    "oss_posture": "unknown",
+    "oss_evidence_url": null,
+    "tagline_rewrite": "In-situ resource utilization to build pressurized lunar habitats, starting with a Moon hotel opening 2032.",
+    "confidence": "high",
+    "sources": [
+      "https://www.ycombinator.com/companies/galactic-resource-utilization-space-inc-gru-space",
+      "https://gru.space/"
+    ],
+    "rationale": "The description explicitly focuses on off-planet habitat construction using 'in-situ resource utilization technology, turning local material into building material,' with a roadmap through lunar and Martian infrastructure \u2014 squarely Industrials/Aviation and Space. No AI capabilities or tech stack are mentioned anywhere in the provided text."
+  },
+  {
+    "slug": "autumn-ai",
+    "industry_primary": "B2B SaaS",
+    "industry_secondary": [
+      "AI Infrastructure",
+      "Developer Tools"
+    ],
+    "ai_capability": [
+      "rag",
+      "data-pipeline",
+      "nlp-classic",
+      "agents"
+    ],
+    "tech_stack": [
+      "unknown"
+    ],
+    "oss_posture": "unknown",
+    "oss_evidence_url": null,
+    "tagline_rewrite": "Real-time buying signal intelligence platform that monitors web activity to surface high-intent prospects for GTM teams.",
+    "confidence": "high",
+    "sources": [
+      "https://www.ycombinator.com/companies/autumn-ai"
+    ],
+    "rationale": "The description explicitly states Autumn is 'building the first real-time signal intelligence platform for GTM teams' that 'monitors posts, commits, blogs, and announcements, surfacing buying signals,' indicating a B2B SaaS product with AI-driven data pipeline and NLP capabilities; no specific model providers or OSS artifacts are mentioned."
+  },
+  {
+    "slug": "velum-labs",
+    "industry_primary": "Unknown",
+    "industry_secondary": [],
+    "ai_capability": [
+      "unclear"
+    ],
+    "tech_stack": [],
+    "oss_posture": "unknown",
+    "oss_evidence_url": null,
+    "tagline_rewrite": "(no analysis: schema-validation-failure)",
+    "confidence": "low",
+    "sources": [
+      "https://github.com/RyanAlberts/yc-ai-pulse#unverifiable"
+    ],
+    "rationale": "Auto-generated low-confidence sentinel because: schema-validation-failure"
+  }
+]
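
The final `velum-labs` entry is the sentinel row described in invariant 4. A fallback that produces a row of this shape might look like the sketch below. The field values are copied from the JSON above; the function names and the `backend-error` reason code are hypothetical, and mapping `ValueError` to `schema-validation-failure` is an assumption about how validation errors surface.

```python
from typing import Any, Callable

UNVERIFIABLE_SOURCE = "https://github.com/RyanAlberts/yc-ai-pulse#unverifiable"


def sentinel_row(slug: str, reason: str) -> dict[str, Any]:
    """Build a low-confidence placeholder so the company is never silently dropped."""
    return {
        "slug": slug,
        "industry_primary": "Unknown",
        "industry_secondary": [],
        "ai_capability": ["unclear"],
        "tech_stack": [],
        "oss_posture": "unknown",
        "oss_evidence_url": None,
        "tagline_rewrite": f"(no analysis: {reason})",
        "confidence": "low",
        "sources": [UNVERIFIABLE_SOURCE],
        "rationale": f"Auto-generated low-confidence sentinel because: {reason}",
    }


def analyze_or_sentinel(slug: str, analyze: Callable[[str], dict[str, Any]]) -> dict[str, Any]:
    """Run the analysis; on any failure return a sentinel instead of raising."""
    try:
        return analyze(slug)
    except ValueError:
        # Assumption: schema validation problems arrive as ValueError subclasses.
        return sentinel_row(slug, "schema-validation-failure")
    except Exception:
        # Hypothetical catch-all reason code for every other failure.
        return sentinel_row(slug, "backend-error")
```

Because the sentinel still carries one entry in `sources`, it would satisfy a non-empty-sources constraint while remaining clearly marked as unverifiable.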

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ dependencies = [
     "pydantic>=2.5",
     "typer>=0.12",
     "rich>=13",
+    "anthropic>=0.40",
+    "claude-agent-sdk>=0.1",
 ]

 [project.optional-dependencies]

scripts/publish_check.sh

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ fi
 # - scripts/{secret_scan,publish_check}.sh (these files name the patterns)
 # - tests/test_sanitizer.py (test fixtures must contain the patterns)
 SUSPICIOUS=$(git ls-files \
-  | grep -v -E '^(\.secrets\.baseline|scripts/secret_scan\.sh|scripts/publish_check\.sh|tests/test_sanitizer\.py)$' \
+  | grep -v -E '^(\.secrets\.baseline|scripts/secret_scan\.sh|scripts/publish_check\.sh|tests/test_sanitizer\.py|tests/test_researcher\.py)$' \
   | xargs grep -l -E -i 'sk-ant-[A-Za-z0-9_\-]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}' 2>/dev/null || true)
 if [[ -n "$SUSPICIOUS" ]]; then
   echo "❌ suspicious credential strings found in:"
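
For readers who want to unit-test the allowlist change, the same check can be approximated in Python. This is a rough equivalent of the shell pipeline, not part of the commit: `suspicious_files` and its dict-of-contents input are hypothetical, while the regex and the excluded paths are copied from the hunk above.

```python
import re

# Same alternation as the grep -E pattern; IGNORECASE mirrors grep -i.
CREDENTIAL_RE = re.compile(
    r"sk-ant-[A-Za-z0-9_\-]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}",
    re.IGNORECASE,
)

# Files that are allowed to contain pattern-shaped strings (fake fixture values).
ALLOWLIST = {
    ".secrets.baseline",
    "scripts/secret_scan.sh",
    "scripts/publish_check.sh",
    "tests/test_sanitizer.py",
    "tests/test_researcher.py",
}


def suspicious_files(files: dict[str, str]) -> list[str]:
    """Return paths, outside the allowlist, whose text matches a credential pattern."""
    return sorted(
        path
        for path, text in files.items()
        if path not in ALLOWLIST and CREDENTIAL_RE.search(text)
    )
```

With `tests/test_researcher.py` on the allowlist, its fake-key fixtures no longer trip the check, while the same string anywhere else still does.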

scripts/secret_scan.sh

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ else
 # - tests/test_sanitizer.py (test fixtures must contain the patterns
 # they're testing redaction of). Reviewed manually — these are fake values.
 HITS=$(echo "$FILES" \
-  | grep -v -E '^(\.secrets\.baseline|scripts/secret_scan\.sh|tests/test_sanitizer\.py)$' \
+  | grep -v -E '^(\.secrets\.baseline|scripts/secret_scan\.sh|tests/test_sanitizer\.py|tests/test_researcher\.py)$' \
   | xargs grep -E -l "$pattern" 2>/dev/null || true)
 if [[ -n "$HITS" ]]; then
   echo "❌ pattern matched: $pattern"

src/ycai/classifier.py

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+"""Taxonomies + deterministic prefilling.
+
+Where we can answer a classification question from yc-oss/api fields alone
+(without an LLM), we do. This:
+1. saves Sonnet calls,
+2. produces a deterministic answer auditors can re-derive,
+3. reduces the surface area where the model can hallucinate.
+
+The LLM still classifies AI capability, tech stack, OSS posture, and the
+tagline — fields that can't be derived from YC's tag list.
+"""
+
+from __future__ import annotations
+
+from ycai.schemas import Industry
+
+# yc-oss industry / subindustry / tag substrings -> our enum.
+# Ordered most-specific first; first match wins.
+_INDUSTRY_RULES: tuple[tuple[str, Industry], ...] = (
+    ("ai infrastructure", Industry.AI_INFRASTRUCTURE),
+    ("developer tools", Industry.DEVELOPER_TOOLS),
+    ("dev tools", Industry.DEVELOPER_TOOLS),
+    ("security", Industry.SECURITY),
+    ("biotech", Industry.BIOTECH),
+    ("healthcare", Industry.HEALTHCARE),
+    ("medical", Industry.HEALTHCARE),
+    ("fintech", Industry.FINTECH),
+    ("financial", Industry.FINTECH),
+    ("legal", Industry.LEGAL),
+    ("education", Industry.EDUCATION),
+    ("real estate", Industry.REAL_ESTATE_CONSTRUCTION),
+    ("construction", Industry.REAL_ESTATE_CONSTRUCTION),
+    ("logistics", Industry.SUPPLY_CHAIN_LOGISTICS),
+    ("supply chain", Industry.SUPPLY_CHAIN_LOGISTICS),
+    ("climate", Industry.CLIMATE_ENERGY),
+    ("energy", Industry.CLIMATE_ENERGY),
+    ("robotics", Industry.ROBOTICS),
+    ("hardware", Industry.HARDWARE),
+    ("industrials", Industry.INDUSTRIALS),
+    ("government", Industry.GOVERNMENT_DEFENSE),
+    ("defense", Industry.GOVERNMENT_DEFENSE),
+    ("media", Industry.MEDIA_CONTENT),
+    ("content", Industry.MEDIA_CONTENT),
+    ("consumer", Industry.CONSUMER),
+    ("b2b", Industry.B2B_SAAS),
+    ("saas", Industry.B2B_SAAS),
+)
+
+
+def map_industry(yc_industry: str, yc_subindustry: str = "", yc_tags: list[str] | None = None) -> Industry:
+    """Map a yc-oss industry/subindustry/tags hint into our enum.
+
+    Returns ``Industry.UNKNOWN`` only if absolutely nothing matches — the LLM
+    can override our guess if it has a stronger signal from the website.
+    """
+    haystack = " ".join(
+        [yc_industry or "", yc_subindustry or "", " ".join(yc_tags or [])],
+    ).lower()
+    for needle, industry in _INDUSTRY_RULES:
+        if needle in haystack:
+            return industry
+    return Industry.UNKNOWN
+
+
+def industry_secondaries(yc_industry: str, yc_subindustry: str, yc_tags: list[str]) -> list[Industry]:
+    """Extra industry hits beyond the primary, from the same haystack.
+
+    Caps at 3 to keep the chart legible.
+    """
+    haystack = " ".join([yc_industry or "", yc_subindustry or "", " ".join(yc_tags or [])]).lower()
+    seen: list[Industry] = []
+    for needle, industry in _INDUSTRY_RULES:
+        if needle in haystack and industry not in seen:
+            seen.append(industry)
+    return seen[1:4]  # skip the primary (index 0), take next 3
+
+
+__all__ = ["industry_secondaries", "map_industry"]
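
To see how first-match-wins ordering plays out, here is a cut-down, self-contained illustration of the same matching logic. It is a sketch, not the module itself: the `Industry` enum below is a stand-in for `ycai.schemas.Industry` (its shape is assumed, with values taken from the smoke-run JSON), `RULES` is a subset of `_INDUSTRY_RULES` in the same relative order, and `ordered_hits` is a hypothetical helper that merges the two functions' shared loop.

```python
from enum import Enum


class Industry(Enum):
    """Stand-in for ycai.schemas.Industry; the real definition is not shown here."""
    AI_INFRASTRUCTURE = "AI Infrastructure"
    REAL_ESTATE_CONSTRUCTION = "Real Estate / Construction"
    CONSUMER = "Consumer"
    B2B_SAAS = "B2B SaaS"
    UNKNOWN = "Unknown"


# Subset of the rule table, most-specific first.
RULES = (
    ("ai infrastructure", Industry.AI_INFRASTRUCTURE),
    ("real estate", Industry.REAL_ESTATE_CONSTRUCTION),
    ("construction", Industry.REAL_ESTATE_CONSTRUCTION),
    ("consumer", Industry.CONSUMER),
    ("b2b", Industry.B2B_SAAS),
    ("saas", Industry.B2B_SAAS),
)


def ordered_hits(yc_industry, yc_subindustry="", yc_tags=None):
    """Unique industries whose needle occurs in the haystack, in rule order."""
    haystack = " ".join([yc_industry or "", yc_subindustry or "", *(yc_tags or [])]).lower()
    hits = []
    for needle, industry in RULES:
        if needle in haystack and industry not in hits:
            hits.append(industry)
    return hits


hits = ordered_hits("B2B", "Real Estate and Construction", ["SaaS", "AI Assistant"])
primary = hits[0] if hits else Industry.UNKNOWN  # what map_industry would return
secondaries = hits[1:4]                          # what industry_secondaries would return
```

Here "B2B" comes first in the input, but the primary is still Real Estate / Construction because that rule sits higher in the table. Matching is by substring, so a needle also fires inside longer words; the rule order and needle choice carry all of the precision.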
