Problem
`scripts/coverage_gate.py` (slice #18) currently computes ONE global coverage % across all rows in `financial_lines.parquet`:
```
coverage = non_null_concept_count / total_count
```
This works for the v0 dataset, where every symbol uses the industrial template (currently just PTT). It will break the moment we ingest the first bank, REIT, or insurer.
Why: Banks (BBL, SCB, KBANK) report entirely different line items from industrials: `interest_income` instead of `revenue`, `loans_to_customers` instead of `inventory`, and so on. Even with a perfect bank-aware concept dictionary, each bank ingested adds to the unmapped-row count for industrial concepts (the bank simply does not report those lines), so the gate's global coverage % drops by ~0.5pp per bank.
Result: every `workflow_dispatch` run that adds banks to the universe will fail the default `--max-regression-pp 1.0` gate; at ~0.5pp per bank, adding BBL, SCB, and KBANK together is already enough. Engineers will start passing `--max-regression-pp 99` to bypass it, which defeats the purpose of the gate.
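A minimal sketch of one way the global figure can move purely because the industry mix changes. The row counts and the `coverage` helper are invented for illustration and deliberately exaggerated; they are not measurements from `financial_lines.parquet`.

```python
def coverage(rows):
    """Fraction of rows whose concept is mapped (non-null)."""
    mapped = sum(1 for r in rows if r["concept"] is not None)
    return mapped / len(rows)

# Hypothetical rows: industrial lines are 90% mapped, bank lines only 60%
# mapped because the bank side of the dictionary is still being filled out.
industrial = (
    [{"industry": "industrial", "concept": "revenue"}] * 90
    + [{"industry": "industrial", "concept": None}] * 10
)
bank = (
    [{"industry": "bank", "concept": "interest_income"}] * 60
    + [{"industry": "bank", "concept": None}] * 40
)

before = coverage(industrial)            # global coverage, industrials only
after = coverage(industrial + bank)      # global coverage after adding the bank
industrial_only = coverage(industrial)   # unchanged by the bank ingestion

print(f"global before: {before:.2%}, global after: {after:.2%}")
print(f"industrial-only coverage is still {industrial_only:.2%}")
```

No industrial row changed, yet the single global number regresses, which is the behavior a per-industry comparison would avoid.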
Solution
Make the gate per-industry, using `concepts.applicable_industries` to decide which concepts count toward each industry:
```
coverage_per_industry = (
    non_null_concept_count_for_rows_in_industry
    / total_count_for_rows_in_industry
)
# Compare per industry; fail if ANY industry regresses by more than max_pp.
```
Requires:
- `filings.parquet` must carry an `industry` column per (symbol, period). Source: SET sector classification (needs to be scraped or hand-tagged).
- `coverage_gate.py` joins `financial_lines.parquet` with `filings.parquet` on `filing_id` to get the industry for each row, then groups coverage by the intersection of the row's `industry` and the concept's `applicable_industries`.
- A per-industry threshold (default 1.0pp, configurable per industry). Banks may justify a wider threshold initially while the dictionary fills out.
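The requirements above could be sketched roughly as follows. This is a plain-Python illustration, not the actual `coverage_gate.py`: the function names, the dict-based row shape, and the choice to skip industries that have no baseline yet are all assumptions, and the `applicable_industries` intersection is left out because its exact rule is still open.

```python
from collections import defaultdict

def coverage_by_industry(lines, filings):
    """Join lines to filings on filing_id and compute coverage per industry.

    lines:   iterable of dicts with "filing_id" and "concept" (None if unmapped)
    filings: iterable of dicts with "filing_id" and "industry"
    """
    industry_of = {f["filing_id"]: f["industry"] for f in filings}
    mapped = defaultdict(int)
    total = defaultdict(int)
    for row in lines:
        industry = industry_of[row["filing_id"]]
        total[industry] += 1
        if row["concept"] is not None:
            mapped[industry] += 1
    return {industry: mapped[industry] / total[industry] for industry in total}

def regressions(baseline, current, max_pp=1.0, per_industry_max_pp=None):
    """Return {industry: drop_in_pp} for industries exceeding their threshold.

    Assumption: an industry with no baseline (first ingestion) cannot regress.
    """
    per_industry_max_pp = per_industry_max_pp or {}
    failed = {}
    for industry, base in baseline.items():
        if industry not in current:
            continue
        drop_pp = (base - current[industry]) * 100
        if drop_pp > per_industry_max_pp.get(industry, max_pp):
            failed[industry] = drop_pp
    return failed

# Hypothetical data: one industrial filing and one bank filing.
filings = [
    {"filing_id": 1, "industry": "industrial"},
    {"filing_id": 2, "industry": "bank"},
]
lines = [
    {"filing_id": 1, "concept": "revenue"},
    {"filing_id": 1, "concept": "inventory"},
    {"filing_id": 2, "concept": "interest_income"},
    {"filing_id": 2, "concept": None},
]
current = coverage_by_industry(lines, filings)
failed = regressions({"industrial": 1.0}, current)
print(current, failed)
```

With an industrial-only baseline, the newly added bank rows do not trip the gate, while any later drop in industrial or bank coverage is compared against that industry's own baseline and threshold.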
Workaround until this lands
Run with `--max-regression-pp 5` (or higher) when adding non-industrial symbols for the first time. Document per-industry baseline coverage in the build log so future deltas are interpretable.
Origin
Surfaced by planner #2 in the multi-company scout fan-out planning round. See the `data/concepts.csv` v0.1 schema delta (commit 8434916), which added the `applicable_industries` column that makes this fix possible.
Out of scope for this issue
- Industry classification source (SET API scrape vs hand-tag); this will need its own decision.
- Cross-industry concepts (revenue/cogs/equity) need to be counted in every industry's denominator; the exact rule is to be clarified during implementation.