[v2 follow-up] Per-industry coverage gate (banks/REITs would silently tank the global gate) #19

@ninyawee

Problem

`scripts/coverage_gate.py` (slice #18) currently computes ONE global coverage % across all rows in `financial_lines.parquet`:

```
coverage = non_null_concept_count / total_count
```

This works for the v0 dataset where every symbol is industrial-template (currently just PTT). It will break the moment we ingest the first bank, REIT, or insurer.

Why: Banks (BBL, SCB, KBANK) report entirely different line items than industrials: `interest_income` instead of `revenue`, `loans_to_customers` instead of `inventory`, and so on. Even with a perfect bank-aware concept dictionary, each bank ingested increases the unmapped-row count for industrial concepts (the bank has no such lines), so the gate's global coverage % drops by roughly 0.5pp per bank.

Result: every `workflow_dispatch` that adds banks to the universe will fail the default `--max-regression-pp 1.0` gate. Engineers will start passing `--max-regression-pp 99` to bypass it, which defeats the purpose of having a gate.
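To make the dilution concrete, here is a small worked example. The row counts are invented for illustration and are not taken from the real dataset; the point is only that a single blended ratio can fall past the threshold even when no individual industry got worse.

```python
def coverage(mapped: int, total: int) -> float:
    """Coverage in percent: rows with a non-null concept / all rows."""
    return 100.0 * mapped / total


# Hypothetical row counts (assumptions, not real numbers from the parquet file).
industrial = {"mapped": 900, "total": 1000}  # 90% covered on its own
bank = {"mapped": 60, "total": 100}          # 60% covered on its own

# Global coverage before the bank is ingested: industrial rows only.
before = coverage(industrial["mapped"], industrial["total"])

# Global coverage after: the two industries are blended into one ratio.
after = coverage(
    industrial["mapped"] + bank["mapped"],
    industrial["total"] + bank["total"],
)

delta_pp = after - before
print(f"before={before:.2f}% after={after:.2f}% delta={delta_pp:+.2f}pp")
```

With these made-up numbers the global figure moves from 90.00% to about 87.27%, a drop of roughly 2.7pp, although industrial coverage itself is unchanged. A global `--max-regression-pp 1.0` gate would fail here.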

Solution

Make the gate per-industry, scoping each concept to the industries listed in `concepts.applicable_industries`:

```
coverage_per_industry = (
    non_null_concept_count_for_rows_in_industry
    / total_count_for_rows_in_industry
)

# Compare per industry; fail if ANY industry regresses by more than max_pp.
```

Requires:

  1. `filings.parquet` carries an `industry` column per (symbol, period). Source: SET sector classification (need to scrape or hand-tag).
  2. `coverage_gate.py` joins `financial_lines.parquet` with `filings.parquet` on `filing_id` to get per-row industry, then groups coverage by `(industry, concept's applicable_industries)` intersection.
  3. Per-industry threshold (default 1.0pp; configurable per industry — banks may justify a wider threshold initially as the dictionary fills out).
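The three requirements above could be sketched roughly as follows. This is not the actual `coverage_gate.py`; it is a minimal, pure-Python illustration over in-memory rows. The `concept` column name, both function names, and the `per_industry_max_pp` parameter are assumptions made for the example, and the `applicable_industries` intersection is deliberately left out.

```python
from collections import defaultdict


def coverage_by_industry(lines, filings):
    """Return {industry: coverage %}.

    lines:   dicts with 'filing_id' and 'concept' (None when unmapped).
    filings: dicts with 'filing_id' and 'industry'.
    """
    # Join step: look up each line's industry through its filing_id.
    industry_of = {f["filing_id"]: f["industry"] for f in filings}
    mapped = defaultdict(int)
    total = defaultdict(int)
    for row in lines:
        industry = industry_of[row["filing_id"]]
        total[industry] += 1
        if row["concept"] is not None:
            mapped[industry] += 1
    return {ind: 100.0 * mapped[ind] / total[ind] for ind in total}


def regressions(baseline, current, max_pp=1.0, per_industry_max_pp=None):
    """Return {industry: drop in pp} for industries that exceed their threshold."""
    overrides = per_industry_max_pp or {}
    failed = {}
    for industry, base in baseline.items():
        if industry not in current:
            continue  # industry absent from this run: nothing to compare
        drop = base - current[industry]
        if drop > overrides.get(industry, max_pp):
            failed[industry] = drop
    return failed


# Hypothetical data, for illustration only.
filings = [
    {"filing_id": "f1", "industry": "industrial"},
    {"filing_id": "f2", "industry": "bank"},
]
lines = [
    {"filing_id": "f1", "concept": "revenue"},
    {"filing_id": "f1", "concept": "inventory"},
    {"filing_id": "f1", "concept": "cogs"},
    {"filing_id": "f1", "concept": None},
    {"filing_id": "f2", "concept": "interest_income"},
    {"filing_id": "f2", "concept": None},
]
baseline = {"industrial": 75.5, "bank": 53.0}

current = coverage_by_industry(lines, filings)
strict = regressions(baseline, current)
relaxed = regressions(baseline, current, per_industry_max_pp={"bank": 5.0})
print(current, strict, relaxed)
```

An industry that appears for the first time has no baseline entry and so cannot fail in this sketch. That matches the idea that the first bank ingested should establish a baseline rather than count as a regression, but whether the real gate should behave this way is a design decision the issue leaves open.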

Workaround until this lands

Run with `--max-regression-pp 5` (or higher) when adding non-industrial symbols for the first time. Document per-industry baseline coverage in the build log so future deltas are interpretable.

Origin

Surfaced by planner #2 in the multi-company scout fan-out planning round. See the `data/concepts.csv` v0.1 schema delta (commit 8434916), which added the `applicable_industries` column and made this fix possible.

Out of scope for this issue

  • Industry classification source (SET API scrape vs hand-tag); will need its own decision.
  • Cross-industry concepts (revenue/cogs/equity) need to be counted in every industry's denominator — clarify in implementation.

Labels: enhancement
