Commit 5da842b

release: v1.5.0 — auto-config verification layer (#45)
Bumps version to 1.5.0 and documents the preflight + postflight verification layer shipped in PR #44:

- README: new 'Auto-Config Verification (v1.5.0)' section.
- docs/python-api.md: public surface for preflight / postflight / reports / new kwargs (strict, allow_remote_assets).
- docs/configuration.md: verification subsection.
- docs/quick-start.md: 'Inspecting the verification report' pattern.
- examples/verification_inspection.py: end-to-end walkthrough of preflight findings + postflight signals.
- examples/strict_mode_parity.py: deterministic parity runs with strict=True.
- examples/README.md: updated with new examples.

Harmonizes version between pyproject.toml (was 1.4.5) and goldenmatch/__init__.py (was 1.4.4) at 1.5.0.
1 parent 3f4e8f2 commit 5da842b

9 files changed

Lines changed: 541 additions & 2 deletions

README.md

Lines changed: 66 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,8 @@ goldenmatch dedupe customers.csv
2929
[![DQBench ER](https://img.shields.io/badge/DQBench%20ER-95.30-gold)](https://github.com/benzsevern/dqbench)
3030
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/benzsevern/goldenmatch/blob/main/scripts/gpu_colab_notebook.ipynb)
3131

32+
> **v1.5.0 is out** — auto-config now runs a preflight + postflight verification layer. Bibliographic and domain-extracted schemas no longer crash under zero-config, remote-asset scorers are demoted by default, and every `DedupeResult` carries an inspectable `postflight_report`. See [Auto-Config Verification](#auto-config-verification-v150).
33+
3234
---
3335

3436
## Why GoldenMatch?
@@ -191,6 +193,70 @@ result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)
191193
result = gm.dedupe("huge.parquet", exact=["email"], backend="ray")
192194
```
193195

196+
### Auto-Config Verification (v1.5.0)
197+
198+
Zero-config used to crash on bibliographic and domain-extracted schemas — auto-config would emit a matchkey referencing `__title_key__` without enabling `config.domain`, and the pipeline would raise `ValueError: Missing required columns`. v1.5.0 closes the gap with a preflight + postflight verification layer that runs automatically around `auto_configure_df`.
199+
200+
**Preflight** (`gm.preflight`) runs 6 checks at the end of `auto_configure_df` (the list below folds domain auto-repair into column resolution):
201+
202+
- column resolution (auto-repairs missing domain-extracted columns by enabling `config.domain`)
203+
- cardinality bounds on exact matchkeys (drops near-unique and near-constant keys)
204+
- block-size sanity (flags blocks that would stall the scorer)
205+
- remote-asset demotion (any `embedding`, `record_embedding`, or cross-encoder rerank is demoted unless you pass `allow_remote_assets=True`)
206+
- confidence-gated weight capping (low-confidence fields cap at weight 0.3)
207+
208+
Unrepairable issues raise `ConfigValidationError` with the full `PreflightReport` attached as `err.report`. Repaired issues stay on the report as `findings` with `repaired=True`.
209+
210+
**Postflight** (`gm.postflight`) runs 4 signals after scoring, before clustering:
211+
212+
- score-distribution histogram + bimodality detection (auto-nudges threshold on clear bimodality)
213+
- blocking-recall estimate (gated at 10K+ rows)
214+
- preliminary cluster sizes + oversized-cluster bottleneck pair
215+
- threshold-band overlap percentage (advises `--llm-auto` when overlap > 20% and LLM is off)
216+
217+
The report attaches to `DedupeResult.postflight_report` / `MatchResult.postflight_report`.
218+
219+
```python
220+
import goldenmatch as gm
221+
import polars as pl
222+
223+
df = pl.read_csv("bibliography.csv")
224+
225+
# Zero-config -- preflight + postflight run automatically
226+
result = gm.dedupe_df(df)
227+
228+
# Inspect the preflight report (private-by-convention underscore)
229+
for finding in result.config._preflight_report.findings:
230+
print(f"[{finding.severity}] {finding.check}: {finding.message}")
231+
232+
# Inspect postflight signals (public)
233+
sig = result.postflight_report.signals
234+
print(f"Scored {sig['total_pairs_scored']} pairs")
235+
print(f"Threshold overlap: {sig['threshold_overlap_pct']:.1%}")
236+
print(f"Oversized clusters: {len(sig['oversized_clusters'])}")
237+
```
238+
239+
**Offline by default.** Remote-asset scorers are demoted unless you opt in:
240+
241+
```python
242+
cfg = gm.auto_configure_df(df, allow_remote_assets=True) # loads cross-encoder etc.
243+
```
244+
245+
**Strict mode for parity runs.** `strict=True` still computes postflight signals and emits advisories, but skips threshold adjustments. Use it for DQBench, regression suites, and any run whose output must be reproducible:
246+
247+
```python
248+
cfg = gm.auto_configure_df(df, strict=True)
249+
```
250+
251+
**New classifier smarts in v1.5.0:**
252+
253+
- Columns with a cardinality ratio ≥ 0.95 are classified as `identifier`, not `phone` / `zip` / `numeric`.
254+
- New `year` col_type routes to blocking, not scoring.
255+
- New `multi_name` col_type handles comma/semicolon-delimited author-style fields.
256+
- Low-confidence fields (< 0.5) cap at weight 0.3.
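
As a rough illustration of the last rule, the sketch below clamps a field weight when classifier confidence is low. It is a standalone example, not GoldenMatch code: `cap_weight` and the sample fields are hypothetical, and it assumes that weights already below the cap are left unchanged. Only the 0.5 and 0.3 values come from the list above.

```python
LOW_CONFIDENCE = 0.5  # documented confidence cutoff
WEIGHT_CAP = 0.3      # documented weight cap


def cap_weight(weight: float, confidence: float) -> float:
    """Clamp a field weight to WEIGHT_CAP when classifier confidence is low."""
    if confidence < LOW_CONFIDENCE:
        # Assumption: a cap only ever lowers a weight, it never raises one.
        return min(weight, WEIGHT_CAP)
    return weight


# Hypothetical fields: name -> (configured weight, classifier confidence)
fields = {
    "title": (1.0, 0.92),
    "venue": (0.8, 0.41),
    "notes": (0.2, 0.30),
}

capped = {name: cap_weight(w, c) for name, (w, c) in fields.items()}
print(capped)  # {'title': 1.0, 'venue': 0.3, 'notes': 0.2}
```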
257+
258+
See `examples/verification_inspection.py` and `examples/strict_mode_parity.py` for runnable walkthroughs.
259+
194260
### Privacy-Preserving Linkage
195261

196262
```python

docs/configuration.md

Lines changed: 33 additions & 0 deletions
@@ -364,3 +364,36 @@ Or auto-generate from data:
364364
```python
365365
config = gm.auto_configure([("data.csv", "source")])
366366
```
367+
368+
---
369+
370+
## Verification (v1.5.0)
371+
372+
`auto_configure_df` runs **preflight** at the end of config generation: 6 checks that resolve referenced columns, auto-repair missing domain-extracted columns, drop near-unique or near-constant exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise `ConfigValidationError`; the full report is attached to the exception as `err.report`.
373+
374+
The pipeline runs **postflight** after scoring and before clustering. It computes 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap), can auto-nudge the threshold on clear bimodality, and attaches the report to `DedupeResult.postflight_report` / `MatchResult.postflight_report`.
375+
376+
Two new kwargs on `auto_configure_df`:
377+
378+
```python
379+
import goldenmatch as gm
380+
381+
# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
382+
cfg = gm.auto_configure_df(df)
383+
384+
# Opt in to cross-encoder rerank / embedding scorers
385+
cfg = gm.auto_configure_df(df, allow_remote_assets=True)
386+
387+
# Strict: compute signals + advisories, but suppress auto-adjustments (DQBench, regression)
388+
cfg = gm.auto_configure_df(df, strict=True)
389+
```
390+
391+
The preflight report is available on the returned config (underscore is private-by-convention but stable across v1.5.x):
392+
393+
```python
394+
cfg = gm.auto_configure_df(df)
395+
for finding in cfg._preflight_report.findings:
396+
print(f"[{finding.severity}] {finding.check}: {finding.message}")
397+
```
398+
399+
See the [Verification section in the Python API docs](python-api.html#verification-v150) for the full `preflight` / `postflight` signatures and the `PostflightSignals` schema.

docs/python-api.md

Lines changed: 141 additions & 0 deletions
@@ -462,6 +462,7 @@ class DedupeResult:
462462
stats: dict # total_records, total_clusters, match_rate
463463
scored_pairs: list[tuple] # (id_a, id_b, score) tuples
464464
config: GoldenMatchConfig
465+
postflight_report: PostflightReport | None # v1.5.0: signals + advisories + adjustments
465466

466467
def to_csv(path, which="golden") # Write results to CSV
467468
match_rate: float # Property: percentage of dupes
@@ -479,6 +480,7 @@ class MatchResult:
479480
matched: pl.DataFrame | None # Matched target records with scores
480481
unmatched: pl.DataFrame | None # Unmatched target records
481482
stats: dict
483+
postflight_report: PostflightReport | None # v1.5.0: signals + advisories + adjustments
482484

483485
def to_csv(path)
484486
```
@@ -680,9 +682,148 @@ gm.profile_dataframe(df) -> dict
680682

681683
```python
682684
gm.auto_configure(file_specs) -> GoldenMatchConfig
685+
gm.auto_configure_df(
686+
df,
687+
llm_provider=None,
688+
domain_config=None,
689+
llm_auto=False,
690+
strict=False, # v1.5.0
691+
allow_remote_assets=False, # v1.5.0
692+
) -> GoldenMatchConfig
683693
gm.suggest_threshold(df, matchkey) -> float
684694
```
685695

696+
New v1.5.0 kwargs on `auto_configure_df`:
697+
698+
- `strict` — compute postflight signals and emit advisories, but suppress auto-adjustments (threshold nudges, etc.). Use for DQBench / regression / reproducibility runs.
699+
- `allow_remote_assets` — permit `embedding`, `record_embedding`, and cross-encoder rerank scorers. Default `False` demotes them so auto-config is offline-safe and never triggers a surprise HuggingFace download.
700+
701+
The returned config carries `config._preflight_report: PreflightReport` (underscore — private-by-convention but stable across v1.5.x).
702+
703+
---
704+
705+
## Verification (v1.5.0)
706+
707+
The preflight + postflight layer validates an auto-generated config against the data it was built for. `auto_configure_df` runs preflight automatically at the end; the pipeline runs postflight automatically after scoring. Both are also callable directly.
708+
709+
### preflight
710+
711+
```python
712+
gm.preflight(
713+
df: pl.DataFrame,
714+
config: GoldenMatchConfig,
715+
*,
716+
profiles: list[ColumnProfile] | None = None,
717+
allow_remote_assets: bool = False,
718+
) -> PreflightReport
719+
```
720+
721+
Runs 6 checks on `(df, config)`:
722+
723+
1. **Column resolution** — every column referenced by blocking/matchkeys exists, or is a pipeline-synthesized `__mk_*`, or is a domain-extracted column recoverable by enabling `config.domain` (auto-repaired when a domain profile was stashed during auto-config).
724+
2. **Exact-matchkey cardinality** — drops keys with ratio >= 0.99 (near-unique, no pair ever agrees) or < 0.01 (near-constant, produces giant blocks).
725+
3. **Block-size sanity** — samples blocking keys and flags blocks that would stall the scorer.
726+
4. **Remote-asset demotion**: `embedding` / `record_embedding` / cross-encoder rerank scorers are demoted unless `allow_remote_assets=True`.
727+
5. **Weight confidence capping** — matchkey fields with profile confidence < 0.5 cap at weight 0.3 (requires `profiles` kwarg).
728+
6. **Domain auto-repair** — when a column like `__title_key__` is missing but a domain profile is available, enables `config.domain` so the pipeline produces the column at runtime.
729+
730+
Auto-repairs what it can (setting `finding.repaired = True`) and records unrepairable issues as `severity="error"` findings. `auto_configure_df` raises `ConfigValidationError` if `report.has_errors`.
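
To make check 2 concrete, here is a minimal standalone sketch of a cardinality-ratio filter. It is not GoldenMatch's implementation: `cardinality_ratio` and `classify_exact_key` are hypothetical helpers, and the ratio is assumed to be distinct non-null values divided by non-null rows. Only the 0.99 and 0.01 cutoffs come from the list above.

```python
from collections.abc import Sequence


def cardinality_ratio(values: Sequence[object]) -> float:
    """Distinct non-null values divided by non-null rows (assumed definition)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def classify_exact_key(values: Sequence[object]) -> str:
    """Return 'near_unique', 'near_constant', or 'ok' using the documented cutoffs."""
    ratio = cardinality_ratio(values)
    if ratio >= 0.99:
        return "near_unique"    # almost every value is distinct, so pairs rarely agree
    if ratio < 0.01:
        return "near_constant"  # almost every row shares a value, so blocks get huge
    return "ok"


emails = [f"user{i}@example.com" for i in range(1000)]  # 1000 distinct values
country = ["US"] * 1000                                  # 1 distinct value
city = [f"city{i % 50}" for i in range(1000)]            # 50 distinct values

print(classify_exact_key(emails))   # near_unique
print(classify_exact_key(country))  # near_constant
print(classify_exact_key(city))     # ok
```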
731+
732+
```python
733+
report = gm.preflight(df, config)
734+
for f in report.findings:
735+
print(f"[{f.severity}] {f.check}: {f.message} (repaired={f.repaired})")
736+
```
737+
738+
### postflight
739+
740+
```python
741+
gm.postflight(
742+
df: pl.DataFrame,
743+
config: GoldenMatchConfig,
744+
*,
745+
pair_scores: list[tuple[int, int, float]],
746+
current_threshold: float | None = None,
747+
) -> PostflightReport
748+
```
749+
750+
Runs 4 signals on scored pairs:
751+
752+
- **Score histogram + bimodality** — if the score distribution is clearly bimodal (valley depth ratio < 0.5) and the valley is > 0.05 away from the current threshold, emits a `PostflightAdjustment` nudging the threshold to the valley. Suppressed under `strict=True`.
753+
- **Blocking recall estimate** — gated at >= 10K rows; returns `"deferred"` below that.
754+
- **Preliminary cluster sizes + oversized-cluster bottleneck pair** — p50/p95/p99/max plus a list of oversized clusters with their weakest edge.
755+
- **Threshold-band overlap** — fraction of pairs within 0.02 of the threshold. Advises `--llm-auto` when > 20% and LLM scorer is off.
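
The threshold-band overlap signal can be approximated in a few lines. The sketch below is illustrative only: `threshold_overlap_pct` and `should_advise_llm` are hypothetical names, and treating the 0.02 band as inclusive is an assumption. The band width and the 20% advisory cutoff are the values quoted above.

```python
def threshold_overlap_pct(
    pair_scores: list[tuple[int, int, float]],
    threshold: float,
    band: float = 0.02,
) -> float:
    """Fraction of scored pairs whose score lies within `band` of `threshold`."""
    if not pair_scores:
        return 0.0
    near = sum(1 for _, _, score in pair_scores if abs(score - threshold) <= band)
    return near / len(pair_scores)


def should_advise_llm(overlap: float, *, llm_enabled: bool) -> bool:
    """Advise an LLM scorer when more than 20% of pairs sit near the threshold."""
    return overlap > 0.20 and not llm_enabled


# Hypothetical scored pairs: (id_a, id_b, score)
pairs = [
    (0, 1, 0.84), (2, 3, 0.86), (4, 5, 0.855), (6, 7, 0.50),
    (8, 9, 0.95), (10, 11, 0.30), (12, 13, 0.99), (14, 15, 0.70),
]

overlap = threshold_overlap_pct(pairs, threshold=0.85)
print(f"{overlap:.1%}")                               # 37.5%
print(should_advise_llm(overlap, llm_enabled=False))  # True
```

A high overlap means many pairs are borderline, which is where a second-opinion scorer is most likely to change the outcome.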
756+
757+
```python
758+
# `scored` is a list of (id_a, id_b, score) tuples
report = gm.postflight(df, config, pair_scores=scored, current_threshold=0.85)
759+
print(report.signals["threshold_overlap_pct"])
760+
for adj in report.adjustments:
761+
print(f"{adj.field}: {adj.from_value} -> {adj.to_value} ({adj.reason})")
762+
```
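
The bimodality nudge can be sketched under stated assumptions. The source does not define "valley depth ratio", so this example assumes it is the valley bin's count divided by the lower of the tallest bins on either side, and it assumes the histogram is given as bin centers plus counts. `deepest_valley` and `propose_threshold_nudge` are hypothetical helpers, not part of the GoldenMatch API. The 0.5 and 0.05 cutoffs and the strict-mode suppression follow the description above.

```python
from __future__ import annotations


def deepest_valley(counts: list[int]) -> tuple[int, float] | None:
    """Return (bin index, depth ratio) of the deepest interior valley, if any."""
    best: tuple[int, float] | None = None
    for i in range(1, len(counts) - 1):
        # Lower of the tallest bins to the left and to the right of bin i.
        lower_peak = min(max(counts[:i]), max(counts[i + 1:]))
        if lower_peak == 0:
            continue
        ratio = counts[i] / lower_peak
        if best is None or ratio < best[1]:
            best = (i, ratio)
    return best


def propose_threshold_nudge(
    centers: list[float],
    counts: list[int],
    current_threshold: float,
    *,
    strict: bool = False,
) -> float | None:
    """Return the valley score to move the threshold to, or None for no change."""
    valley = deepest_valley(counts)
    if valley is None:
        return None
    index, ratio = valley
    clearly_bimodal = ratio < 0.5
    far_enough = abs(centers[index] - current_threshold) > 0.05
    if strict or not (clearly_bimodal and far_enough):
        return None  # strict mode reports signals but never adjusts
    return centers[index]


centers = [0.1, 0.3, 0.5, 0.7, 0.9]
bimodal = [40, 10, 2, 12, 50]   # two peaks with a clear dip at 0.5
unimodal = [1, 5, 10, 5, 1]     # single peak, no valley

print(propose_threshold_nudge(centers, bimodal, 0.85))               # 0.5
print(propose_threshold_nudge(centers, bimodal, 0.85, strict=True))  # None
print(propose_threshold_nudge(centers, unimodal, 0.85))              # None
```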
763+
764+
### Report shapes
765+
766+
```python
767+
@dataclass
768+
class PreflightFinding:
769+
check: str # "missing_column" | "cardinality" | "block_size" |
770+
# "remote_asset" | "weight_confidence"
771+
severity: str # "error" | "warning" | "info"
772+
subject: str # column / matchkey name
773+
message: str
774+
repaired: bool
775+
repair_note: str | None
776+
777+
@dataclass
778+
class PreflightReport:
779+
findings: list[PreflightFinding]
780+
config_was_modified: bool
781+
has_errors: bool # property: True if any unrepaired error
782+
783+
class ConfigValidationError(Exception):
784+
report: PreflightReport # full report attached for programmatic inspection
785+
786+
@dataclass
787+
class PostflightAdjustment:
788+
field: str # e.g. "threshold"
789+
from_value: Any
790+
to_value: Any
791+
reason: str
792+
signal: str # which signal motivated the change
793+
794+
@dataclass
795+
class PostflightReport:
796+
signals: PostflightSignals # TypedDict, schema below
797+
adjustments: list[PostflightAdjustment]
798+
advisories: list[str]
799+
```
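
The shapes above list `has_errors` as a property and `report` as an attribute on the exception. The sketch below shows one way those pieces could fit together, using local stand-in classes rather than the real GoldenMatch types. The field defaults, the `ConfigValidationError` constructor signature, and the `raise_if_invalid` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PreflightFinding:  # stand-in mirroring the documented fields
    check: str
    severity: str
    subject: str
    message: str
    repaired: bool = False
    repair_note: str | None = None


@dataclass
class PreflightReport:  # stand-in
    findings: list[PreflightFinding] = field(default_factory=list)
    config_was_modified: bool = False

    @property
    def has_errors(self) -> bool:
        """True if any error-severity finding was not repaired."""
        return any(f.severity == "error" and not f.repaired for f in self.findings)


class ConfigValidationError(Exception):  # stand-in; constructor is assumed
    def __init__(self, message: str, report: PreflightReport) -> None:
        super().__init__(message)
        self.report = report


def raise_if_invalid(report: PreflightReport) -> None:
    """Hypothetical helper: raise when the report holds unrepaired errors."""
    if report.has_errors:
        raise ConfigValidationError("preflight found unrepaired errors", report)


repaired_only = PreflightReport(
    findings=[PreflightFinding("missing_column", "error", "__title_key__",
                               "enabled config.domain", repaired=True)],
    config_was_modified=True,
)
broken = PreflightReport(
    findings=[PreflightFinding("missing_column", "error", "isbn",
                               "column not found in dataframe")],
)

raise_if_invalid(repaired_only)  # passes: the only error was repaired

try:
    raise_if_invalid(broken)
except ConfigValidationError as err:
    unrepaired = [f.subject for f in err.report.findings if not f.repaired]
    print(unrepaired)  # ['isbn']
```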
800+
801+
### PostflightSignals schema
802+
803+
The `signals` dict is a stable TypedDict contract (defined in `goldenmatch/core/autoconfig_verify.py`):
804+
805+
```python
806+
class PostflightSignals(TypedDict):
807+
score_histogram: ScoreHistogram # {"bins": list[float], "counts": list[int]}
808+
blocking_recall: float | Literal["deferred"] # "deferred" when <10K rows
809+
block_size_percentiles: BlockSizePercentiles # {"p50", "p95", "p99", "max"}
810+
threshold_overlap_pct: float # fraction of pairs within 0.02 of threshold
811+
total_pairs_scored: int
812+
current_threshold: float
813+
preliminary_cluster_sizes: ClusterSizePercentiles
814+
# {"p50", "p95", "p99", "max", "count"}
815+
oversized_clusters: list[OversizedCluster]
816+
# each: {"cluster_id": int, "size": int, "bottleneck_pair": [int, int]}
817+
```
818+
819+
`ScoreHistogram`, `BlockSizePercentiles`, `ClusterSizePercentiles`, `OversizedCluster` are all TypedDicts — import them from `goldenmatch.core.autoconfig_verify` if you want to type-check consumer code.
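
Consumer code has to handle the `"deferred"` sentinel before formatting `blocking_recall` as a number. The sketch below shows that branch using local stand-in TypedDicts that copy a subset of the schema above. `SignalsSubset` and `summarize` are hypothetical names and the sample values are made up.

```python
from __future__ import annotations

from typing import Literal, TypedDict


class OversizedCluster(TypedDict):  # stand-in copy of the documented shape
    cluster_id: int
    size: int
    bottleneck_pair: list[int]


class SignalsSubset(TypedDict):  # hypothetical subset of PostflightSignals
    blocking_recall: float | Literal["deferred"]
    threshold_overlap_pct: float
    total_pairs_scored: int
    oversized_clusters: list[OversizedCluster]


def summarize(signals: SignalsSubset) -> list[str]:
    """Render the signals as human-readable lines."""
    lines = [f"pairs scored: {signals['total_pairs_scored']}"]
    recall = signals["blocking_recall"]
    if recall == "deferred":
        # Check the sentinel first: a string cannot take a float format spec.
        lines.append("blocking recall: deferred (fewer than 10K rows)")
    else:
        lines.append(f"blocking recall: {recall:.1%}")
    lines.append(f"threshold overlap: {signals['threshold_overlap_pct']:.1%}")
    for cluster in signals["oversized_clusters"]:
        a, b = cluster["bottleneck_pair"]
        lines.append(
            f"cluster {cluster['cluster_id']} ({cluster['size']} rows): weakest edge {a}-{b}"
        )
    return lines


sample: SignalsSubset = {
    "blocking_recall": "deferred",
    "threshold_overlap_pct": 0.125,
    "total_pairs_scored": 1200,
    "oversized_clusters": [
        {"cluster_id": 7, "size": 340, "bottleneck_pair": [12, 981]},
    ],
}

for line in summarize(sample):
    print(line)
```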
820+
821+
### Where the reports live
822+
823+
- `config._preflight_report: PreflightReport | None` — set by `auto_configure_df`. Underscore-prefixed, documented as private-by-convention; stable contract.
824+
- `DedupeResult.postflight_report: PostflightReport | None` — set by the pipeline after scoring.
825+
- `MatchResult.postflight_report: PostflightReport | None` — same for match flows.
826+
686827
---
687828

688829
## Active learning

docs/quick-start.md

Lines changed: 24 additions & 0 deletions
@@ -49,6 +49,30 @@ result.golden # Polars DataFrame of canonical records
4949

5050
---
5151

52+
## Inspecting the verification report (v1.5.0)
53+
54+
Zero-config runs attach a `PostflightReport` to the result — score-distribution signals, cluster-size percentiles, threshold-band overlap, plus any auto-applied adjustments and human-readable advisories.
55+
56+
```python
57+
result = gm.dedupe_df(df)
58+
if result.postflight_report:
59+
for adv in result.postflight_report.advisories:
60+
print(f"advisory: {adv}")
61+
for adj in result.postflight_report.adjustments:
62+
print(f"adjusted {adj.field}: {adj.from_value} -> {adj.to_value} ({adj.reason})")
63+
```
64+
65+
The auto-generated config also carries a `PreflightReport` for the checks that ran during `auto_configure_df`:
66+
67+
```python
68+
for finding in result.config._preflight_report.findings:
69+
print(f"[{finding.severity}] {finding.check}: {finding.message}")
70+
```
71+
72+
See [Verification](python-api.html#verification-v150) in the Python API docs for the full signatures and signal schema.
73+
74+
---
75+
5276
## Match two files
5377

5478
```python

examples/README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ python basic_dedupe.py
2626
| `agent_demo.py` | Autonomous ER agent with confidence gating and review queue | goldenmatch |
2727
| `benchmark.py` | DQBench ER benchmark (precision, recall, F1, throughput) | goldenmatch, dqbench |
2828
| `equipment_dedup.py` | Equipment/auction dedup: multi-pass blocking, ANN fallback, weighted fuzzy, LLM calibration | goldenmatch, OPENAI_API_KEY |
29+
| `verification_inspection.py` | v1.5.0 preflight + postflight walkthrough -- inspect findings, signals, advisories, and adjustments | goldenmatch |
30+
| `strict_mode_parity.py` | v1.5.0 `strict=True` for deterministic parity / regression runs | goldenmatch |
2931

3032
## For Coding AIs
3133
