Bumps version to 1.5.0 and documents the preflight + postflight
verification layer shipped in PR #44:
- README: new 'Auto-Config Verification (v1.5.0)' section.
- docs/python-api.md: public surface for preflight / postflight /
reports / new kwargs (strict, allow_remote_assets).
- docs/configuration.md: verification subsection.
- docs/quick-start.md: 'Inspecting the verification report' pattern.
- examples/verification_inspection.py: end-to-end walkthrough of
preflight findings + postflight signals.
- examples/strict_mode_parity.py: deterministic parity runs with
strict=True.
- examples/README.md: updated with new examples.
Harmonizes version between pyproject.toml (was 1.4.5) and
goldenmatch/__init__.py (was 1.4.4) at 1.5.0.
[](https://colab.research.google.com/github/benzsevern/goldenmatch/blob/main/scripts/gpu_colab_notebook.ipynb)

> **v1.5.0 is out.** Auto-config now runs a preflight + postflight verification layer. Bibliographic and domain-extracted schemas no longer crash under zero-config, remote-asset scorers are demoted by default, and every `DedupeResult` carries an inspectable `postflight_report`. See [Auto-Config Verification](#auto-config-verification-v150).

---

## Why GoldenMatch?
@@ -191,6 +193,70 @@ result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)
```python
result = gm.dedupe("huge.parquet", exact=["email"], backend="ray")
```
### Auto-Config Verification (v1.5.0)

Zero-config used to crash on bibliographic and domain-extracted schemas: auto-config would emit a matchkey referencing `__title_key__` without enabling `config.domain`, and the pipeline would raise `ValueError: Missing required columns`. v1.5.0 closes the gap with a preflight + postflight verification layer that runs automatically around `auto_configure_df`.

**Preflight** (`gm.preflight`) runs 6 checks at the end of `auto_configure_df`:

- column resolution (auto-repairs missing domain-extracted columns by enabling `config.domain`)
- cardinality bounds on exact matchkeys (drops near-unique and near-constant keys)
- block-size sanity (flags blocks that would stall the scorer)
- remote-asset demotion (any `embedding`, `record_embedding`, or cross-encoder rerank scorer is demoted unless you pass `allow_remote_assets=True`)
- confidence-gated weight capping (fields with confidence below 0.5 are capped at weight 0.3)

Unrepairable issues raise `ConfigValidationError` with the full `PreflightReport` attached as `err.report`. Repaired issues stay on the report as `findings` with `repaired=True`.
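
The split between repaired and unrepairable findings can be sketched without installing the library. The classes below are hypothetical stand-ins, not GoldenMatch's implementation: only the names `ConfigValidationError`, `PreflightReport`, `findings`, `repaired`, `has_errors`, and `err.report` come from these docs, while the `check` and `message` fields, the `"warning"` severity, and `finish_auto_config` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    check: str            # hypothetical field: which check produced the finding
    severity: str         # "error" marks an unrepairable issue
    message: str          # hypothetical field: human-readable detail
    repaired: bool = False


@dataclass
class PreflightReport:
    findings: list[Finding] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        # Only unrepaired "error" findings block the run.
        return any(f.severity == "error" and not f.repaired for f in self.findings)


class ConfigValidationError(Exception):
    def __init__(self, report: PreflightReport):
        super().__init__("preflight found unrepairable issues")
        self.report = report  # the full report travels with the exception


def finish_auto_config(report: PreflightReport) -> PreflightReport:
    """Stand-in for the last step of auto-config: raise on errors, else return."""
    if report.has_errors:
        raise ConfigValidationError(report)
    return report


repaired_only = PreflightReport([
    Finding("column_resolution", "warning", "enabled config.domain", repaired=True),
])
broken = PreflightReport([
    Finding("column_resolution", "error", "column 'isbn' not found"),
])

ok_report = finish_auto_config(repaired_only)   # repaired findings stay on the report

try:
    finish_auto_config(broken)
    caught = None
except ConfigValidationError as err:
    caught = err.report                         # inspect what could not be repaired
```

In real code the same `try` / `except` shape would wrap `gm.auto_configure_df(df)`, and the handler would read `err.report.findings` to decide whether to fix the data or hand-write a config.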

**Postflight** (`gm.postflight`) runs 4 signals after scoring, before clustering:

- score histogram + bimodality
- blocking recall
- cluster sizes + bottleneck pairs
- threshold-band overlap

**Offline by default.** Remote-asset scorers are demoted unless you opt in:

```python
cfg = gm.auto_configure_df(df, allow_remote_assets=True)  # loads cross-encoder etc.
```

**Strict mode for parity runs.** `strict=True` still computes postflight signals and emits advisories, but skips threshold adjustments. Use it for DQBench, regression suites, and any run that needs reproducible output:

```python
cfg = gm.auto_configure_df(df, strict=True)
```

**New classifier smarts in v1.5.0:**

- Columns with a cardinality ratio ≥ 0.95 are classified as `identifier`, not `phone` / `zip` / `numeric`.
- New `year` col_type routes to blocking, not scoring.
- New `multi_name` col_type handles comma/semicolon-delimited author-style fields.
- Low-confidence fields (confidence < 0.5) are capped at weight 0.3.
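
As a rough illustration of the first and last rules in that list, here is a minimal sketch. It assumes that cardinality means distinct non-null values divided by non-null count, and that the `identifier` override applies only to columns first guessed as `phone`, `zip`, or `numeric`. The function names are hypothetical and are not part of the GoldenMatch API.

```python
def cardinality_ratio(values):
    """Distinct non-null values divided by non-null count (assumed definition)."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0


def reclassify(guess, values):
    """Promote near-unique phone/zip/numeric guesses to 'identifier'."""
    if guess in {"phone", "zip", "numeric"} and cardinality_ratio(values) >= 0.95:
        return "identifier"
    return guess


def capped_weight(weight, confidence):
    """Cap the weight at 0.3 when classifier confidence is below 0.5."""
    return min(weight, 0.3) if confidence < 0.5 else weight


account_ids = [str(100000 + i) for i in range(100)]   # every value distinct
zips = ["10001", "10002", "10003", "10001"] * 25       # only 3 distinct values

id_type = reclassify("numeric", account_ids)   # near-unique, so treated as an identifier
zip_type = reclassify("zip", zips)             # low cardinality, so the guess stands
low_conf = capped_weight(1.0, 0.4)             # capped
high_conf = capped_weight(1.0, 0.9)            # left alone
```

The point of the override is that a near-unique numeric column behaves like a key rather than a measurement, so scoring it as a phone number or ZIP code would only add noise.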

See `examples/verification_inspection.py` and `examples/strict_mode_parity.py` for runnable walkthroughs.

`auto_configure_df` runs **preflight** at the end of config generation: 6 checks that auto-repair missing domain-extracted columns, drop useless-cardinality exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise `ConfigValidationError`, with the full report attached to the exception as `err.report`.

The pipeline runs **postflight** after scoring and before clustering. Its 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap) can auto-nudge the threshold on clear bimodality, and the report is attached to `DedupeResult.postflight_report` / `MatchResult.postflight_report`.

Two new kwargs on `auto_configure_df`:

```python
import goldenmatch as gm

# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
cfg = gm.auto_configure_df(df)

# Opt in to cross-encoder rerank / embedding scorers
cfg = gm.auto_configure_df(df, allow_remote_assets=True)
```

See the [Verification section in the Python API docs](python-api.html#verification-v150) for the full `preflight` / `postflight` signatures and the `PostflightSignals` schema.

- `strict`: compute postflight signals and emit advisories, but suppress auto-adjustments (threshold nudges, etc.). Use for DQBench / regression / reproducibility runs.
- `allow_remote_assets`: permit `embedding`, `record_embedding`, and cross-encoder rerank scorers. The default, `False`, demotes them so auto-config is offline-safe and never triggers a surprise HuggingFace download.

The returned config carries `config._preflight_report: PreflightReport` (underscore-prefixed: private by convention, but stable across v1.5.x).

---

## Verification (v1.5.0)

The preflight + postflight layer validates an auto-generated config against the data it was built for. `auto_configure_df` runs preflight automatically at the end; the pipeline runs postflight automatically after scoring. Both are also callable directly.

### preflight

```python
gm.preflight(
    df: pl.DataFrame,
    config: GoldenMatchConfig,
    *,
    profiles: list[ColumnProfile] | None = None,
    allow_remote_assets: bool = False,
) -> PreflightReport
```

Runs 6 checks on `(df, config)`:

1. **Column resolution**: every column referenced by blocking/matchkeys exists, or is a pipeline-synthesized `__mk_*`, or is a domain-extracted column recoverable by enabling `config.domain` (auto-repaired when a domain profile was stashed during auto-config).
2. **Exact-matchkey cardinality**: drops keys with a cardinality ratio >= 0.99 (near-unique, so almost no pair agrees) or < 0.01 (near-constant, so blocks become huge).
3. **Block-size sanity**: samples blocking keys and flags blocks that would stall the scorer.
4. **Remote-asset demotion**: `embedding`, `record_embedding`, and cross-encoder rerank scorers are demoted unless `allow_remote_assets=True`.
5. **Weight confidence capping**: matchkey fields with profile confidence < 0.5 are capped at weight 0.3 (requires the `profiles` kwarg).
6. **Domain auto-repair**: when a column like `__title_key__` is missing but a domain profile is available, enables `config.domain` so the pipeline produces the column at runtime.

Preflight auto-repairs what it can (setting `finding.repaired = True`) and records unrepairable issues as `severity="error"` findings. `auto_configure_df` raises `ConfigValidationError` if `report.has_errors`.
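
Checks 1 and 6 together amount to a three-way decision per referenced column. The sketch below shows that decision under stated assumptions: `resolve_columns`, the dict-shaped findings, the `"warning"` severity for repaired findings, and the explicit `domain_columns` argument are all hypothetical, since the real check works on a `GoldenMatchConfig` and a stashed domain profile.

```python
def resolve_columns(referenced, df_columns, domain_columns, domain_enabled=False):
    """Return (findings, domain_enabled) for the columns a config references.

    domain_columns: names a stashed domain profile could produce at runtime
    (hypothetical input; the real check reads this from the config).
    """
    findings = []
    for col in referenced:
        if col in df_columns or col.startswith("__mk_"):
            continue                      # present, or synthesized by the pipeline
        if col in domain_columns:
            domain_enabled = True         # auto-repair: let the pipeline extract it
            findings.append({"column": col, "severity": "warning", "repaired": True})
        else:
            findings.append({"column": col, "severity": "error", "repaired": False})
    return findings, domain_enabled


findings, domain_on = resolve_columns(
    referenced=["email", "__mk_0", "__title_key__", "isbn"],
    df_columns={"email", "title", "authors"},
    domain_columns={"__title_key__"},
)
has_errors = any(f["severity"] == "error" for f in findings)
```

Here `__title_key__` is repaired by switching domain extraction on, while `isbn` has no source at all and is left as an error, which is the case where `auto_configure_df` would raise.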
- **Score histogram + bimodality**: if the score distribution is clearly bimodal (valley depth ratio < 0.5) and the valley is > 0.05 away from the current threshold, emits a `PostflightAdjustment` nudging the threshold to the valley. Suppressed under `strict=True`.
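
To make the two cutoffs concrete, here is a minimal sketch of a threshold nudge over a fixed-bin histogram. It is not the library's algorithm: the peak detection, the definition of valley depth ratio as the valley count divided by the smaller peak count, and the name `suggest_threshold` are all assumptions. Only the 0.5 and 0.05 cutoffs and the behaviour of `strict` come from the description above.

```python
def suggest_threshold(counts, threshold, strict=False):
    """Return (new_threshold, advisory) for a score histogram over [0, 1].

    counts[i] is the number of pair scores in bin i of len(counts) equal bins.
    """
    n = len(counts)
    # Local maxima: non-empty bins at least as tall as both neighbours.
    peaks = [
        i for i in range(n)
        if counts[i] > 0
        and (i == 0 or counts[i] >= counts[i - 1])
        and (i == n - 1 or counts[i] >= counts[i + 1])
    ]
    if len(peaks) < 2:
        return threshold, None
    # Keep the two tallest peaks, ordered left to right.
    left, right = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    if right - left < 2:
        return threshold, None            # no bin between the peaks
    valley = min(range(left + 1, right), key=lambda i: counts[i])
    depth_ratio = counts[valley] / min(counts[left], counts[right])  # assumed definition
    valley_score = (valley + 0.5) / n     # centre of the valley bin
    if depth_ratio >= 0.5 or abs(valley_score - threshold) <= 0.05:
        return threshold, None
    advisory = f"bimodal scores: valley at {valley_score:.2f}"
    # Strict mode reports the signal but leaves the threshold alone.
    return (threshold if strict else valley_score), advisory


# Non-matches pile up near 0.25, matches near 0.85, with a gap around 0.55.
hist = [5, 40, 90, 40, 10, 2, 8, 30, 80, 60]

nudged, note = suggest_threshold(hist, threshold=0.85)
pinned, strict_note = suggest_threshold(hist, threshold=0.85, strict=True)
```

Both calls produce the same advisory; only the non-strict call moves the threshold, which is what makes `strict=True` suitable for parity runs.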
`ScoreHistogram`, `BlockSizePercentiles`, `ClusterSizePercentiles`, `OversizedCluster` are all TypedDicts — import them from `goldenmatch.core.autoconfig_verify` if you want to type-check consumer code.

### Where the reports live

- `config._preflight_report: PreflightReport | None` is set by `auto_configure_df`. It is underscore-prefixed and private by convention, but documented as a stable contract.
- `DedupeResult.postflight_report: PostflightReport | None` is set by the pipeline after scoring.
- `MatchResult.postflight_report: PostflightReport | None` is the same, for match flows.
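
Because all three attributes are typed `| None`, consumer code should guard before reading them. A minimal sketch of that guard follows. `SimpleNamespace` objects stand in for a real config and result, and `summarize_reports` is a hypothetical helper, not part of the GoldenMatch API.

```python
from types import SimpleNamespace


def summarize_reports(config, result):
    """Collect whichever verification reports are present; None means none attached."""
    preflight = getattr(config, "_preflight_report", None)
    postflight = getattr(result, "postflight_report", None)
    return {
        "preflight_findings": len(preflight.findings) if preflight is not None else 0,
        "has_postflight": postflight is not None,
    }


# Stand-ins for a GoldenMatchConfig and a DedupeResult (hypothetical shapes).
config = SimpleNamespace(_preflight_report=SimpleNamespace(findings=["f1", "f2"]))
result_without_report = SimpleNamespace(postflight_report=None)

summary = summarize_reports(config, result_without_report)
```

Using `getattr` with a default also keeps the helper working on objects that predate v1.5.0 and lack the attribute entirely.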
docs/quick-start.md: 24 additions, 0 deletions
@@ -49,6 +49,30 @@ result.golden # Polars DataFrame of canonical records

---

## Inspecting the verification report (v1.5.0)

Zero-config runs attach a `PostflightReport` to the result: score-distribution signals, cluster-size percentiles, threshold-band overlap, plus any auto-applied adjustments and human-readable advisories.