Auto Config Controller
GoldenMatch's auto-config now produces a defensible match/dedupe config the first time, on data shapes it has never been hand-tuned for. No manual blocking-key picks, no scorer-weight tuning, no threshold sweeps. Just hand it a DataFrame.

```python
import goldenmatch as gm
import polars as pl

df = pl.read_csv("customers.csv")
result = gm.dedupe_df(df)  # zero config: the controller picks the rest
print(result.clusters)
```

When you call `auto_configure_df(df)` (directly, or via `dedupe_df`/`match_df` with no config), the controller:
- Computes a v0 config via the legacy heuristic (column profiling, blocking-key candidate selection, default scorer weights).
- Takes a stratified sample (default 2000 rows; full data when `n_rows < 5000`).
- Runs the pipeline on the sample under a `profile_capture()` context, where instrumented stages emit a typed `ComplexityProfile`:
  - `BlockingProfile`: block size distribution, reduction ratio, comparison count
  - `ScoringProfile`: score histogram, dip statistic, mass above/in-borderline, candidates compared, random-pair recall probe
  - `ClusterProfile`: cluster size distribution, transitivity rate, edge-confidence distribution
  - `MatchkeyProfile`: per-field post-transform cardinality
  - `DataProfile`, `DomainProfile`, `ProfileMeta`
- Asks the policy for a refit. The default `HeuristicRefitPolicy` runs 10 ordered rules:

| Rule | Fires when | Action |
|---|---|---|
| `rule_blocking_field_null_heavy` | blocking field's null_rate > 0.10 | multi-pass on low-null alternate |
| `rule_blocking_singleton_trap` | blocking produced 0 candidates | swap to first_token on dominant text |
| `rule_blocking_key_swap` | mass_above==0 with prior decision | swap blocking key + drop stale derived matchkeys |
| `rule_blocking_too_coarse` | block_sizes_p99 > 10× avg | more selective key |
| `rule_uniform_heavy_blocking` | uniform-large blocks + borderline-heavy | switch to high-cardinality identity column |
| `rule_unimodal_scoring` | dip_statistic < 0.01 | swap scorer to ensemble |
| `rule_low_reduction_ratio` | reduction_ratio < 0.5 | add multi-pass with soundex |
| `rule_low_transitivity` | transitivity_rate < 0.85 | lower threshold by 0.05 |
| `rule_no_matches` | mass_above_threshold == 0.0 | permissive baseline |
| `rule_recall_gap_suspected` | random-pair probe > 0.05 OR over-tight signature | multi-pass on orthogonal field |

- Iterates until the profile is GREEN, the distance to the previous profile converges, or the budget is exhausted (default 3 iterations).
- Optionally decorates the committed config with `LLMScorerConfig` when an API key is available and the profile shows borderline-heavy mass; bounds adapt dynamically to the matchkey's threshold.
- Finalizes on full data and returns the config, a `ComplexityProfile` snapshot, and a `RunHistory` audit trail.
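
The steps above can be sketched in plain Python. This is a minimal, self-contained illustration under stated assumptions, not GoldenMatch's implementation: `Profile`, `choose_sample_size`, `run_controller`, `fake_evaluate`, and the "first rule that fires wins" ordering are hypothetical stand-ins. Only the defaults (2000-row sample, full data below 5000 rows, 3 iterations, the 0.85 transitivity cutoff, and the 0.05 threshold step) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a ComplexityProfile rollup."""
    transitivity_rate: float

    def is_green(self) -> bool:
        return self.transitivity_rate >= 0.85


def choose_sample_size(n_rows: int, default: int = 2000, full_below: int = 5000) -> int:
    """Use the full data for small inputs, otherwise the default sample size."""
    return n_rows if n_rows < full_below else default


def rule_low_transitivity(config: dict, profile: Profile) -> Optional[dict]:
    """Toy refit rule: lower the threshold by 0.05 when transitivity is low."""
    if profile.transitivity_rate < 0.85:
        return {**config, "threshold": round(config["threshold"] - 0.05, 2)}
    return None


def run_controller(config, evaluate, rules, max_iterations=3, tol=1e-3):
    """Refit until the profile is GREEN, stops moving, or the budget runs out."""
    history, previous = [], None
    for iteration in range(max_iterations):
        profile = evaluate(config)
        history.append((iteration, config, profile))
        if profile.is_green():
            return config, "green", history
        if previous is not None and abs(
            profile.transitivity_rate - previous.transitivity_rate
        ) < tol:
            return config, "converged", history
        if iteration == max_iterations - 1:
            break  # budget exhausted: keep the last evaluated config
        proposal = None
        for rule in rules:  # assumption: the first rule that fires wins
            proposal = rule(config, profile)
            if proposal is not None:
                break
        if proposal is None:
            return config, "no_rule_fired", history
        config, previous = proposal, profile
    return config, "budget_exhausted", history


def fake_evaluate(config: dict) -> Profile:
    """Toy pipeline: a lower threshold yields a higher transitivity rate."""
    return Profile(transitivity_rate=round(1.55 - config["threshold"], 2))


final_config, status, history = run_controller(
    {"threshold": 0.80}, fake_evaluate, [rule_low_transitivity]
)
print(status, final_config, len(history))
```

In this toy run the threshold is lowered twice before the profile turns GREEN on the third evaluation, so the whole default budget is used. The real controller tracks a richer distance between profiles than the single number compared here.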

```python
result = gm.dedupe_df(df)
history = result.postflight_report.controller_history

print(f"iterations: {history.iteration}")
for entry in history.entries:
    # Each entry records the profile health for one controller iteration.
    print(f" iter {entry.iteration}: {entry.profile.health()}")
    if entry.decision:
        # A decision is present only when a refit rule fired.
        print(f" rule fired: {entry.decision.rule_name}")
        print(f" rationale: {entry.decision.rationale}")
```

By default, past committed configs persist in `~/.goldenmatch/autoconfig_memory.db`. When you re-run on data with the same shape, the controller starts from the cached config (it still verifies that the config iterates to GREEN on the new sample). Opt out:

```bash
GOLDENMATCH_AUTOCONFIG_MEMORY=0 python my_script.py
```

When the heuristic rules can't reach GREEN on tricky data, you can optionally enable an LLM-driven policy that proposes a config diff:

```bash
GOLDENMATCH_AUTOCONFIG_LLM=1 OPENAI_API_KEY=sk-... python my_script.py
```

The LLM policy makes O(1) calls per controller iteration (typically 0–2 calls per `auto_configure_df`). It falls back silently when no API key is set.
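
How the two environment switches combine can be sketched as follows. This is an illustrative guess at the decision logic, not the library's code: `resolve_controller_options` is a hypothetical helper, and the exact parsing (memory on unless the value is `"0"`, LLM fallback on only when the value is `"1"`) is an assumption. Only the variable names and the silent fallback without an API key come from the text.

```python
import os
from typing import Mapping


def resolve_controller_options(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: derive controller switches from environment variables."""
    # Assumption: cross-run memory is on by default and disabled only by "0".
    memory_enabled = env.get("GOLDENMATCH_AUTOCONFIG_MEMORY", "1") != "0"
    # Assumption: the LLM policy is opt-in via "1".
    llm_requested = env.get("GOLDENMATCH_AUTOCONFIG_LLM", "0") == "1"
    has_api_key = bool(env.get("OPENAI_API_KEY"))
    # Silent fallback: without a key, the request is ignored rather than raising.
    return {"memory": memory_enabled, "llm_fallback": llm_requested and has_api_key}


print(resolve_controller_options({}))
print(resolve_controller_options({"GOLDENMATCH_AUTOCONFIG_LLM": "1"}))
print(resolve_controller_options(
    {"GOLDENMATCH_AUTOCONFIG_LLM": "1", "OPENAI_API_KEY": "placeholder-key"}
))
```

Passing a plain dict instead of `os.environ` keeps the sketch testable without touching the real process environment.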
This is different from the policy fallback above: when the committed profile shows many borderline pairs and an LLM API key is available, the controller decorates the committed config with `LLMScorerConfig`. This is per-pair scoring (O(N) calls per dataset), so your run uses LLM judgment for borderline matches. Calls are bounded by `BudgetConfig(max_calls=500, max_cost_usd=1.0)`.
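
A budget-bounded borderline pass might look like the sketch below. It is an assumption-laden illustration, not the library's scorer. The `BudgetConfig` dataclass here is a local stand-in that only mirrors the two limits quoted above. `score_borderline_pairs`, the fixed `low`/`high` band, and `cost_per_call` are hypothetical; the text says the real bounds adapt to the matchkey's threshold.

```python
from dataclasses import dataclass


@dataclass
class BudgetConfig:
    """Local stand-in mirroring the limits quoted in the text."""
    max_calls: int = 500
    max_cost_usd: float = 1.0


def score_borderline_pairs(pairs, llm_judge, budget, low=0.6, high=0.8,
                           cost_per_call=0.002):
    """Re-score only borderline pairs, stopping once the budget is spent."""
    calls, spent = 0, 0.0
    results = []
    for pair_id, score in pairs:
        is_borderline = low <= score < high
        within_budget = (calls < budget.max_calls
                         and spent + cost_per_call <= budget.max_cost_usd)
        if is_borderline and within_budget:
            score = llm_judge(pair_id)  # one LLM call per borderline pair
            calls += 1
            spent += cost_per_call
        results.append((pair_id, score))
    return results, calls, spent


# Fake judge so the sketch runs offline; a real judge would call an LLM.
verdicts = {"b": 1.0, "c": 0.0, "d": 1.0}
pairs = [("a", 0.95), ("b", 0.70), ("c", 0.65), ("d", 0.72), ("e", 0.10)]
results, calls, spent = score_borderline_pairs(
    pairs, verdicts.get, BudgetConfig(max_calls=2)
)
print(results, calls)
```

With `max_calls=2`, the third borderline pair (`"d"`) keeps its original score: once the budget is exhausted, remaining pairs fall back to the non-LLM score rather than failing.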
Phone-shaped, email-shaped, name-shaped, address-shaped, zip, and state columns now auto-emit StandardizationConfig rules so formatting variations (e.g. "555-1234" vs "(555) 1234") are normalized before scoring.
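
To illustrate why normalizing before scoring matters, here is a small sketch of the kind of transform such rules imply. The helpers `digits_only` and `normalize_email` are hypothetical and written for this example; they are not the `StandardizationConfig` rules themselves.

```python
import re


def digits_only(value: str) -> str:
    """Strip every non-digit character from a phone-shaped value."""
    return re.sub(r"\D", "", value)


def normalize_email(value: str) -> str:
    """Trim surrounding whitespace and lowercase an email-shaped value."""
    return value.strip().lower()


raw_a, raw_b = "555-1234", "(555) 1234"
print(raw_a == raw_b)                            # the raw strings differ
print(digits_only(raw_a) == digits_only(raw_b))  # the normalized values agree
print(normalize_email("  Jane.Doe@Example.COM "))
```

After normalization, an exact scorer sees both phone values as the same string, so formatting noise no longer costs a match.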
| Dataset | F1 | Notes |
|---|---|---|
| DBLP-ACM (cross-source) | 0.964 | Above 0.918 hand-tuned ceiling per Benchmarks |
| Febrl3 (single-source) | 0.944 | 97% of 0.971 hand-tuned ceiling |
| NCVR (corruption GT) | 0.972 | First measurement |
| DQbench ER (no LLM) | 91.04 (composite) | Up from 62.87 hand-tuned-without-LLM (v1.8); see the v1.9–v1.12 progression below |

DQbench ER composite score by version:

| Version | Composite | T1 | T2 | T3 | Headline change |
|---|---|---|---|---|---|
| v1.8 | 62.87 | 89.3 | 58.7 | 53.8 | Introspective controller (baseline) |
| v1.9 | 62.87 | 89.3 | 58.7 | 53.8 | Best-effort commit + virtual-v0 (parity recovery from premise drift) |
| v1.10 | 66.91 | 88.9 | 69.0 | 53.8 | 5 indicators + corruption_normalize (T2 +10.3pp) |
| v1.11 | 66.99 | 88.9 | 69.0 | 53.8 | NE infrastructure shipped; no measurable movement (foundation for v1.12) |
| v1.12 | 91.04 | 89.3 | 97.5 | 85.5 | Path Y: NE on exact matchkeys (T2 +28.5pp, T3 +31.7pp) |
See Benchmarks for full breakdowns.
When an exact matchkey would otherwise emit a false positive (same email shared across distinct entities, etc.), negative_evidence lets the controller subtract a penalty when secondary fields disagree:

```yaml
matchkeys:
  - name: identity_email
    type: exact
    threshold: 0.5  # required when negative_evidence is set
    fields:
      - field: email
        transforms: [lowercase]
        scorer: exact
        weight: 1.0
    negative_evidence:
      - field: phone
        transforms: [digits_only]
        scorer: exact
        threshold: 0.4
        penalty: 0.3
      - field: address
        transforms: []
        scorer: token_sort
        threshold: 0.4
        penalty: 0.4
```

Auto-config populates this automatically: `promote_negative_evidence` walks all matchkeys at config-build time and adds negative evidence (NE) for high-identity-prior columns (email, phone, address, etc.) that aren't already participating positively. On T3-class adversarial datasets, where the same email is shared across distinct people (same name, different phone, different address), those pairs get filtered at the exact-matchkey level via the cumulative NE penalty.
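
The cumulative penalty can be worked through with the numbers from the config above. This is a sketch under assumptions rather than the library's scoring code: `NegativeEvidence` and `apply_negative_evidence` are hypothetical names, and the reading that a secondary field "disagrees" when its similarity falls below its `threshold` is an interpretation of the config, as is treating a missing field as no evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegativeEvidence:
    field: str
    threshold: float  # assumption: similarity below this counts as disagreement
    penalty: float


def apply_negative_evidence(base_score, similarities, rules):
    """Subtract the penalty of every disagreeing secondary field."""
    total_penalty = sum(
        rule.penalty
        for rule in rules
        # Assumption: a field with no similarity value contributes no penalty.
        if similarities.get(rule.field, 1.0) < rule.threshold
    )
    return max(0.0, base_score - total_penalty)


rules = [
    NegativeEvidence("phone", threshold=0.4, penalty=0.3),
    NegativeEvidence("address", threshold=0.4, penalty=0.4),
]
matchkey_threshold = 0.5

# Same email, different phone, similar address: one penalty applies.
one_disagreement = apply_negative_evidence(1.0, {"phone": 0.0, "address": 0.9}, rules)
# Same email, different phone AND different address: both penalties apply.
two_disagreements = apply_negative_evidence(1.0, {"phone": 0.0, "address": 0.1}, rules)

print(one_disagreement >= matchkey_threshold)   # pair survives
print(two_disagreements >= matchkey_threshold)  # pair is filtered
```

With these values, a single disagreement leaves the pair at roughly 0.7, above the 0.5 matchkey threshold, while two disagreements drop it to roughly 0.3 and the pair is filtered. That is why the matchkey needs a `threshold` when `negative_evidence` is set.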
If you want the legacy v0 heuristic only (no iteration, no LLM, no memory):

```python
from goldenmatch.core.autoconfig import _legacy_auto_configure_v0

cfg = _legacy_auto_configure_v0(df)
```

Or pass an explicit config to `dedupe_df`/`match_df` to bypass auto-config entirely.
The controller is implemented in:
- `goldenmatch/core/autoconfig_controller.py`: iteration loop, sample selection, finalize
- `goldenmatch/core/autoconfig_policy.py`: `RefitPolicy` protocol, `HeuristicRefitPolicy`, `LLMRefitPolicy`
- `goldenmatch/core/autoconfig_rules.py`: 14 ordered refit rules (10 + 3 indicator-aware in v1.10 + 1 clustered-identity-guard in v1.11)
- `goldenmatch/core/autoconfig_negative_evidence.py`: eager `promote_negative_evidence` rule + `_pick_scorer_for_column` (v1.11 + v1.12)
- `goldenmatch/core/scorer.py`: `_apply_negative_evidence` (weighted matchkeys, v1.11) + `_apply_negative_evidence_to_exact_pairs` (Path Y post-filter, v1.12)
- `goldenmatch/core/complexity_profile.py`: typed sub-profiles + rollup
- `goldenmatch/core/profile_emitter.py`: thread-local emitter stack
- `goldenmatch/core/autoconfig_history.py`: audit trail
- `goldenmatch/core/autoconfig_memory.py`: cross-run persistence

See Architecture for the broader pipeline.