Auto Config Controller

Auto-Config Controller (v1.8+)

GoldenMatch's auto-config now produces a defensible match/dedupe config the first time, on data shapes it has never been hand-tuned for. No manual blocking-key picks, no scorer-weight tuning, no threshold sweeps. Just hand it a DataFrame.

import goldenmatch as gm
import polars as pl

df = pl.read_csv("customers.csv")
result = gm.dedupe_df(df)        # zero config — controller picks the rest
print(result.clusters)

What it does under the hood

When you call auto_configure_df(df) (directly or via dedupe_df/match_df with no config), the controller:

1. Computes a v0 config via the legacy heuristic (column profiling, blocking-key candidate selection, default scorer weights).
2. Takes a stratified sample (default 2000 rows; full data when n_rows < 5000).
3. Runs the pipeline on the sample under a profile_capture() context, where instrumented stages emit a typed ComplexityProfile:
   - BlockingProfile: block size distribution, reduction ratio, comparison count
   - ScoringProfile: score histogram, dip statistic, mass above/in-borderline, candidates compared, random-pair recall probe
   - ClusterProfile: cluster size distribution, transitivity rate, edge-confidence distribution
   - MatchkeyProfile: per-field post-transform cardinality
   - DataProfile, DomainProfile, ProfileMeta
4. Asks the policy for a refit. The default HeuristicRefitPolicy runs 10 ordered rules:

Rule	Fires when	Action
`rule_blocking_field_null_heavy`	blocking field's null_rate > 0.10	multi-pass on low-null alternate
`rule_blocking_singleton_trap`	blocking produced 0 candidates	swap to `first_token` on dominant text
`rule_blocking_key_swap`	mass_above==0 with prior decision	swap blocking key + drop stale derived matchkeys
`rule_blocking_too_coarse`	block_sizes_p99 > 10× avg	more selective key
`rule_uniform_heavy_blocking`	uniform-large blocks + borderline-heavy	switch to high-cardinality identity column
`rule_unimodal_scoring`	dip_statistic < 0.01	swap scorer to ensemble
`rule_low_reduction_ratio`	reduction_ratio < 0.5	add multi-pass with soundex
`rule_low_transitivity`	transitivity_rate < 0.85	lower threshold by 0.05
`rule_no_matches`	mass_above_threshold == 0.0	permissive baseline
`rule_recall_gap_suspected`	random-pair probe > 0.05 OR over-tight signature	multi-pass on orthogonal field

5. Iterates until the profile is GREEN, the distance to the previous profile converges, or the budget is exhausted (default 3 iterations).
6. Optionally decorates the committed config with LLMScorerConfig when an API key is available and the profile shows borderline-heavy mass; the bounds adapt dynamically to the matchkey's threshold.
7. Finalizes on full data and returns the config, a ComplexityProfile snapshot, and a RunHistory audit trail.
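
The iterate-and-refit loop described above can be sketched in a few lines. This is a minimal illustration, not GoldenMatch's implementation: Config, Profile, fake_pipeline, distance, and control_loop are hypothetical stand-ins, only two of the rules are modelled, and "first rule that fires wins" is an assumption about how the ordered rules are applied. The thresholds (reduction_ratio < 0.5, transitivity_rate < 0.85, lower threshold by 0.05, budget of 3 iterations) come from the text.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:                       # hypothetical stand-in for a match config
    threshold: float = 0.85
    multi_pass: bool = False

@dataclass(frozen=True)
class Profile:                      # hypothetical stand-in for ComplexityProfile
    reduction_ratio: float
    transitivity_rate: float

    def health(self) -> str:
        ok = self.reduction_ratio >= 0.5 and self.transitivity_rate >= 0.85
        return "GREEN" if ok else "RED"

def rule_low_reduction_ratio(profile, cfg):
    if profile.reduction_ratio < 0.5 and not cfg.multi_pass:
        return replace(cfg, multi_pass=True)
    return None

def rule_low_transitivity(profile, cfg):
    if profile.transitivity_rate < 0.85:
        return replace(cfg, threshold=round(cfg.threshold - 0.05, 2))
    return None

RULES = [rule_low_reduction_ratio, rule_low_transitivity]  # order matters

def refit(profile, cfg):
    """Assumption: the first rule that fires decides the refit."""
    for rule in RULES:
        new_cfg = rule(profile, cfg)
        if new_cfg is not None:
            return rule.__name__, new_cfg
    return None

def distance(a, b):
    """Toy profile distance used for the convergence check."""
    return (abs(a.reduction_ratio - b.reduction_ratio)
            + abs(a.transitivity_rate - b.transitivity_rate))

def control_loop(cfg, run_on_sample, max_iterations=3, eps=1e-3):
    history, prev = [], None
    for i in range(max_iterations):
        profile = run_on_sample(cfg)
        health = profile.health()
        stop = (
            health == "GREEN"                                        # healthy
            or (prev is not None and distance(prev, profile) < eps)  # converged
            or i == max_iterations - 1                               # budget spent
        )
        decision = None if stop else refit(profile, cfg)
        history.append((i, health, decision[0] if decision else None))
        if decision is None:
            break
        cfg, prev = decision[1], profile
    return cfg, history

def fake_pipeline(cfg):
    # Toy response model, chosen only so that the loop has something to fix.
    return Profile(
        reduction_ratio=0.9 if cfg.multi_pass else 0.3,
        transitivity_rate=0.9 if cfg.threshold <= 0.80 else 0.7,
    )

final_cfg, history = control_loop(Config(), fake_pipeline)
for iteration, health, rule in history:
    print(iteration, health, rule)
```

In this toy run the first iteration fires the reduction-ratio rule, the second fires the transitivity rule, and the third profile is GREEN, so the loop commits within the default budget.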

Inspecting the audit trail

result = gm.dedupe_df(df)
history = result.postflight_report.controller_history
print(f"iterations: {history.iteration}")
for entry in history.entries:
    print(f"  iter {entry.iteration}: {entry.profile.health()}")
    if entry.decision:
        print(f"     rule fired: {entry.decision.rule_name}")
        print(f"     rationale:  {entry.decision.rationale}")
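
For logging or assertions in tests, the same attributes can be flattened into plain dictionaries. The sketch below relies only on the attribute names used in the snippet above (entries, iteration, profile.health(), decision.rule_name, decision.rationale); summarize_history is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the example runs without GoldenMatch installed.

```python
from types import SimpleNamespace

def summarize_history(history):
    """Flatten an audit trail into one dict per controller iteration."""
    rows = []
    for entry in history.entries:
        decision = entry.decision
        rows.append({
            "iteration": entry.iteration,
            "health": entry.profile.health(),
            "rule": decision.rule_name if decision else None,
            "rationale": decision.rationale if decision else None,
        })
    return rows

def _entry(i, health, rule=None, why=None):
    # Stand-in for a real history entry; not a GoldenMatch type.
    decision = SimpleNamespace(rule_name=rule, rationale=why) if rule else None
    profile = SimpleNamespace(health=lambda: health)
    return SimpleNamespace(iteration=i, profile=profile, decision=decision)

fake_history = SimpleNamespace(entries=[
    _entry(0, "RED", "rule_low_transitivity", "transitivity_rate < 0.85"),
    _entry(1, "GREEN"),
])

rows = summarize_history(fake_history)
for row in rows:
    print(row)
```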

Cross-run memory

By default, past committed configs persist in ~/.goldenmatch/autoconfig_memory.db. When you re-run on data with the same shape, the controller starts from the cached config, but it still iterates on the new sample to verify that the config reaches GREEN. To opt out:

GOLDENMATCH_AUTOCONFIG_MEMORY=0 python my_script.py
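
The same switch can be set from inside a script or notebook. The source does not say when GoldenMatch reads the variable, so this sketch assumes it is read no earlier than import time and sets it before the import to be safe.

```python
import os

# Assumption: the variable is read at import time or later, so it is set
# before `import goldenmatch`.
os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"

# import goldenmatch as gm        # import only after the variable is set
# result = gm.dedupe_df(df)       # runs without cross-run memory
```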

LLM policy fallback (opt-in)

When the heuristic rules can't reach GREEN on tricky data, optionally enable an LLM-driven policy that proposes a config diff:

GOLDENMATCH_AUTOCONFIG_LLM=1 OPENAI_API_KEY=sk-... python my_script.py

The LLM policy makes O(1) calls per controller iteration (typically 0–2 calls per auto_configure_df run). It falls back silently when no API key is set.

Per-pair LLM scoring (auto-enable)

This is different from the policy fallback above: when the committed profile shows many borderline pairs AND an LLM API key is available, the controller decorates the committed config with LLMScorerConfig. This is per-pair scoring (O(N) calls per dataset), so your run uses LLM judgment for borderline matches. Calls are bounded by BudgetConfig(max_calls=500, max_cost_usd=1.0).
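
To make the budget bound concrete, here is a minimal sketch of a call-and-cost guard. CallBudget and score_borderline are hypothetical names, not GoldenMatch's BudgetConfig, and the behaviour once the budget is spent (falling back to a non-LLM score) is an assumption. Only the default limits (500 calls, 1.0 USD) come from the text.

```python
from dataclasses import dataclass

@dataclass
class CallBudget:                   # hypothetical; not GoldenMatch's BudgetConfig
    max_calls: int = 500
    max_cost_usd: float = 1.0
    calls: int = 0
    cost_usd: float = 0.0

    def try_spend(self, cost_usd: float) -> bool:
        """Record one call if it fits within both limits; otherwise refuse."""
        if self.calls + 1 > self.max_calls:
            return False
        if self.cost_usd + cost_usd > self.max_cost_usd:
            return False
        self.calls += 1
        self.cost_usd += cost_usd
        return True

def score_borderline(pairs, llm_score, fallback_score, budget, cost_per_call):
    """Use the LLM scorer while budget remains, then the fallback scorer."""
    scores = []
    for pair in pairs:
        if budget.try_spend(cost_per_call):
            scores.append(llm_score(pair))
        else:
            scores.append(fallback_score(pair))  # assumed behaviour when capped
    return scores

budget = CallBudget(max_calls=3, max_cost_usd=1.0)
scores = score_borderline(
    pairs=list(range(5)),
    llm_score=lambda pair: "llm",
    fallback_score=lambda pair: "fallback",
    budget=budget,
    cost_per_call=0.25,
)
print(scores, budget.calls, budget.cost_usd)
```

With a cap of three calls, the first three pairs go to the LLM scorer and the remaining two use the fallback, so the number of LLM calls stays bounded regardless of how many borderline pairs the dataset produces.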

Standardization auto-detection

Phone-shaped, email-shaped, name-shaped, address-shaped, zip, and state columns now auto-emit StandardizationConfig rules so formatting variations (e.g. "555-1234" vs "(555) 1234") are normalized before scoring.
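
As an illustration of why normalization before scoring matters, the sketch below shows the two phone formats from the example collapsing to the same value. The functions are plain Python written for this example, not GoldenMatch's StandardizationConfig rules.

```python
import re

def digits_only(value: str) -> str:
    """Strip everything except digits, e.g. for phone-shaped columns."""
    return re.sub(r"\D", "", value)

def normalize_email(value: str) -> str:
    """Trim whitespace and lowercase, e.g. for email-shaped columns."""
    return value.strip().lower()

a = digits_only("555-1234")
b = digits_only("(555) 1234")
print(a, b, a == b)
print(normalize_email("  Pat@Example.COM "))
```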

Verified accuracy (zero-config)

Dataset	F1	Notes
DBLP-ACM (cross-source)	0.964	Above 0.918 hand-tuned ceiling per Benchmarks
Febrl3 (single-source)	0.944	97% of 0.971 hand-tuned ceiling
NCVR (corruption GT)	0.972	First measurement
DQbench ER (no LLM)	91.04 (composite score)	Up from 62.87 hand-tuned-without-LLM (v1.8); see v1.9–v1.12 progression below

DQbench progression (v1.8 → v1.12, no LLM)

Version	Composite	T1	T2	T3	Headline change
v1.8	62.87	89.3	58.7	53.8	Introspective controller (baseline)
v1.9	62.87	89.3	58.7	53.8	Best-effort commit + virtual-v0 (parity recovery from premise drift)
v1.10	66.91	88.9	69.0	53.8	5 indicators + corruption_normalize (T2 +10.3pp)
v1.11	66.99	88.9	69.0	53.8	NE infrastructure shipped; no measurable movement (foundation for v1.12)
v1.12	91.04	89.3	97.5	85.5	Path Y: NE on exact matchkeys (T2 +28.5pp, T3 +31.7pp)

See Benchmarks for full breakdowns.

Negative evidence (v1.11 + v1.12)

When an exact matchkey would otherwise emit a false positive (same email shared across distinct entities, etc.), negative_evidence lets the controller subtract a penalty when secondary fields disagree:

matchkeys:
  - name: identity_email
    type: exact
    threshold: 0.5    # required when negative_evidence is set
    fields:
      - field: email
        transforms: [lowercase]
        scorer: exact
        weight: 1.0
    negative_evidence:
      - field: phone
        transforms: [digits_only]
        scorer: exact
        threshold: 0.4
        penalty: 0.3
      - field: address
        transforms: []
        scorer: token_sort
        threshold: 0.4
        penalty: 0.4

Auto-config populates this automatically: promote_negative_evidence walks all matchkeys at config-build time and adds NE for high-identity-prior columns (email, phone, address, etc.) that aren't already participating positively. On T3-class adversarial datasets, where the same email is shared across distinct people (same name, different phone, different address), the resulting false-positive pairs are filtered at the exact-matchkey level via the cumulative NE penalty.
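
The cumulative penalty can be worked through with the numbers from the YAML above. This is a sketch under stated assumptions, not the library's scoring code: an exact matchkey hit is assumed to start at a score of 1.0, a penalty is assumed to apply when a field's similarity falls below its NE threshold, missing values are assumed not to count as disagreement, and token_overlap is a simple token-set Jaccard stand-in for the token_sort scorer.

```python
import re

def digits_only(value):
    return re.sub(r"\D", "", value)

def exact(a, b):
    return 1.0 if a == b else 0.0

def token_overlap(a, b):
    """Token-set Jaccard similarity; a stand-in for token_sort."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Mirrors the negative_evidence entries in the YAML example.
NEGATIVE_EVIDENCE = [
    {"field": "phone", "transform": digits_only, "scorer": exact,
     "threshold": 0.4, "penalty": 0.3},
    {"field": "address", "transform": lambda v: v, "scorer": token_overlap,
     "threshold": 0.4, "penalty": 0.4},
]

def score_exact_pair(a, b, matchkey_threshold=0.5):
    score = 1.0  # assumption: the exact matchkey (email) already matched
    for ne in NEGATIVE_EVIDENCE:
        va, vb = a.get(ne["field"]), b.get(ne["field"])
        if not va or not vb:
            continue  # assumption: missing values are not disagreement
        sim = ne["scorer"](ne["transform"](va), ne["transform"](vb))
        if sim < ne["threshold"]:
            score -= ne["penalty"]
    return score, score >= matchkey_threshold

base = {"email": "pat@example.com", "phone": "555-1234",
        "address": "12 Oak Street Springfield"}
same_person = {"email": "pat@example.com", "phone": "(555) 1234",
               "address": "12 Oak St Springfield"}
moved_house = {"email": "pat@example.com", "phone": "555-1234",
               "address": "890 Maple Avenue Shelbyville"}
other_person = {"email": "pat@example.com", "phone": "555-9999",
                "address": "890 Maple Avenue Shelbyville"}

for label, rec in [("same_person", same_person),
                   ("moved_house", moved_house),
                   ("other_person", other_person)]:
    print(label, score_exact_pair(base, rec))
```

With these penalties a single disagreement (0.3 or 0.4) leaves the score at or above the 0.5 matchkey threshold, while phone and address disagreeing together push it below, which is the "cumulative" behaviour described above.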

Disabling the controller

If you want the legacy v0 heuristic only (no iteration, no LLM, no memory):

from goldenmatch.core.autoconfig import _legacy_auto_configure_v0
cfg = _legacy_auto_configure_v0(df)

Or pass an explicit config to dedupe_df/match_df to bypass auto-config entirely.

Architecture

The controller is implemented in:

goldenmatch/core/autoconfig_controller.py — iteration loop, sample selection, finalize
goldenmatch/core/autoconfig_policy.py — RefitPolicy protocol, HeuristicRefitPolicy, LLMRefitPolicy
goldenmatch/core/autoconfig_rules.py — 14 ordered refit rules (10 + 3 indicator-aware in v1.10 + 1 clustered-identity-guard in v1.11)
goldenmatch/core/autoconfig_negative_evidence.py — eager promote_negative_evidence rule + _pick_scorer_for_column (v1.11+v1.12)
goldenmatch/core/scorer.py — _apply_negative_evidence (weighted matchkeys, v1.11) + _apply_negative_evidence_to_exact_pairs (Path Y post-filter, v1.12)
goldenmatch/core/complexity_profile.py — typed sub-profiles + rollup
goldenmatch/core/profile_emitter.py — thread-local emitter stack
goldenmatch/core/autoconfig_history.py — audit trail
goldenmatch/core/autoconfig_memory.py — cross-run persistence

See Architecture for the broader pipeline.

⚡ GoldenMatch — Entity resolution toolkit | PyPI | GitHub | Open in Colab | MIT License
