Auto Config Controller

benzsevern edited this page May 10, 2026 · 2 revisions

Auto-Config Controller (v1.8+)

GoldenMatch's auto-config now produces a defensible match/dedupe config the first time, on data shapes it has never been hand-tuned for. No manual blocking-key picks, no scorer-weight tuning, no threshold sweeps. Just hand it a DataFrame.

import goldenmatch as gm
import polars as pl

df = pl.read_csv("customers.csv")
result = gm.dedupe_df(df)        # zero config — controller picks the rest
print(result.clusters)

What it does under the hood

When you call auto_configure_df(df) (directly or via dedupe_df/match_df with no config), the controller:

  1. Computes a v0 config via the legacy heuristic (column profiling, blocking-key candidate selection, default scorer weights).
  2. Takes a stratified sample (default 2000 rows; full data when n_rows < 5000).
  3. Runs the pipeline on the sample under a profile_capture() context, where instrumented stages emit a typed ComplexityProfile:
    • BlockingProfile: block size distribution, reduction ratio, comparison count
    • ScoringProfile: score histogram, dip statistic, mass above/in-borderline, candidates compared, random-pair recall probe
    • ClusterProfile: cluster size distribution, transitivity rate, edge-confidence distribution
    • MatchkeyProfile: per-field post-transform cardinality
    • DataProfile, DomainProfile, ProfileMeta
  4. Asks the policy for a refit. The default HeuristicRefitPolicy runs 10 ordered rules (later versions add more; see Architecture):
| Rule | Fires when | Action |
|---|---|---|
| rule_blocking_field_null_heavy | blocking field's null_rate > 0.10 | multi-pass on low-null alternate |
| rule_blocking_singleton_trap | blocking produced 0 candidates | swap to first_token on dominant text |
| rule_blocking_key_swap | mass_above==0 with prior decision | swap blocking key + drop stale derived matchkeys |
| rule_blocking_too_coarse | block_sizes_p99 > 10× avg | more selective key |
| rule_uniform_heavy_blocking | uniform-large blocks + borderline-heavy | switch to high-cardinality identity column |
| rule_unimodal_scoring | dip_statistic < 0.01 | swap scorer to ensemble |
| rule_low_reduction_ratio | reduction_ratio < 0.5 | add multi-pass with soundex |
| rule_low_transitivity | transitivity_rate < 0.85 | lower threshold by 0.05 |
| rule_no_matches | mass_above_threshold == 0.0 | permissive baseline |
| rule_recall_gap_suspected | random-pair probe > 0.05 OR over-tight signature | multi-pass on orthogonal field |
  5. Iterates until the profile is GREEN, the distance to the previous profile converges, or the budget is exhausted (default 3 iterations).
  6. Optionally decorates the committed config with LLMScorerConfig when an API key is available and the profile shows borderline-heavy mass; the bounds adapt dynamically to the matchkey's threshold.
  7. Finalizes on full data and returns the config + a ComplexityProfile snapshot + a RunHistory audit trail.
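
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not GoldenMatch's implementation: `Config`, `Profile`, `RULES`, `sample_size`, `refit_loop`, and `fake_run` are hypothetical names; only three rules from the table are mimicked; "first rule that fires wins" is an assumed reading of "ordered rules"; "GREEN" is approximated as "no rule fires"; and the convergence-by-distance stop condition is left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class Config:                      # hypothetical, heavily simplified config
    threshold: float = 0.85
    multi_pass: bool = False

@dataclass(frozen=True)
class Profile:                     # hypothetical slice of a ComplexityProfile
    reduction_ratio: float
    transitivity_rate: float
    mass_above_threshold: float

Rule = Callable[[Profile, Config], Optional[Config]]

def rule_low_reduction_ratio(p: Profile, c: Config) -> Optional[Config]:
    if p.reduction_ratio < 0.5 and not c.multi_pass:
        return replace(c, multi_pass=True)          # "add multi-pass"
    return None

def rule_low_transitivity(p: Profile, c: Config) -> Optional[Config]:
    if p.transitivity_rate < 0.85:
        return replace(c, threshold=round(c.threshold - 0.05, 2))
    return None

def rule_no_matches(p: Profile, c: Config) -> Optional[Config]:
    if p.mass_above_threshold == 0.0:
        return Config(threshold=0.5, multi_pass=True)  # assumed "permissive baseline"
    return None

RULES: List[Rule] = [rule_low_reduction_ratio, rule_low_transitivity, rule_no_matches]

def sample_size(n_rows: int, default: int = 2000, full_below: int = 5000) -> int:
    # Step 2: use the full data for small inputs, otherwise the default sample size.
    return n_rows if n_rows < full_below else default

def refit_loop(config: Config, run_on_sample: Callable[[Config], Profile],
               max_iterations: int = 3) -> Tuple[Config, list]:
    history = []
    for i in range(max_iterations):
        profile = run_on_sample(config)
        fired, new_config = None, None
        for rule in RULES:                          # ordered: first match wins (assumption)
            new_config = rule(profile, config)
            if new_config is not None:
                fired = rule.__name__
                break
        history.append((i, fired))
        if fired is None:                           # nothing left to fix: treat as GREEN
            break
        config = new_config
    return config, history

def fake_run(config: Config) -> Profile:
    # Toy pipeline: multi-pass fixes the reduction ratio,
    # and a threshold of 0.80 or lower fixes transitivity.
    return Profile(
        reduction_ratio=0.9 if config.multi_pass else 0.3,
        transitivity_rate=0.9 if config.threshold <= 0.80 else 0.7,
        mass_above_threshold=0.2,
    )

final, history = refit_loop(Config(), fake_run)
for iteration, rule_name in history:
    print(iteration, rule_name or "GREEN")
```

In this toy run the loop uses its full three-iteration budget: one rule fires in each of the first two iterations, and the third finds nothing left to fix.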

Inspecting the audit trail

result = gm.dedupe_df(df)
history = result.postflight_report.controller_history
print(f"iterations: {history.iteration}")
for entry in history.entries:
    print(f"  iter {entry.iteration}: {entry.profile.health()}")
    if entry.decision:
        print(f"     rule fired: {entry.decision.rule_name}")
        print(f"     rationale:  {entry.decision.rationale}")

Cross-run memory

By default, past committed configs persist in ~/.goldenmatch/autoconfig_memory.db. When you re-run on data with the same shape, the controller starts from the cached config, but still verifies that it reaches GREEN on the new sample. To opt out:

GOLDENMATCH_AUTOCONFIG_MEMORY=0 python my_script.py
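
The same opt-out can probably be applied from inside a script. This is a minimal sketch, assuming the variable is read from the process environment at import or call time; only the shell form above is documented, so the variable is set before the import to be safe.

```python
import os

# Assumption: GoldenMatch reads this variable from the process environment,
# so it is set before goldenmatch is imported.
os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"

# import goldenmatch as gm          # import only after the variable is set
# result = gm.dedupe_df(df)         # df: your polars DataFrame
```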

LLM policy fallback (opt-in)

When the heuristic rules can't reach GREEN on tricky data, optionally enable an LLM-driven policy that proposes a config diff:

GOLDENMATCH_AUTOCONFIG_LLM=1 OPENAI_API_KEY=sk-... python my_script.py

The LLM policy makes O(1) calls per controller iteration (typically 0–2 calls per auto_configure_df). If no API key is set, it falls back silently.
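
The opt-in condition can be pictured as a small gate. The sketch below is illustrative only: `llm_fallback_enabled` is a hypothetical helper, and treating "falls back silently" as "keep using the heuristic policy" is an assumption.

```python
import os
from typing import Mapping

def llm_fallback_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical gate: the LLM policy is eligible only when it is
    explicitly requested AND an API key is present."""
    requested = env.get("GOLDENMATCH_AUTOCONFIG_LLM") == "1"
    has_key = bool(env.get("OPENAI_API_KEY"))
    return requested and has_key

# Requested but no key: assumed to stay on the heuristic policy, with no error.
print(llm_fallback_enabled({"GOLDENMATCH_AUTOCONFIG_LLM": "1"}))
```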

Per-pair LLM scoring (auto-enable)

This is different from the policy fallback above: when the committed profile shows lots of borderline pairs AND an LLM API key is available, the controller decorates the committed config with LLMScorerConfig. This is per-pair scoring (O(N) calls per dataset), so your run uses LLM judgment for borderline matches. Calls are bounded by BudgetConfig(max_calls=500, max_cost_usd=1.0).
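
To make the budget bound concrete, here is a small sketch of a call-and-cost guard. It is not the library's BudgetConfig: `CallBudget`, `score_borderline`, the per-call cost, and the choice to keep the original score once the budget runs out are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[int, int]

@dataclass
class CallBudget:
    """Hypothetical guard mirroring max_calls=500, max_cost_usd=1.0."""
    max_calls: int = 500
    max_cost_usd: float = 1.0
    calls: int = 0
    cost_usd: float = 0.0

    def try_spend(self, cost_usd: float) -> bool:
        # Refuse the call if either limit would be exceeded.
        if self.calls + 1 > self.max_calls:
            return False
        if self.cost_usd + cost_usd > self.max_cost_usd:
            return False
        self.calls += 1
        self.cost_usd += cost_usd
        return True

def score_borderline(pairs: Iterable[Tuple[Pair, float]], budget: CallBudget,
                     llm_score: Callable[[Pair], float],
                     cost_per_call: float = 0.001) -> Dict[Pair, float]:
    scored = {}
    for pair, base_score in pairs:
        if budget.try_spend(cost_per_call):
            scored[pair] = llm_score(pair)   # LLM judgment while budget lasts
        else:
            scored[pair] = base_score        # assumed fallback: keep original score
    return scored

budget = CallBudget()
pairs = [((i, i + 1), 0.6) for i in range(600)]
scored = score_borderline(pairs, budget, llm_score=lambda pair: 0.9)
print(budget.calls)   # capped by max_calls, not by the 600 candidate pairs
```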

Standardization auto-detection

Phone-shaped, email-shaped, name-shaped, address-shaped, zip, and state columns now auto-emit StandardizationConfig rules so formatting variations (e.g. "555-1234" vs "(555) 1234") are normalized before scoring.
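
As an illustration of why standardization helps, the sketch below normalizes the two phone formats from the example to the same string. `digits_only` here is a stand-in written for this example (the name matches the transform used elsewhere on this page, but the implementation is an assumption), and `normalize_email` is hypothetical.

```python
import re

def digits_only(value: str) -> str:
    # Drop everything that is not a digit.
    return re.sub(r"\D", "", value)

def normalize_email(value: str) -> str:
    # Hypothetical email rule: trim whitespace and lowercase.
    return value.strip().lower()

print(digits_only("555-1234"))      # 5551234
print(digits_only("(555) 1234"))    # 5551234
print(normalize_email("  Jane.Doe@Example.COM "))
```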

Verified accuracy (zero-config)

| Dataset | F1 | Notes |
|---|---|---|
| DBLP-ACM (cross-source) | 0.964 | Above 0.918 hand-tuned ceiling per Benchmarks |
| Febrl3 (single-source) | 0.944 | 97% of 0.971 hand-tuned ceiling |
| NCVR (corruption GT) | 0.972 | First measurement |
| DQbench ER (no LLM) | 91.04 (composite) | Up from 62.87 hand-tuned-without-LLM (v1.8); see v1.9–v1.12 progression below |

DQbench progression (v1.8 → v1.12, no LLM)

| Version | Composite | T1 | T2 | T3 | Headline change |
|---|---|---|---|---|---|
| v1.8 | 62.87 | 89.3 | 58.7 | 53.8 | Introspective controller (baseline) |
| v1.9 | 62.87 | 89.3 | 58.7 | 53.8 | Best-effort commit + virtual-v0 (parity recovery from premise drift) |
| v1.10 | 66.91 | 88.9 | 69.0 | 53.8 | 5 indicators + corruption_normalize (T2 +10.3pp) |
| v1.11 | 66.99 | 88.9 | 69.0 | 53.8 | NE infrastructure shipped; no measurable movement (foundation for v1.12) |
| v1.12 | 91.04 | 89.3 | 97.5 | 85.5 | Path Y: NE on exact matchkeys (T2 +28.5pp, T3 +31.7pp) |

See Benchmarks for full breakdowns.

Negative evidence (v1.11 + v1.12)

When an exact matchkey would otherwise emit a false positive (same email shared across distinct entities, etc.), negative_evidence lets the controller subtract a penalty when secondary fields disagree:

matchkeys:
  - name: identity_email
    type: exact
    threshold: 0.5    # required when negative_evidence is set
    fields:
      - field: email
        transforms: [lowercase]
        scorer: exact
        weight: 1.0
    negative_evidence:
      - field: phone
        transforms: [digits_only]
        scorer: exact
        threshold: 0.4
        penalty: 0.3
      - field: address
        transforms: []
        scorer: token_sort
        threshold: 0.4
        penalty: 0.4

Auto-config populates this section for you: promote_negative_evidence walks all matchkeys at config-build time and adds NE for high-identity-prior columns (email/phone/address etc.) that aren't already participating positively. T3-class adversarial datasets, where the same email is shared across distinct people (same name, different phone, different address), get filtered at the exact matchkey level via the cumulative NE penalty.
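
The cumulative penalty can be worked through with the numbers from the YAML above. The sketch is a simplified model, not the library's scorer: it assumes an exact-matchkey hit starts at a score of 1.0, that an NE field "disagrees" when its similarity falls below that field's threshold, and that the pair survives only if the final score is at least the matchkey threshold (0.5). `token_overlap` is a token-set Jaccard used as a stand-in for token_sort, and all function names are hypothetical.

```python
import re

def digits_only(value: str) -> str:
    return re.sub(r"\D", "", value)

def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def token_overlap(a: str, b: str) -> float:
    # Stand-in for token_sort: Jaccard overlap of lowercase word tokens.
    ta, tb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

# Mirrors the negative_evidence entries in the YAML example.
NEGATIVE_EVIDENCE = [
    {"field": "phone", "transform": digits_only, "scorer": exact,
     "threshold": 0.4, "penalty": 0.3},
    {"field": "address", "transform": lambda v: v, "scorer": token_overlap,
     "threshold": 0.4, "penalty": 0.4},
]

def score_exact_pair(a: dict, b: dict, matchkey_threshold: float = 0.5):
    score = 1.0   # assumed starting score: both records share the same email
    for ne in NEGATIVE_EVIDENCE:
        left = ne["transform"](a[ne["field"]])
        right = ne["transform"](b[ne["field"]])
        if ne["scorer"](left, right) < ne["threshold"]:
            score -= ne["penalty"]          # secondary field disagrees
    return score, score >= matchkey_threshold

base = {"email": "info@acme.example", "phone": "555-0100",
        "address": "12 Oak Street Springfield"}
same_person = {"email": "info@acme.example", "phone": "(555) 0100",
               "address": "12 Oak Street, Springfield"}
new_phone = {"email": "info@acme.example", "phone": "555-0199",
             "address": "12 Oak Street Springfield"}
other_person = {"email": "info@acme.example", "phone": "555-0199",
                "address": "980 Harbor Boulevard Portland"}

for label, record in [("same", same_person), ("new phone", new_phone),
                      ("other", other_person)]:
    print(label, score_exact_pair(base, record))
```

Under these assumptions a single disagreement (penalty 0.3 or 0.4) leaves the pair above 0.5, while disagreement on both phone and address drops it below the threshold, which matches the "cumulative NE penalty" behaviour described above.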

Disabling the controller

If you want the legacy v0 heuristic only (no iteration, no LLM, no memory):

from goldenmatch.core.autoconfig import _legacy_auto_configure_v0
cfg = _legacy_auto_configure_v0(df)

Or pass an explicit config to dedupe_df/match_df to bypass auto-config entirely.

Architecture

The controller is implemented in:

  • goldenmatch/core/autoconfig_controller.py: iteration loop, sample selection, finalize
  • goldenmatch/core/autoconfig_policy.py: RefitPolicy protocol, HeuristicRefitPolicy, LLMRefitPolicy
  • goldenmatch/core/autoconfig_rules.py: 14 ordered refit rules (10 + 3 indicator-aware in v1.10 + 1 clustered-identity-guard in v1.11)
  • goldenmatch/core/autoconfig_negative_evidence.py: eager promote_negative_evidence rule + _pick_scorer_for_column (v1.11+v1.12)
  • goldenmatch/core/scorer.py: _apply_negative_evidence (weighted matchkeys, v1.11) + _apply_negative_evidence_to_exact_pairs (Path Y post-filter, v1.12)
  • goldenmatch/core/complexity_profile.py: typed sub-profiles + rollup
  • goldenmatch/core/profile_emitter.py: thread-local emitter stack
  • goldenmatch/core/autoconfig_history.py: audit trail
  • goldenmatch/core/autoconfig_memory.py: cross-run persistence

See Architecture for the broader pipeline.
