Commit cc8ba4c

feat(js): auto-config verification layer (preflight + postflight) for goldenmatch-js v0.3.0 (#48)
* feat(js-profiler): add confidence field to ColumnProfile

  Prerequisite for upcoming autoconfig verification layer (Check 6 + weight cap in classifier fixes). Matches Python's confidence semantics: 0.9 (both heuristics agree), 0.7 (one heuristic), 0.3 (string fallthrough).

* feat(js-autoconfig): cardinality guard for identifier classification

* feat(js-autoconfig): extend ID_NAME_PATTERNS for voter_reg_num / account_no / guid / uuid

* feat(js-autoconfig): add year col_type (blocking signal for biblio data)

* feat(js-autoconfig): multi_name col_type for delimited author-style fields

* feat(js-autoconfig): cap weight at 0.3 for low-confidence fields

* feat(js-verify): add autoconfigVerify module skeleton with types

  Types, dataclasses, ConfigValidationError, makePreflightReport factory, stripConventionPrivate utility. preflight/postflight bodies stubbed; real implementations land in Phase 2 and Phase 3.

* feat(js-verify): add underscore escape-hatch fields + postflightReport to results

  GoldenMatchConfig gains _preflightReport / _strictAutoconfig / _domainProfile as non-readonly optional fields (minimum escape hatch for strict-readonly config; see spec §7). DedupeResult and MatchResult gain optional readonly postflightReport. List is closed — future internal state uses side-table pattern instead.

* feat(js-verify): re-export autoconfigVerify public symbols

* feat(js-verify): preflight Check 1 — column resolution + domain auto-repair

  - Add DOMAIN_EXTRACTED_COLS constant to domain.ts (__brand__, __model__, __version__) so preflight can distinguish 'missing but producible' column references from hard errors.
  - Replace preflight stub with a real implementation that walks every matchkey + blocking key reference, flags anything not present in the first row, and auto-repairs config.domain when _domainProfile is stashed on the config and a __<col>__ reference turns up.
  - Keep types.ts -> autoconfigVerify.ts runtime cycle broken by using only 'import type' from types.ts inside autoconfigVerify.ts.

* feat(js-verify): preflight Checks 2 & 3 — cardinality bounds on exact matchkeys

  Drop exact matchkeys whose referenced column is either near-unique (cardinality_ratio >= 0.99 — every row its own block) or near-constant (ratio <= 0.01 — one giant block). Both drops are repaired warnings. If the drops empty the matchkey list, emit a no_matchkeys_remain error so ConfigValidationError fires downstream instead of producing a silent no-op run.

* feat(js-verify): preflight Check 4 — block-size sanity

  Walks every BlockingKeyConfig, groups a <= 10k-row sample by raw field concatenation, and warns when the p99 block size exceeds 5000 (scoring will be slow) or the p50 is <2 (blocking is too selective, most records end up alone). Raw-value grouping is a coarse proxy for the real transform-applied blocker but catches the typical 'everyone has the same state' / 'ID column used as blocker' failures.

* feat(js-verify): preflight Check 5 — demote remote-asset scorers

  Walks every matchkey field and demotes/drops scorers that depend on remote model downloads:

  - 'embedding' -> 'ensemble' (in-place scorer swap)
  - 'record_embedding' fields dropped entirely
  - weighted matchkey rerank=true -> false

  Skipped entirely when allowRemoteAssets=true or llmScorer.enabled=true (caller has already committed to remote round-trips). Matchkeys that lose all their fields to the demotion emit remote_asset_matchkey_empty and are removed.

* feat(js-verify): preflight Check 6 — cap weight for low-confidence fields

  Looks up each weighted matchkey field against the passed-in profiles (optional — no-op when profiles absent). If the classifier confidence is <0.5 and the configured weight is >0.5, cap the weight at 0.5. This keeps a field that was classified with low confidence from dominating the weighted score just because it happened to land with a high weight during autoconfig construction.

* feat(js-verify): integrate preflight into autoConfigure(Rows)

  - Extend AutoconfigOptions with optional strict + allowRemoteAssets flags.
  - Detect domain from columns and stash _domainProfile on the config when confidence >0.7, so preflight Check 1 can auto-repair __<col>__ references.
  - Run preflight at the end of autoConfigureRows with the profiled ColumnProfile list as 'profiles' and the caller's allowRemoteAssets setting; throw ConfigValidationError on unrepairable errors.
  - Stamp _preflightReport on the returned config; stamp _strictAutoconfig only when strict=true.
  - Update two pre-existing autoconfig tests to reflect preflight's new effect: exact_email / exact_phone matchkeys built from 100%-unique fixtures are now dropped as cardinality_high (repaired warning in the preflight report, weighted_identity still present).

* feat(js-verify): postflight score histogram + bimodality threshold nudge

* feat(js-verify): blocking-recall signal (deferred sentinel)

* feat(js-verify): preliminary cluster-size signal with bottleneck pair

* feat(js-verify): postflight threshold-band overlap signal

* feat(js-verify): postflight orchestrator complete + signals-schema contract test

* feat(js-verify): run postflight in runDedupe/runMatch pipelines

  _applyPostflight helper shared between both pipelines (mirrors Python's _apply_postflight). isPreflightReport guard rejects stale/wrong-type objects. Threshold adjustments apply to pairScores before clustering; empty-pair case logs an advisory. runMatchPipeline threads postflightReport through from the delegated runDedupePipeline result.

* test(parity): export Python autoconfigVerify fixtures for TS parity

* test(js-verify): parity harness vs Python autoconfigVerify fixtures

* test(js-verify): property-based invariants for preflight / postflight

* docs(js-examples): verificationInspection + strictModeParity examples

* release: goldenmatch-js v0.3.0 — autoconfig verification layer

  README: new Verification section. Version: 0.1.0 -> 0.3.0. JSDoc: every exported symbol in autoconfigVerify.ts documented.
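
Review note: the cardinality bounds in Checks 2 & 3 can be pictured with a minimal sketch like the one below. It is not the code from autoconfigVerify.ts; the types `ExactMatchkey` and `CardinalityFinding` and the function names are hypothetical stand-ins. Only the 0.99 / 0.01 thresholds and the three finding names come from the commit message.

```typescript
// Hypothetical shapes for illustration only; not the goldenmatch-js types.
type Row = Record<string, unknown>;

interface ExactMatchkey {
  name: string;
  field: string;
}

interface CardinalityFinding {
  check: "cardinality_high" | "cardinality_low" | "no_matchkeys_remain";
  severity: "warning" | "error";
  subject: string;
  repaired: boolean;
}

// Ratio of distinct values to rows for one column (0 for an empty sample).
function cardinalityRatio(rows: Row[], field: string): number {
  if (rows.length === 0) return 0;
  const distinct = new Set(rows.map((r) => String(r[field])));
  return distinct.size / rows.length;
}

// Drop near-unique (>= 0.99) and near-constant (<= 0.01) exact matchkeys,
// and report an error when nothing survives.
function checkCardinality(
  rows: Row[],
  matchkeys: ExactMatchkey[],
): { kept: ExactMatchkey[]; findings: CardinalityFinding[] } {
  const kept: ExactMatchkey[] = [];
  const findings: CardinalityFinding[] = [];
  for (const mk of matchkeys) {
    const ratio = cardinalityRatio(rows, mk.field);
    if (ratio >= 0.99) {
      findings.push({ check: "cardinality_high", severity: "warning", subject: mk.name, repaired: true });
    } else if (ratio <= 0.01) {
      findings.push({ check: "cardinality_low", severity: "warning", subject: mk.name, repaired: true });
    } else {
      kept.push(mk);
    }
  }
  if (matchkeys.length > 0 && kept.length === 0) {
    findings.push({ check: "no_matchkeys_remain", severity: "error", subject: "matchkeys", repaired: false });
  }
  return { kept, findings };
}
```

Returning the surviving matchkeys next to the findings keeps the repair and its explanation together, which is roughly what a "repaired warning" implies.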
1 parent 1ced18d commit cc8ba4c

27 files changed

Lines changed: 4117 additions & 35 deletions

packages/goldenmatch-js/README.md

Lines changed: 54 additions & 2 deletions
@@ -9,7 +9,7 @@ npm install goldenmatch
 [![npm](https://img.shields.io/npm/v/goldenmatch?color=d4a017)](https://www.npmjs.com/package/goldenmatch)
 [![Node](https://img.shields.io/node/v/goldenmatch?color=339933)](https://nodejs.org/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green)](https://github.com/benzsevern/goldenmatch/blob/main/LICENSE)
-[![Tests](https://img.shields.io/badge/tests-478%20passing-brightgreen)](https://github.com/benzsevern/goldenmatch/tree/main/packages/goldenmatch-js/tests)
+[![Tests](https://img.shields.io/badge/tests-590%20passing-brightgreen)](https://github.com/benzsevern/goldenmatch/tree/main/packages/goldenmatch-js/tests)
 
 ---
 
@@ -18,7 +18,7 @@ npm install goldenmatch
 - **Edge-safe core** — the matching engine runs in browsers, Workers, Vercel Edge Runtime, Deno
 - **Pure TypeScript** — no native dependencies required; peer deps unlock performance (hnswlib, ONNX, piscina)
 - **Feature parity with Python goldenmatch** — same scorers, same clustering, same YAML configs
-- **478 tests, strict TypeScript**: `noUncheckedIndexedAccess`, `exactOptionalPropertyTypes`
+- **590 tests, strict TypeScript**: `noUncheckedIndexedAccess`, `exactOptionalPropertyTypes`
 
 ## Quick Start
 
@@ -45,6 +45,58 @@ for (const record of result.goldenRecords) {
 }
 ```
 
+## Auto-Config Verification (v0.3)
+
+Auto-generated configs are now checked both before the pipeline runs and after
+scoring finishes, so you get actionable diagnostics instead of silent failures
+on edge-case data.
+
+### Preflight — six static checks
+
+When you call `autoConfigureRows(rows)`, the returned config ships with a
+`_preflightReport` summarising six config-time checks:
+
+1. **missing_column** — matchkey/blocking references a column not in the data
+2. **cardinality_high** — a column is near-unique (poor blocking signal)
+3. **cardinality_low** — a column has too few distinct values to discriminate
+4. **block_size** — a blocking key would produce oversized blocks
+5. **remote_asset** — a scorer requires a model download (gated offline)
+6. **weight_confidence** — a weighted matchkey's weights look unbalanced
+
+Many findings trigger **auto-repairs** (field dropped, scorer swapped,
+weight clamped). `hasErrors === true` on unrepairable errors raises
+`ConfigValidationError` with the full report attached.
+
+```ts
+import { autoConfigureRows, ConfigValidationError } from "goldenmatch";
+
+const cfg = autoConfigureRows(rows);
+for (const f of cfg._preflightReport!.findings) {
+  console.log(`[${f.severity}] ${f.check}/${f.subject}: ${f.message}`);
+}
+```
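
Review note: item 6 in the list above reads more concretely next to the commit message, which describes Check 6 as capping a weight at 0.5 when the classifier confidence is below 0.5 and the configured weight is above 0.5. A minimal sketch of that rule follows; `WeightedField`, the trimmed-down `ColumnProfile`, and `capLowConfidenceWeights` are hypothetical names, not the library's exports.

```typescript
// Hypothetical, trimmed-down shapes; the real ColumnProfile has more fields.
interface ColumnProfile {
  name: string;
  confidence: number;
}

interface WeightedField {
  field: string;
  weight: number;
}

const LOW_CONFIDENCE = 0.5; // threshold quoted in the commit message
const WEIGHT_CAP = 0.5; // cap quoted in the commit message

// Cap the weight of any field whose classification confidence is low.
// A no-op when no profiles are supplied, as the commit message describes.
function capLowConfidenceWeights(
  fields: WeightedField[],
  profiles?: ColumnProfile[],
): { fields: WeightedField[]; capped: string[] } {
  if (profiles === undefined) return { fields, capped: [] };
  const confidenceByName = new Map(profiles.map((p) => [p.name, p.confidence] as const));
  const capped: string[] = [];
  const out = fields.map((f) => {
    const confidence = confidenceByName.get(f.field);
    if (confidence !== undefined && confidence < LOW_CONFIDENCE && f.weight > WEIGHT_CAP) {
      capped.push(f.field);
      return { ...f, weight: WEIGHT_CAP };
    }
    return f;
  });
  return { fields: out, capped };
}
```

The sketch returns new field objects instead of mutating the input, so a caller can diff before and after to build a repair note.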
+
+Defaults are **offline-safe**: remote-asset scorers (cross-encoder, remote
+embeddings) are dropped unless you opt in with `allowRemoteAssets: true`.
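
Review note: the commit message is more specific than this paragraph about what "dropped" means for Check 5: `embedding` is swapped for `ensemble`, `record_embedding` fields are removed, `rerank` is turned off, and a matchkey left with no fields is removed. The sketch below follows that description under assumed types; `MatchkeyField`, `SketchMatchkey`, `DemoteOptions`, and `demoteRemoteScorers` are hypothetical, and the real config shape may differ.

```typescript
// Hypothetical shapes for illustration; not the goldenmatch-js config types.
interface MatchkeyField {
  field: string;
  scorer: string;
}

interface SketchMatchkey {
  name: string;
  fields: MatchkeyField[];
  rerank?: boolean;
}

interface DemoteOptions {
  allowRemoteAssets?: boolean;
  llmScorerEnabled?: boolean;
}

// Demote or drop scorers that would need a remote model download.
function demoteRemoteScorers(
  matchkeys: SketchMatchkey[],
  opts: DemoteOptions = {},
): { matchkeys: SketchMatchkey[]; removed: string[] } {
  // The caller has opted in to remote round-trips: leave the config alone.
  if (opts.allowRemoteAssets === true || opts.llmScorerEnabled === true) {
    return { matchkeys, removed: [] };
  }
  const removed: string[] = [];
  const result: SketchMatchkey[] = [];
  for (const mk of matchkeys) {
    const fields = mk.fields
      .filter((f) => f.scorer !== "record_embedding") // dropped entirely
      .map((f) => (f.scorer === "embedding" ? { ...f, scorer: "ensemble" } : f));
    if (fields.length === 0) {
      removed.push(mk.name); // corresponds to remote_asset_matchkey_empty
      continue;
    }
    result.push({ ...mk, fields, ...(mk.rerank === true ? { rerank: false } : {}) });
  }
  return { matchkeys: result, removed };
}
```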
+
+### Postflight — four runtime signals
+
+Inside `dedupe()` / `match()`, after scoring but before clustering, the
+pipeline computes four signals attached as `result.postflightReport`:
+
+1. **scoreHistogram** — 100-bin pair-score distribution
+2. **blockSizePercentiles** + **preliminaryClusterSizes** — p50/p95/p99/max
+3. **thresholdOverlapPct** — fraction of pairs near the current threshold
+4. **oversizedClusters** — components above size limit, with bottleneck pair
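
Review note: signals 1 and 3 are simple enough to picture in a few lines. The sketch below assumes pair scores lie in [0, 1] and that "near the threshold" means within a fixed band of ±0.05; the band width and both function bodies are assumptions made for illustration, since neither the README nor the commit message states them.

```typescript
// Count scores into equal-width bins over [0, 1]; a score of exactly 1 lands in the last bin.
function scoreHistogram(scores: number[], bins = 100): number[] {
  const counts = new Array<number>(bins).fill(0);
  for (const s of scores) {
    const clamped = Math.min(1, Math.max(0, s));
    const idx = Math.min(bins - 1, Math.floor(clamped * bins));
    counts[idx] = (counts[idx] ?? 0) + 1;
  }
  return counts;
}

// Fraction of scores within `band` of the threshold (band width is an assumed value).
function thresholdOverlapPct(scores: number[], threshold: number, band = 0.05): number {
  if (scores.length === 0) return 0;
  const near = scores.filter((s) => Math.abs(s - threshold) <= band).length;
  return near / scores.length;
}
```

A high overlap fraction means many pairs would flip between match and non-match under a small threshold change, which is why it is useful to surface next to the histogram.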
+
+If the score distribution is clearly bimodal, postflight proposes a
+threshold adjustment. In **strict mode** (`autoConfigureRows(rows, { strict: true })`
+or manual `_strictAutoconfig: true`) the signals are still emitted but the
+threshold is never touched — use this for reproducible CI pipelines.
+
+See `examples/verificationInspection.ts` and `examples/strictModeParity.ts`
+for runnable demos.
+
 ## Three entrypoints
 
 ```typescript

packages/goldenmatch-js/examples/README.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ npx tsx examples/<name>.ts
 | 09 | `09-llm-scorer.ts` | LLM scorer for borderline pairs (needs OPENAI_API_KEY) |
 | 10 | `10-explain.ts` | Template NL explanation of a pair match |
 | 11 | `11-evaluate.ts` | Evaluate against ground truth (precision/recall/F1) |
+| 12 | `verificationInspection.ts` | Inspect preflight findings + postflight signals |
+| 13 | `strictModeParity.ts` | Use `_strictAutoconfig` to disable runtime threshold shifts |
 
 ## Running
 

packages/goldenmatch-js/examples/strictModeParity.ts

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+/**
+ * Strict-mode auto-config: show how `_strictAutoconfig: true` disables
+ * runtime postflight threshold adjustments, yielding byte-identical-shape
+ * configs suitable for CI / reproducible pipelines.
+ *
+ * Compares two runs on the same data:
+ *   (a) normal auto-config — postflight may shift threshold
+ *   (b) strict auto-config — postflight reports signals but adjusts nothing
+ *
+ * Run: npx tsx examples/strictModeParity.ts
+ */
+import { autoConfigureRows, dedupe } from "goldenmatch";
+import type { Row } from "goldenmatch";
+
+// Bimodal synthetic data: three tight duplicate clusters + scattered singletons.
+// The bimodal score distribution is what tempts postflight to shift threshold.
+const rows: Row[] = [
+  { id: 1, name: "John Smith", email: "john@a.com", zip: "10001" },
+  { id: 2, name: "Jon Smith", email: "john@a.com", zip: "10001" },
+  { id: 3, name: "Johnny Smith", email: "JOHN@a.com", zip: "10001" },
+  { id: 4, name: "Jane Doe", email: "jane@b.com", zip: "20002" },
+  { id: 5, name: "Jane Doh", email: "jane@b.com", zip: "20002" },
+  { id: 6, name: "Janet Doe", email: "jane@b.com", zip: "20002" },
+  { id: 7, name: "Bob Jones", email: "bob@c.com", zip: "30003" },
+  { id: 8, name: "Robert Jones", email: "bob@c.com", zip: "30003" },
+  { id: 9, name: "Alice Zhang", email: "alice@d.com", zip: "40004" },
+  { id: 10, name: "Carlos Ruiz", email: "c@e.com", zip: "50005" },
+  { id: 11, name: "Dana White", email: "dana@f.com", zip: "60006" },
+  { id: 12, name: "Eve Black", email: "eve@g.com", zip: "70007" },
+];
+
+function run(
+  label: string,
+  strict: boolean,
+): void {
+  console.log("\n" + "=".repeat(60));
+  console.log(`${label} (strict=${strict})`);
+  console.log("=".repeat(60));
+
+  const cfg = autoConfigureRows(rows, { strict });
+  console.log(`_strictAutoconfig flag on config: ${cfg._strictAutoconfig === true}`);
+
+  const result = dedupe(rows, { config: cfg });
+  const post = result.postflightReport;
+
+  console.log(`clusters: ${result.stats.totalClusters}`);
+  console.log(`match rate: ${(result.stats.matchRate * 100).toFixed(1)}%`);
+
+  if (post === undefined) {
+    console.log("no postflight report");
+    return;
+  }
+  console.log(`threshold used: ${post.signals.currentThreshold}`);
+  console.log(`adjustments proposed: ${post.adjustments.length}`);
+  for (const adj of post.adjustments) {
+    console.log(
+      ` - ${adj.field}: ${adj.fromValue} -> ${adj.toValue} (${adj.reason})`,
+    );
+  }
+  if (strict && post.adjustments.length === 0) {
+    console.log(" -> strict mode: no adjustments applied (as expected)");
+  }
+  for (const adv of post.advisories) {
+    console.log(` advisory: ${adv}`);
+  }
+}
+
+run("A. normal auto-config", false);
+run("B. strict auto-config", true);
+
+console.log("\n" + "=".repeat(60));
+console.log("takeaway");
+console.log("=".repeat(60));
+console.log(
+  "Use strict=true when you need reproducible, deterministic configs " +
+    "across environments (CI, prod). Postflight still emits diagnostic " +
+    "signals, but the pipeline will not silently shift your threshold.",
+);
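
Review note: the strict-mode behaviour this example exercises can be summarised as "propose, but do not apply". The sketch below is one way to express that gate, not the `_applyPostflight` helper from the commit. `ThresholdAdjustment` mirrors the `field` / `fromValue` / `toValue` / `reason` properties the example prints, while `applyThresholdProposal`, `PostflightOutcome`, and the advisory wording are hypothetical.

```typescript
// Mirrors the adjustment properties printed by the example; the rest is assumed.
interface ThresholdAdjustment {
  field: "threshold";
  fromValue: number;
  toValue: number;
  reason: string;
}

interface PostflightOutcome {
  threshold: number;
  adjustments: ThresholdAdjustment[];
  advisories: string[];
}

// Apply a proposed threshold only when strict mode is off.
function applyThresholdProposal(
  current: number,
  proposed: number | undefined,
  strict: boolean,
): PostflightOutcome {
  if (proposed === undefined || proposed === current) {
    return { threshold: current, adjustments: [], advisories: [] };
  }
  if (strict) {
    // Assumption: strict mode surfaces the suggestion as an advisory only.
    return {
      threshold: current,
      adjustments: [],
      advisories: [`strict mode: kept threshold ${current}; suggested ${proposed}`],
    };
  }
  return {
    threshold: proposed,
    adjustments: [
      { field: "threshold", fromValue: current, toValue: proposed, reason: "bimodal score distribution" },
    ],
    advisories: [],
  };
}
```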

packages/goldenmatch-js/examples/verificationInspection.ts

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+/**
+ * Inspect auto-config verification signals — preflight + postflight.
+ *
+ * Shows how to:
+ *   1. Auto-configure a config from raw rows.
+ *   2. Read `cfg._preflightReport` before dedupe runs.
+ *   3. Run dedupe with the auto-configured config.
+ *   4. Read `result.postflightReport` produced by the pipeline.
+ *
+ * Run: npx tsx examples/verificationInspection.ts
+ */
+import { autoConfigureRows, dedupe } from "goldenmatch";
+
+// A small synthetic dataset with deliberate duplicates + noisy fields.
+const rows = [
+  { id: 1, name: "John Smith", email: "john@x.com", zip: "12345" },
+  { id: 2, name: "Jon Smith", email: "JOHN@X.COM", zip: "12345" },
+  { id: 3, name: "Jane Doe", email: "jane@y.com", zip: "54321" },
+  { id: 4, name: "Jane Doh", email: "jane@y.com", zip: "54321" },
+  { id: 5, name: "Robert Brown", email: "bob@z.com", zip: "99999" },
+  { id: 6, name: "Rob Brown", email: "bob@z.com", zip: "99999" },
+  { id: 7, name: "Alice Jones", email: "alice@a.com", zip: "11111" },
+  { id: 8, name: "Alicia Jones", email: "alice@a.com", zip: "11111" },
+];
+
+console.log("=".repeat(60));
+console.log("STEP 1 — auto-configure from rows");
+console.log("=".repeat(60));
+
+const cfg = autoConfigureRows(rows);
+console.log(`matchkeys: ${cfg.matchkeys?.length ?? 0}`);
+console.log(`blocking strategy: ${cfg.blocking?.strategy}`);
+console.log(`threshold: ${cfg.threshold}`);
+
+console.log("\n" + "=".repeat(60));
+console.log("STEP 2 — preflight report (config-time checks)");
+console.log("=".repeat(60));
+
+const pre = cfg._preflightReport;
+if (pre === undefined) {
+  console.log("no preflight report attached");
+} else {
+  console.log(`findings: ${pre.findings.length}`);
+  console.log(`configWasModified: ${pre.configWasModified}`);
+  console.log(`hasErrors: ${pre.hasErrors}`);
+  for (const f of pre.findings) {
+    console.log(
+      ` - [${f.severity}] ${f.check} / ${f.subject}: ${f.message}` +
+        (f.repaired ? ` (repaired: ${f.repairNote ?? "auto"})` : ""),
+    );
+  }
+}
+
+console.log("\n" + "=".repeat(60));
+console.log("STEP 3 — run dedupe with the auto-configured config");
+console.log("=".repeat(60));
+
+const result = dedupe(rows, { config: cfg });
+console.log(`input: ${result.stats.totalRecords} rows`);
+console.log(`clusters: ${result.stats.totalClusters}`);
+console.log(`match rate: ${(result.stats.matchRate * 100).toFixed(1)}%`);
+
+console.log("\n" + "=".repeat(60));
+console.log("STEP 4 — postflight report (runtime signals)");
+console.log("=".repeat(60));
+
+const post = result.postflightReport;
+if (post === undefined) {
+  console.log("no postflight report attached (expected when no preflight ran)");
+} else {
+  const s = post.signals;
+  console.log(`totalPairsScored: ${s.totalPairsScored}`);
+  console.log(`currentThreshold: ${s.currentThreshold}`);
+  console.log(`blockingRecall: ${s.blockingRecall}`);
+  console.log(
+    `blockSize p50/p95/p99/max: ${s.blockSizePercentiles.p50}/` +
+      `${s.blockSizePercentiles.p95}/${s.blockSizePercentiles.p99}/` +
+      `${s.blockSizePercentiles.max}`,
+  );
+  console.log(
+    `clusterSize count/p50/p95/max: ${s.preliminaryClusterSizes.count}/` +
+      `${s.preliminaryClusterSizes.p50}/${s.preliminaryClusterSizes.p95}/` +
+      `${s.preliminaryClusterSizes.max}`,
+  );
+  console.log(
+    `thresholdOverlapPct: ${(s.thresholdOverlapPct * 100).toFixed(2)}%`,
+  );
+  console.log(`oversizedClusters: ${s.oversizedClusters.length}`);
+  console.log(`adjustments applied: ${post.adjustments.length}`);
+  for (const adj of post.adjustments) {
+    console.log(
+      ` - ${adj.field}: ${adj.fromValue} -> ${adj.toValue} ` +
+        `(${adj.signal}: ${adj.reason})`,
+    );
+  }
+  for (const adv of post.advisories) {
+    console.log(` advisory: ${adv}`);
+  }
+}
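
Review note: the block-size percentiles printed by this example relate to preflight Check 4, which the commit message describes as grouping a sample of at most 10k rows by raw field concatenation and warning when p99 exceeds 5000 or p50 is below 2. A minimal sketch under stated assumptions follows: the sample is simply the first 10,000 rows, the percentile uses the nearest-rank method, and all function names are hypothetical.

```typescript
type Row = Record<string, unknown>;

// Group rows by the raw concatenation of the blocking fields and
// return the block sizes in ascending order.
function blockSizes(rows: Row[], fields: string[], sampleLimit = 10_000): number[] {
  const counts = new Map<string, number>();
  // Assumption: the sample is the first `sampleLimit` rows.
  for (const row of rows.slice(0, sampleLimit)) {
    const key = fields.map((f) => String(row[f] ?? "")).join("|");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.values()).sort((a, b) => a - b);
}

// Nearest-rank percentile over an ascending array (0 for an empty array).
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? 0;
}

// Apply the two limits quoted in the commit message.
function blockSizeWarnings(sizes: number[]): string[] {
  const warnings: string[] = [];
  if (percentile(sizes, 99) > 5000) warnings.push("p99 block size exceeds 5000");
  if (percentile(sizes, 50) < 2) warnings.push("p50 block size is below 2");
  return warnings;
}
```

As the commit message notes, raw-value grouping is only a coarse proxy for the transform-applied blocker, so a sketch like this would flag obvious problems such as an ID column used as a blocking key but not subtler ones.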

packages/goldenmatch-js/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "goldenmatch",
-  "version": "0.1.0",
+  "version": "0.3.0",
   "description": "Entity resolution toolkit — deduplicate, match, and create golden records",
   "type": "module",
   "exports": {
