Scoring & Survivorship Quality Upgrade — MST splitting, quality-weighted golden records, field provenance #31

benzsevern (Maintainer) · 2026-04-06T12:45:08Z

Coming in the next release

PR #30 adds major improvements to cluster quality and golden record generation.

Cluster Quality

MST-based auto-splitting: oversized clusters are automatically split by removing the weakest edge in the minimum spanning tree. Removing a tree edge is guaranteed to disconnect the cluster into two parts. Splitting recurses until every sub-cluster is within max_cluster_size.
Cluster quality labels: every cluster gets a cluster_quality of "strong", "weak", or "split". Weak clusters (those with a large gap between minimum and average edge weight) have their confidence multiplied by 0.7.
Configurable: auto_split, quality_weighting, and weak_cluster_threshold are set in GoldenRulesConfig.
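
To make the splitting and labeling rules concrete, here is a minimal standalone sketch. It is not the code from PR #30. It assumes edge weights are similarity scores (higher means stronger), so the spanning tree is built from the strongest edges first, which is equivalent to a minimum spanning tree over distance = 1 - similarity. The function names, the exact "gap greater than threshold" rule, and the sample threshold of 0.3 are all hypothetical.

```python
from collections import defaultdict


def strongest_spanning_tree(nodes, edges):
    """Kruskal's algorithm, taking the highest-similarity edges first."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for edge in sorted(edges, key=lambda e: e[2], reverse=True):
        root_a, root_b = find(edge[0]), find(edge[1])
        if root_a != root_b:  # edge joins two separate trees, so keep it
            parent[root_a] = root_b
            tree.append(edge)
    return tree


def connected_components(nodes, edges):
    """Group nodes into connected sets using only the given edges."""
    adjacency = defaultdict(set)
    for a, b, _ in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, parts = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node not in part:
                part.add(node)
                stack.extend(adjacency[node] - part)
        seen |= part
        parts.append(part)
    return parts


def split_cluster(nodes, edges, max_cluster_size):
    """Recursively cut the weakest spanning-tree edge until every part fits."""
    nodes = set(nodes)
    if len(nodes) <= max_cluster_size:
        return [nodes]
    tree = strongest_spanning_tree(nodes, edges)
    if not tree:  # no edges at all: every node becomes its own cluster
        return [{n} for n in sorted(nodes)]
    weakest = min(tree, key=lambda e: e[2])
    kept = [e for e in tree if e is not weakest]
    result = []
    for part in connected_components(nodes, kept):
        inner = [e for e in edges if e[0] in part and e[1] in part]
        result.extend(split_cluster(part, inner, max_cluster_size))
    return result


def quality_label(edge_weights, weak_cluster_threshold):
    """Assumed rule: 'weak' when average minus minimum weight exceeds the threshold."""
    gap = sum(edge_weights) / len(edge_weights) - min(edge_weights)
    return "weak" if gap > weak_cluster_threshold else "strong"


# Edges are (record_a, record_b, similarity) tuples.
edges = [("a", "b", 0.95), ("b", "c", 0.90), ("a", "c", 0.85),
         ("c", "d", 0.40), ("d", "e", 0.92)]
parts = split_cluster(["a", "b", "c", "d", "e"], edges, max_cluster_size=3)
```

In this example the only link between the {a, b, c} group and the {d, e} group is the 0.40 edge, so that is the edge the split removes. Under the assumed labeling rule, the original five-record cluster would also have been flagged as weak, because its lowest edge sits far below the average.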

Quality-Weighted Survivorship

All 5 merge strategies now accept optional quality weights from GoldenCheck:

most_complete — ties broken by source quality
majority_vote — votes weighted by quality
source_priority — quality as tiebreaker within same priority level
most_recent — confidence scales by quality
first_non_null — picks highest-quality source first

When GoldenCheck is not installed, every strategy behaves exactly as before: quality weighting is optional but rewarded.
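
As an illustration of what "votes weighted by quality" can mean for majority_vote, here is a minimal sketch under stated assumptions. The function name, the (source, value) input shape, and the confidence formula are hypothetical and are not taken from the library. Sources without a quality score default to a weight of 1.0, which reproduces a plain majority vote when no scores are supplied.

```python
from collections import defaultdict


def weighted_majority_vote(candidates, quality=None):
    """Pick the value with the largest total quality weight.

    candidates: list of (source_name, value) pairs.
    quality: optional dict of source_name -> weight; missing sources count as 1.0.
    Returns (winning_value, share_of_total_weight).
    """
    quality = quality or {}
    totals = defaultdict(float)
    for source, value in candidates:
        if value is None:  # nulls do not vote
            continue
        totals[value] += quality.get(source, 1.0)
    if not totals:
        return None, 0.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


candidates = [("crm", "Acme Corp"), ("web", "ACME"), ("legacy", "ACME")]

# Without quality scores, the two "ACME" votes win.
plain_winner, plain_conf = weighted_majority_vote(candidates)

# With scores, the single high-quality source outweighs the two weaker ones.
scores = {"crm": 0.9, "web": 0.3, "legacy": 0.4}
weighted_winner, weighted_conf = weighted_majority_vote(candidates, scores)
```

The same default-to-1.0 pattern is one way to keep a strategy unchanged for callers that never pass quality weights.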

Field-Level Provenance

build_golden_record_with_provenance() tracks, per cluster, which source row contributed each field value, what strategy was used, and what candidates were available. Provenance is serialized in the lineage JSON sidecar as a golden_records section.
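
The shape of such a provenance entry can be sketched as follows. This is not the real build_golden_record_with_provenance() or its output schema: the helper name, the row_id key, and every JSON key other than the golden_records section name are assumptions made for illustration. first_non_null is used as the strategy because it is the simplest to show.

```python
import json


def first_non_null_with_provenance(cluster_id, rows, fields):
    """Build one golden record and note where each field value came from."""
    record, provenance = {}, {}
    for name in fields:
        # Every row in the cluster is a candidate for this field.
        candidates = [{"row_id": r["row_id"], "value": r.get(name)} for r in rows]
        chosen = next((c for c in candidates if c["value"] is not None), None)
        record[name] = chosen["value"] if chosen else None
        provenance[name] = {
            "source_row": chosen["row_id"] if chosen else None,
            "strategy": "first_non_null",
            "candidates": candidates,
        }
    return {"cluster_id": cluster_id, "record": record, "provenance": provenance}


rows = [
    {"row_id": 1, "name": "Ann Lee", "email": None},
    {"row_id": 2, "name": "Ann L.", "email": "ann@example.com"},
]
golden = first_non_null_with_provenance(7, rows, ["name", "email"])

# Hypothetical sidecar layout: provenance lives under a golden_records section.
sidecar = json.dumps({"golden_records": [golden]}, indent=2)
```

Here the name comes from row 1 and the email from row 2, and the recorded candidates make it possible to audit why each value was chosen.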

Backward Compatible

All new parameters have defaults, so existing callers are unchanged. 85 tests pass.

PR: #30
