Scoring & Survivorship Quality Upgrade — MST splitting, quality-weighted golden records, field provenance #31
benzsevern announced in Announcements
Coming in the next release
PR #30 adds major improvements to cluster quality and golden record generation.
Cluster Quality
- max_cluster_size.
- cluster_quality: "strong", "weak", or "split". Weak clusters (large gap between min and avg edge weight) get confidence downgraded by 0.7x.
- auto_split, quality_weighting, and weak_cluster_threshold in GoldenRulesConfig.
Quality-Weighted Survivorship
All 5 merge strategies now accept optional quality weights from GoldenCheck:
- most_complete: ties broken by source quality
- majority_vote: votes weighted by quality
- source_priority: quality as tiebreaker within same priority level
- most_recent: confidence scales by quality
- first_non_null: picks highest-quality source first
When GoldenCheck is not installed, everything works exactly as before (optional-but-rewarded).
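To illustrate what "votes weighted by quality" could mean for majority_vote, here is a minimal sketch. It is not the code from PR #30: the function name `weighted_majority_vote`, the `(source, value)` candidate shape, and the per-source quality dict are all assumptions made for the example. Passing no weights falls back to a plain majority vote, which mirrors the behaviour described when GoldenCheck is not installed.

```python
from collections import defaultdict


def weighted_majority_vote(candidates, quality=None):
    """Return the value with the highest total vote weight.

    candidates: list of (source, value) pairs; None values are skipped.
    quality: optional dict mapping source -> weight (hypothetical shape).
             When omitted, every vote counts as 1.0 (plain majority vote).
    """
    totals = defaultdict(float)
    for source, value in candidates:
        if value is None:
            continue
        # Sources missing from the quality dict keep the default weight.
        weight = 1.0 if quality is None else quality.get(source, 1.0)
        totals[value] += weight
    if not totals:
        return None
    # On a tie, max() keeps the value that was seen first.
    return max(totals, key=totals.get)


candidates = [("crm", "Jon Smith"), ("web", "Jon Smith"), ("erp", "John Smith")]

# Unweighted: "Jon Smith" wins 2 votes to 1.
print(weighted_majority_vote(candidates))

# Weighted: the single high-quality source (0.9) outvotes two weak ones (0.3 + 0.3).
print(weighted_majority_vote(candidates, {"crm": 0.3, "web": 0.3, "erp": 0.9}))
```

The same pattern (an optional weight that defaults to a neutral value) is one way to keep the other strategies backward compatible as well.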
Field-Level Provenance
build_golden_record_with_provenance() tracks, per cluster, which source row contributed each field value, what strategy was used, and what candidates were available. Provenance is serialized in the lineage JSON sidecar as a golden_records section.
Backward Compatible
All new parameters have defaults. Existing callers unchanged. 85 tests pass.
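For readers wondering what field-level provenance might look like in practice, the sketch below records, for each field, the contributing source row, the strategy used, and the candidates considered. It is an illustration only: the helper `first_non_null_with_provenance`, the `row_id` key, and the JSON layout are hypothetical, and the actual output of build_golden_record_with_provenance() and the sidecar schema may differ.

```python
import json


def first_non_null_with_provenance(rows, fields):
    """Merge rows field by field and record where each value came from.

    rows: list of dicts, each carrying a 'row_id' plus field values
          (hypothetical input shape for this sketch).
    Returns (record, provenance).
    """
    record, provenance = {}, {}
    for field in fields:
        # Every non-null value is a candidate for this field.
        candidates = [
            {"row_id": row["row_id"], "value": row.get(field)}
            for row in rows
            if row.get(field) is not None
        ]
        winner = candidates[0] if candidates else None
        record[field] = winner["value"] if winner else None
        provenance[field] = {
            "source_row": winner["row_id"] if winner else None,
            "strategy": "first_non_null",
            "candidates": candidates,
        }
    return record, provenance


rows = [
    {"row_id": 1, "name": "Jon Smith", "email": None},
    {"row_id": 2, "name": "John Smith", "email": "john@example.com"},
]
record, provenance = first_non_null_with_provenance(rows, ["name", "email"])

# Assumed sidecar layout: one entry per cluster under "golden_records".
sidecar = {"golden_records": [{"record": record, "provenance": provenance}]}
print(json.dumps(sidecar, indent=2))
```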
PR: #30