Database Integration
GoldenMatch can run entity resolution against live Postgres databases, with incremental matching so only new records are processed on each run.
Install with Postgres support:

    pip install "goldenmatch[postgres]"

Then run a sync:

    # First run: full scan, creates metadata tables
    goldenmatch sync \
      --table customers \
      --connection-string "postgresql://user:pass@localhost:5432/mydb" \
      --config config.yaml

    # Subsequent runs: incremental (only new records)
    goldenmatch sync \
      --table customers \
      --connection-string "$DATABASE_URL"

The source can be specified on the command line:

    goldenmatch sync --source-type postgres --connection-string "$DATABASE_URL" --table customers

in the config file:

    source:
      type: postgres
      connection: postgresql://user:pass@host:5432/mydb
      table: customers
      incremental_column: updated_at

or through an environment variable:

    export GOLDENMATCH_DATABASE_URL=postgresql://user:pass@host/db
    goldenmatch sync --table customers

How incremental sync works:

- First run: Full table scan → match all records → build clusters → create golden records
- Second run: Read only records added since last run → match against existing clusters → update golden records
- Progressive embedding: Each run embeds 100K existing records in the background. ANN blocking becomes available once 10% of records are embedded.
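
The incremental read described above can be illustrated with a watermark query. This is a minimal sketch, not GoldenMatch's internal code: it uses Python's built-in `sqlite3` in place of Postgres so it runs anywhere, and `fetch_new_rows` is a hypothetical helper name.

```python
import sqlite3

def fetch_new_rows(conn, table, incremental_column, watermark):
    """Return rows whose incremental column is past the stored watermark.

    Hypothetical helper. Identifiers cannot be bound as SQL parameters,
    so this sketch assumes table and column names come from trusted config.
    """
    if watermark is None:
        # First run: no watermark yet, so scan the full table.
        sql = f"SELECT * FROM {table} ORDER BY {incremental_column}"
        params = ()
    else:
        # Later runs: read only rows newer than the last watermark.
        sql = (f"SELECT * FROM {table} WHERE {incremental_column} > ? "
               f"ORDER BY {incremental_column}")
        params = (watermark,)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "2024-01-01"), (2, "Bob", "2024-01-02")],
)

first_run = fetch_new_rows(conn, "customers", "updated_at", None)
watermark = max(row[2] for row in first_run)  # stored after the run completes

conn.execute("INSERT INTO customers VALUES (3, 'Cy', '2024-01-03')")
second_run = fetch_new_rows(conn, "customers", "updated_at", watermark)
```

Here the second call returns only the row inserted after the first run, which is the behaviour the incremental column is meant to provide.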
For each new record, GoldenMatch uses two blocking strategies in parallel:
- SQL blocking: Translates blocking keys into WHERE clauses (soundex, substring, exact)
- ANN blocking: Queries a persistent FAISS index for semantically similar records
Results are unioned for maximum recall.
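
To make the two strategies concrete, the sketch below shows one way a blocking key could become a parameterized WHERE fragment, and how the two candidate sets are unioned. The function names, the `length` parameter, and the exact SQL generated are assumptions for illustration; the SQL GoldenMatch actually emits may differ.

```python
def blocking_where(key_type, column, value, length=3):
    """Build a parameterized WHERE fragment for one blocking key.

    Hypothetical helper: returns (sql_fragment, params).
    """
    if key_type == "exact":
        return f"{column} = %s", [value]
    if key_type == "substring":
        # Compare only the first `length` characters of the column.
        return (f"substring({column} from 1 for {int(length)}) = %s",
                [value[:length]])
    if key_type == "soundex":
        # soundex() comes from the PostgreSQL fuzzystrmatch extension.
        return f"soundex({column}) = soundex(%s)", [value]
    raise ValueError(f"unknown blocking key type: {key_type}")

def union_candidates(sql_ids, ann_ids):
    """Union candidate record IDs from both strategies, dropping duplicates."""
    return set(sql_ids) | set(ann_ids)

clause, params = blocking_where("substring", "last_name", "Smith")
candidates = union_candidates([101, 102], [102, 205])
```

A record found by either strategy ends up in `candidates`, so a miss in one strategy does not cost recall as long as the other finds the pair.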
GoldenMatch creates and manages these tables (all prefixed with gm_):
Tracks processing state for incremental sync.
| Column | Purpose |
|---|---|
| source_table | Which table was processed |
| last_processed_at | When last run completed |
| last_incremental_value | Watermark for incremental detection |
| config_hash | Detects config changes |
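
A config hash of this kind is commonly computed over a canonical serialization, so that key order does not matter but any value change does. The sketch below assumes SHA-256 over sorted-key JSON; the hashing scheme GoldenMatch actually uses is not stated here, and `config_hash` as a function name is hypothetical.

```python
import hashlib
import json

def config_hash(config):
    """Hash a config dict so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = config_hash({"threshold": 0.85, "blocking": ["soundex"]})
b = config_hash({"blocking": ["soundex"], "threshold": 0.85})  # same config, reordered
c = config_hash({"threshold": 0.9, "blocking": ["soundex"]})   # threshold changed
```

With this scheme `a` and `b` are equal while `c` differs, which is enough to detect that the matching config changed between runs.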
Persistent cluster membership.
| Column | Purpose |
|---|---|
| cluster_id | Cluster identifier |
| record_id | Member record ID |
| source_table | Source table |
| run_id | Which sync run added this |
gm_golden_records: Versioned golden records with append-only history.
| Column | Purpose |
|---|---|
| cluster_id | Which cluster |
| source_ids | All member record IDs |
| record_data | Merged field values (JSONB) |
| is_current | TRUE for latest version |
| version | Version number |
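
The versioning scheme can be sketched as follows: each update inserts a new row with the next version number and clears is_current on the previous row, so older versions stay queryable. This is an illustration under assumptions, using `sqlite3` with simplified column types (TEXT instead of JSONB, 0/1 instead of BOOLEAN); `append_golden_version` is a hypothetical name.

```python
import json
import sqlite3

def append_golden_version(conn, cluster_id, source_ids, record_data):
    """Insert a new golden-record version and mark it as the current one."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM gm_golden_records "
        "WHERE cluster_id = ?",
        (cluster_id,),
    ).fetchone()
    next_version = row[0] + 1
    # Older versions are kept; only their is_current flag is cleared.
    conn.execute(
        "UPDATE gm_golden_records SET is_current = 0 "
        "WHERE cluster_id = ? AND is_current = 1",
        (cluster_id,),
    )
    conn.execute(
        "INSERT INTO gm_golden_records VALUES (?, ?, ?, 1, ?)",
        (cluster_id, json.dumps(source_ids), json.dumps(record_data), next_version),
    )
    return next_version

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gm_golden_records ("
    "cluster_id INTEGER, source_ids TEXT, record_data TEXT, "
    "is_current INTEGER, version INTEGER)"
)
append_golden_version(conn, 42, [1], {"name": "Ann"})
append_golden_version(conn, 42, [1, 7], {"name": "Ann Lee"})

history = conn.execute(
    "SELECT version, is_current FROM gm_golden_records "
    "WHERE cluster_id = 42 ORDER BY version"
).fetchall()
```

After two updates, cluster 42 has two rows and only the latest has is_current set.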
Query current golden records:
    SELECT * FROM gm_golden_records WHERE is_current = TRUE;

Query cluster history:

    SELECT * FROM gm_golden_records WHERE cluster_id = 42 ORDER BY version;

Cached embeddings for ANN blocking.
Audit trail of all match decisions.
| Column | Purpose |
|---|---|
| record_id_a, record_id_b | Matched pair |
| score | Match score |
| action | merged, new, conflict, skipped |
| run_id | Which sync run |
Results written to gm_golden_records. Source table is never modified.

    output:
      mode: separate

Adds __cluster_id__ and __is_golden__ columns to the source table.

    output:
      mode: in_place

When a new record matches an existing cluster:
- Single match: Record is added to the cluster, golden record re-computed
- Multiple cluster match: If the merged size is ≤ max_cluster_size, the clusters are merged. Otherwise, the record is assigned to the best-scoring cluster and a conflict is logged.
- No match: New single-record cluster created
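
The three cases above can be sketched as a small decision function. The action labels, the function name `decide`, and the assumption that merged size means the combined size of all matched clusters plus the new record are illustrative only and not taken from the GoldenMatch source.

```python
def decide(matches, cluster_sizes, max_cluster_size):
    """Choose what to do with a new record.

    matches: dict of cluster_id -> best match score for the new record.
    cluster_sizes: dict of cluster_id -> current member count.
    Returns (action, cluster_ids). Action labels are hypothetical.
    """
    if not matches:
        return ("new_cluster", [])
    if len(matches) == 1:
        return ("add_to_cluster", list(matches))
    # Assumption: merged size = all matched clusters plus the new record.
    merged_size = sum(cluster_sizes[c] for c in matches) + 1
    if merged_size <= max_cluster_size:
        return ("merge_clusters", sorted(matches))
    # Too large to merge: keep clusters apart, pick the best-scoring one.
    best = max(matches, key=matches.get)
    return ("conflict", [best])

no_match = decide({}, {}, 10)
single = decide({7: 0.9}, {7: 3}, 10)
merged = decide({7: 0.9, 8: 0.8}, {7: 3, 8: 2}, 10)
too_big = decide({7: 0.9, 8: 0.95}, {7: 6, 8: 5}, 10)
```

In the last call the merged size would be 12, which exceeds the limit of 10, so the record goes to cluster 8 (the higher score) and the case is flagged as a conflict.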
| Flag | Description |
|---|---|
| --source-type | Database type (postgres) |
| --connection-string | Database URL |
| --table | Source table name |
| --config | Matching config YAML |
| --output-mode | separate or in_place |
| --full-rescan | Force reprocess all records |
| --dry-run | Match without writing |
| --incremental-column | Column for incremental detection |
| --chunk-size | Records per chunk (default 10000) |
⚡ GoldenMatch — Entity resolution toolkit | PyPI | GitHub | Open in Colab | MIT License