| 1 | +--- |
| 2 | +name: entity-resolution |
| 3 | +title: Resolve Entities |
| 4 | +summary: End-to-end entity resolution pipeline using Snowflake Cortex AI Functions to match, link, and dedupe records. |
| 5 | +description: Use when you need to match records across datasets, deduplicate within a dataset, build a golden record, or link source records to a reference corpus. Orchestrates profiling, normalization, blocking, multi-tier matching (deterministic, fuzzy, AI-judged, agentic, contrastive), human review, and operationalization via dynamic tables. Industry-agnostic with optional domain profiles for pharma, financial services, retail/CPG, healthcare, and insurance. Triggers: entity resolution, record matching, deduplication, record linkage, fuzzy matching, golden record, master data, MDM, merge records, match entities, link records, dedupe, duplicate detection, identity resolution. |
| 6 | +tools: |
| 7 | + - snowflake_sql_execute |
| 8 | + - Bash |
| 9 | + - Read |
| 10 | + - Write |
| 11 | + - Edit |
| 12 | + - Glob |
| 13 | + - Grep |
| 14 | +prompt: Help me resolve entities across my customer and prospect tables. |
| 15 | +language: en |
| 16 | +status: Published |
| 17 | +author: Snowflake Solutions Team |
| 18 | +type: snowflake |
| 19 | +--- |
| 20 | + |
| 21 | +# Resolve Entities |
| 22 | + |
| 23 | +## Overview |
| 24 | + |
| 25 | +Determine whether records refer to the same real-world entity. This skill orchestrates Cortex AI Functions, dynamic tables, Streamlit, and optionally Cortex Agents and Snowflake ML through a structured pipeline: profile → normalize → block → match → score → review → operationalize. |
| 26 | + |
| 27 | +Two workflow paths: |
| 28 | + |
| 29 | +- **Path A — Pair-Based Matching (Deduplication / Cross-Match):** generate candidate pairs via blocking, score through 3 tiers. |
| 30 | +- **Path B — Agentic Matching (Entity Linking):** resolve source records against a reference corpus via tiered escalation. |
| 31 | + |
| 32 | +Optional add-on: |
| 33 | + |
| 34 | +- **Contrastive Embeddings:** train a domain-adapted encoder via SupConLoss when labeled data and GPU compute are available. |
| 35 | + |
| 36 | +## When to Use |
| 37 | + |
| 38 | +- Match or link records across datasets |
| 39 | +- Deduplicate records within a single dataset |
| 40 | +- Build a golden record from multiple sources |
| 41 | +- Assess match-readiness of source data |
| 42 | +- Operationalize an ongoing matching pipeline |
| 43 | + |
| 44 | +For unstructured inputs (PDFs, scans), call `cortex-ai-functions` (`AI_EXTRACT`, `AI_PARSE_DOCUMENT`) first, then feed structured output here. |
| 45 | + |
| 46 | +## Domain Profiles |
| 47 | + |
| 48 | +Load the matching profile from `references/profiles/`: |
| 49 | + |
| 50 | +| Keywords | Profile | Tier 1 IDs | |
| 51 | +|---|---|---| |
| 52 | +| pharma, NPI, DEA, NCPDP | `pharma.md` | NPI, DEA, NCPDP | |
| 53 | +| bank, KYC, AML, LEI | `financial-services.md` | LEI, DUNS, Tax ID, CRD | |
| 54 | +| retail, CPG, GTIN, UPC | `retail-cpg.md` | GTIN/UPC, GLN, Supplier ID | |
| 55 | +| provider, hospital, taxonomy | `healthcare-provider.md` | NPI, Taxonomy Code | |
| 56 | +| insurance, payer, NAIC | `insurance-payer.md` | NAIC, CMS Payer ID, Plan ID | |
| 57 | +| *(none)* | `generic.md` | Name + address only | |
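The keyword-to-profile lookup in the table above can be sketched as a small function. This is only an illustration: `select_profile` and `PROFILE_KEYWORDS` are hypothetical names, and the first-match-wins ordering and whole-word matching are assumptions that the table does not specify.

```python
import re

# Keyword -> profile file mapping, copied from the Domain Profiles table.
# Dict order is the table order; the first profile with a keyword hit wins (assumption).
PROFILE_KEYWORDS = {
    "pharma.md": ["pharma", "npi", "dea", "ncpdp"],
    "financial-services.md": ["bank", "kyc", "aml", "lei"],
    "retail-cpg.md": ["retail", "cpg", "gtin", "upc"],
    "healthcare-provider.md": ["provider", "hospital", "taxonomy"],
    "insurance-payer.md": ["insurance", "payer", "naic"],
}


def select_profile(user_text: str) -> str:
    """Return the profile file whose keywords appear in the user's request."""
    # Whole-word, case-insensitive matching so "lei" does not fire on "leisure".
    tokens = set(re.findall(r"[a-z0-9]+", user_text.lower()))
    for profile, keywords in PROFILE_KEYWORDS.items():
        if tokens & set(keywords):
            return profile
    # No keyword matched: fall back to the generic name + address profile.
    return "generic.md"


if __name__ == "__main__":
    print(select_profile("Dedupe our KYC customer list"))
    print(select_profile("Match two vendor tables"))
```

Whole-word matching keeps short keywords such as `lei` or `dea` from firing inside longer words, at the cost of missing plurals like "providers".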
| 58 | + |
| 59 | +## Workflow |
| 60 | + |
| 61 | +### Step 0: Discovery |
| 62 | + |
| 63 | +Use `ask_user_question` to collect: goal (dedupe / cross-match / link to reference / profile-only), domain, source and reference tables, volume, output schema, HITL preference, pipeline cadence (one-time vs ongoing), data language, and approach (recommended / specific / benchmark multiple). Present a discovery summary. |
| 64 | + |
| 65 | +⚠️ STOPPING POINT: User confirms the summary before any work begins. |
| 66 | + |
| 67 | +### Step 1: Profiling |
| 68 | + |
| 69 | +Delegate to `profiling-tables`. ER-specific checks: |
| 70 | + |
| 71 | +1. Identifier detection from the loaded domain profile (Tier 1 candidates) |
| 72 | +2. Name/address column detection via `AI_CLASSIFY` |
| 73 | +3. Completeness, format consistency, duplicate density, volume |
| 74 | + |
| 75 | +⚠️ STOPPING POINT: Present profiling report and match-readiness checklist. |
| 76 | + |
| 77 | +### Step 1b: Cost Estimation |
| 78 | + |
| 79 | +Load `references/templates/cost-estimator.md`. Estimate based on volume, path, expected tier distribution, and warehouse sizing. |
| 80 | + |
| 81 | +⚠️ STOPPING POINT: User acknowledges and accepts the estimate. |
| 82 | + |
| 83 | +### Step 2: Normalization |
| 84 | + |
| 85 | +Delegate to `cortex-ai-functions` for `AI_EXTRACT` (address parsing) and `AI_COMPLETE` (edge-case names). Load `references/templates/normalization.md`. Materialize `normalized_entities` with `source_id`, normalized fields, raw originals, and `blocking_key`. |
| 86 | + |
| 87 | +⚠️ STOPPING POINT: Validate 10–20 sample rows, NULL counts per normalized field, and identifier format compliance. |
| 88 | + |
| 89 | +### Step 3: Blocking |
| 90 | + |
| 91 | +Load `references/templates/blocking.md`. Reduce O(n²) pair space via blocking keys (geographic, category, phonetic, or ID prefix). Self-join within blocks: `a.source_id < b.source_id`. |
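To make the blocking idea concrete, here is a minimal in-memory sketch of the self-join with the `a.source_id < b.source_id` guard. It is not the SQL in `references/templates/blocking.md`; the function names and the `pair_fraction` statistic are hypothetical, and `source_id` values are assumed to be unique and orderable.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """records: list of (source_id, blocking_key). Returns sorted (a, b) pairs with a < b."""
    blocks = defaultdict(list)
    for source_id, key in records:
        if key is not None:  # records without a blocking key generate no pairs
            blocks[key].append(source_id)
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if a < b:  # same guard as the SQL self-join: no self-pairs, no mirrored pairs
                pairs.add((a, b))
    return sorted(pairs)


def blocking_stats(records, pairs):
    """Compare the candidate pair count with the full n*(n-1)/2 pair space."""
    n = len(records)
    all_pairs = n * (n - 1) // 2
    fraction = len(pairs) / all_pairs if all_pairs else 0.0
    return {"candidate_pairs": len(pairs), "all_pairs": all_pairs, "pair_fraction": fraction}


if __name__ == "__main__":
    rows = [(1, "NY"), (2, "NY"), (3, "CA"), (4, "NY"), (5, None)]
    pairs = candidate_pairs(rows)
    print(pairs)
    print(blocking_stats(rows, pairs))
```

The `a < b` guard emits each unordered pair once, which halves the work compared with a plain inequality join.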
| 92 | + |
| 93 | +⚠️ STOPPING POINT: Review blocking statistics. Red flags: any block > 100K pairs, candidate pairs exceeding 10% of the full pair space (too little reduction), or so few pairs that true matches are likely being missed. |
| 94 | + |
| 95 | +### Step 4: Multi-Tier Matching (Path A) |
| 96 | + |
| 97 | +Load `references/templates/matching.md`. |
| 98 | + |
| 99 | +- **Tier 1 — Deterministic:** exact match on authoritative IDs. Pure SQL, no AI cost. |
| 100 | +- **Tier 2 — Fuzzy:** `AI_EMBED` (`snowflake-arctic-embed-l-v2.0`) + `VECTOR_COSINE_SIMILARITY`, supplemented by `JAROWINKLER_SIMILARITY` on name/street (it returns an integer from 0 to 100, so divide by 100 before comparing with cosine scores). Starting thresholds: ≥ 0.92 match, ≥ 0.80 and < 0.92 probable_match, < 0.80 no_match. |
| 101 | +- **Tier 3 — AI-Judged:** `AI_CLASSIFY` on Tier 2 `probable_match` rows only (cost control). |
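The tier cascade above can be read as a routing rule per candidate pair. The sketch below is an assumption-laden illustration, not the logic in `references/templates/matching.md`: the `id_a`, `id_b`, and `score` fields and both function names are hypothetical, and `score` is assumed to be a single similarity already scaled to 0–1.

```python
MATCH_THRESHOLD = 0.92      # starting point from the text; tune against a labeled sample
PROBABLE_THRESHOLD = 0.80   # starting point from the text


def tier2_decision(score: float) -> str:
    """Map a 0-1 similarity score to the three Tier 2 decisions."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= PROBABLE_THRESHOLD:
        return "probable_match"
    return "no_match"


def route_pair(pair: dict) -> dict:
    """Apply Tier 1 first, then Tier 2; flag only probable matches for Tier 3."""
    # Tier 1: exact match on an authoritative ID, when both sides have one.
    if pair.get("id_a") and pair.get("id_a") == pair.get("id_b"):
        return {"tier": 1, "decision": "match", "needs_ai_review": False}
    # Tier 2: threshold the fuzzy score.
    decision = tier2_decision(pair["score"])
    # Tier 3 (AI-judged) sees only the uncertain middle band, for cost control.
    return {"tier": 2, "decision": decision, "needs_ai_review": decision == "probable_match"}


if __name__ == "__main__":
    print(route_pair({"id_a": "123", "id_b": "123", "score": 0.40}))
    print(route_pair({"id_a": None, "id_b": None, "score": 0.85}))
```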
| 102 | + |
| 103 | +Resolve transitive matches and assign `entity_group_id` to connected components. |
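One common way to resolve transitive matches is a union-find (disjoint-set) pass over the matched pairs. The sketch below assumes that approach; the skill's own templates may do this differently, for example in SQL. `assign_entity_groups` is a hypothetical name, and using the smallest member ID as the `entity_group_id` is an illustrative choice.

```python
def assign_entity_groups(record_ids, matched_pairs):
    """Return {record_id: entity_group_id} for the connected components of the match graph."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return
        # Attach the larger root under the smaller one, so each group's root
        # is always its smallest member and group IDs are deterministic.
        if root_b < root_a:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a

    for a, b in matched_pairs:
        union(a, b)
    return {r: find(r) for r in record_ids}


if __name__ == "__main__":
    groups = assign_entity_groups(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
    print(groups)
```

Here A=B and B=C place A, B, and C in one group even though the pair (A, C) was never scored, while D stays in its own singleton group.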
| 104 | + |
| 105 | +⚠️ STOPPING POINT: Review counts by tier and decision, sample rows, and threshold tuning (adjust in 0.02–0.03 increments against a labeled sample). |
| 106 | + |
| 107 | +### Step 4b: Agentic Matching (Path B) |
| 108 | + |
| 109 | +Load `references/templates/agentic-matching.md`. Prerequisites: normalized entities, embeddings on source and reference, top-N candidates, Cortex Search Service over the reference corpus (`references/templates/search-service.md`), and a semantic model YAML. |
| 110 | + |
| 111 | +- **Tier 1 — High-Confidence Triage:** cosine + Jaro-Winkler with domain guards (chain stores, multi-tenant buildings, address floor). |
| 112 | +- **Tier 1.5 — Search + Classify:** Cortex Search top-N, then `AI_COMPLETE` (cost-effective model). Accept a candidate only when confidence ≥ 0.80 and the name and address align. |
| 113 | +- **Tier 2 — Cortex Agent:** 3 tools — `cortex_search` (primary), `cortex_analyst_text_to_sql` (fallback), `web_search` (last resort). Budget: 6 tool calls, 90s, 16K tokens per entity. See `references/templates/agent-definition.md` and `references/templates/orchestration.md`. Delegate to `cortex-agent`. |
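The per-entity budget in Tier 2 (6 tool calls, 90s, 16K tokens) can be tracked with a small counter object in the orchestration code. This is a hypothetical sketch rather than part of any Cortex Agents API: `EntityBudget`, `charge`, and `exhausted` are made-up names, and the caller is assumed to report token usage after each tool call.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityBudget:
    """Tracks tool calls, tokens, and wall-clock time spent on one entity."""
    max_tool_calls: int = 6
    max_seconds: float = 90.0
    max_tokens: int = 16_000
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one completed tool call and the tokens it consumed."""
        self.tool_calls += 1
        self.tokens += tokens_used

    def exhausted(self) -> Optional[str]:
        """Return the name of the first limit reached, or None if budget remains."""
        if self.tool_calls >= self.max_tool_calls:
            return "tool_calls"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return "wall_clock"
        return None


if __name__ == "__main__":
    budget = EntityBudget()
    while budget.exhausted() is None:
        budget.charge(tokens_used=1_000)  # stand-in for a real tool call
    print(budget.exhausted(), budget.tool_calls, budget.tokens)
```

An orchestrator would check `exhausted()` before each tool call and record the entity as unresolved once any limit is hit.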
| 114 | + |
| 115 | +UNION ALL all tiers into a crosswalk with tier attribution. Record entities that are confirmed active but missing from the reference corpus as discoveries. |
| 116 | + |
| 117 | +⚠️ STOPPING POINT: Review per-tier match/closure/discovery distributions and web search usage. |
| 118 | + |
| 119 | +### Step 4c: Contrastive Embeddings (Standalone or Tier 2 Replacement) |
| 120 | + |
| 121 | +Load `references/templates/contrastive-embeddings.md`. Prerequisites: ≥ 500 labeled entities across ≥ 200 clusters, GPU pool (`GPU_NV_S`), `PYPI_EAI` and `HF_EAI` external access integrations. |
| 122 | + |
| 123 | +1. **Model selection:** English-only → `roberta-base` (NER off); multilingual → `xlm-roberta-base` (NER on); resource-constrained → `all-MiniLM-L6-v2`. |
| 124 | +2. **Serialize** entities with `[COL]/[VAL]` tokens; derive cluster IDs via Union-Find. |
| 125 | +3. **Train** via stored procedure on the GPU pool — delegate to `machine-learning` for setup and monitoring. |
| 126 | +4. **Block** by cosine ≥ 0.50, run threshold sweep against ground truth, materialize matches at optimal F1. |
| 127 | +5. **Add-on mode:** replace `AI_EMBED` in Tier 2 of Path A; use match/no-match thresholds with escalation to Tier 3 in between. |
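Two of the steps above, serialization and the threshold sweep, are easy to illustrate without a GPU. The sketch below makes several assumptions: the exact `[COL] name [VAL] value` layout is one plausible reading of the `[COL]/[VAL]` tokens, empty fields are skipped, and the sweep takes already-scored pairs with ground-truth labels. All function names are hypothetical.

```python
def serialize_entity(record: dict) -> str:
    """Flatten a record into one string with [COL]/[VAL] markers, skipping empty fields."""
    return " ".join(
        f"[COL] {col} [VAL] {val}"
        for col, val in record.items()
        if val not in (None, "")
    )


def threshold_sweep(scored_pairs, thresholds):
    """scored_pairs: list of (cosine_score, is_true_match). Returns one metrics row per threshold."""
    results = []
    for t in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= t and y)
        fp = sum(1 for s, y in scored_pairs if s >= t and not y)
        fn = sum(1 for s, y in scored_pairs if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append({"threshold": t, "precision": precision, "recall": recall, "f1": f1})
    return results


def best_threshold(results):
    """Pick the row with the highest F1 (the first one wins on ties)."""
    return max(results, key=lambda r: r["f1"])


if __name__ == "__main__":
    print(serialize_entity({"name": "Acme", "city": None, "zip": "10001"}))
    labeled = [(0.95, True), (0.85, True), (0.75, False), (0.70, True), (0.55, False)]
    print(best_threshold(threshold_sweep(labeled, [0.6, 0.8, 0.9])))
```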
| 128 | + |
| 129 | +⚠️ STOPPING POINT: Review threshold sweep, optimal F1/precision/recall, and (if add-on) comparison with `AI_EMBED` Tier 2. |
| 130 | + |
| 131 | +### Step 5: Human-in-the-Loop Review |
| 132 | + |
| 133 | +Delegate to `developing-with-streamlit`. Load `references/hitl-app.md`. App requirements: |
| 134 | + |
| 135 | +- Side-by-side source vs. resolved entity with field-level match indicators |
| 136 | +- Accept / reject / flag with optional comment |
| 137 | +- Sequential nav, progress tracking |
| 138 | +- Decisions persisted to `REVIEW_DECISIONS` |
| 139 | +- Material Design CSS, no emojis, no third-party MUI |
| 140 | + |
| 141 | +⚠️ STOPPING POINT: Deploy app and wait for the review cycle to complete. |
| 142 | + |
| 143 | +### Step 6: Operationalize |
| 144 | + |
| 145 | +Load `references/operationalize.md`. |
| 146 | + |
| 147 | +1. **Dynamic tables pipeline** — delegate to `dynamic-tables` (normalize → block → match cascade) |
| 148 | +2. **Entity master table** — golden record view aggregating best values per `entity_group_id` |
| 149 | +3. **Source quality monitoring** — delegate to `data-quality` |
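The entity master step aggregates "best values" per `entity_group_id`, but the text does not define "best". The sketch below assumes one simple rule, the most frequent non-empty value with ties broken by first occurrence. A real pipeline might instead prefer the most recent or most trusted source. `build_golden_records` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_golden_records(rows, fields):
    """rows: dicts with 'entity_group_id' plus attribute fields. Returns {group_id: golden_record}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["entity_group_id"]].append(row)

    golden = {}
    for group_id, members in grouped.items():
        record = {}
        for f in fields:
            # Ignore missing and empty values when choosing the survivor.
            values = [m[f] for m in members if m.get(f) not in (None, "")]
            # most_common keeps first-seen order among equal counts.
            record[f] = Counter(values).most_common(1)[0][0] if values else None
        golden[group_id] = record
    return golden


if __name__ == "__main__":
    rows = [
        {"entity_group_id": 1, "name": "Acme Corp", "phone": None},
        {"entity_group_id": 1, "name": "ACME", "phone": "555-0100"},
        {"entity_group_id": 1, "name": "Acme Corp", "phone": ""},
        {"entity_group_id": 2, "name": "Globex", "phone": None},
    ]
    print(build_golden_records(rows, ["name", "phone"]))
```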
| 150 | + |
| 151 | +Extensions: `cortex-agent` for NL queries over match results; `machine-learning` for a custom classifier replacing or augmenting Tier 2/3. |
| 152 | + |
| 153 | +⚠️ STOPPING POINT: User approves pipeline design before creating dynamic tables. |
| 154 | + |
| 155 | +## Benchmark Mode |
| 156 | + |
| 157 | +When the user selects benchmark in Step 0, run each chosen approach on the same labeled sample (200–500 stratified pairs; if absent, deploy a lightweight HITL labeling app). Produce a comparison table of precision, recall, F1, cost ($), and latency (sec). Then ask which approach to use for the full run. |
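The quality columns of the benchmark comparison can be computed from each approach's predicted match pairs and the labeled sample. This is a minimal sketch under assumptions: pairs are represented as `(a, b)` tuples with a consistent ordering, and `score_approach` and `benchmark` are hypothetical names. Cost and latency are measured separately and are not modeled here.

```python
def score_approach(predicted: set, truth: set) -> dict:
    """Precision, recall, and F1 of predicted match pairs against labeled true pairs."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def benchmark(approaches: dict, truth: set) -> list:
    """approaches: {name: set of predicted pairs}. Returns rows sorted by F1, best first."""
    rows = [{"approach": name, **score_approach(pred, truth)} for name, pred in approaches.items()]
    return sorted(rows, key=lambda r: r["f1"], reverse=True)


if __name__ == "__main__":
    truth = {(1, 2), (3, 4), (5, 6)}
    approaches = {
        "deterministic": {(1, 2)},
        "fuzzy": {(1, 2), (3, 4), (7, 8)},
    }
    for row in benchmark(approaches, truth):
        print(row)
```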
| 158 | + |
| 159 | +## Common Mistakes |
| 160 | + |
| 161 | +- **Skipping profiling** — jumping straight to matching without identifying authoritative IDs over-relies on fuzzy tiers and inflates cost. |
| 162 | +- **Coarse blocking** — any block > 100K pairs explodes the comparison space; tighten keys. |
| 163 | +- **Skipping Tier 1** — deterministic ID matches are free and high-precision; always run them first. |
| 164 | +- **Sending all pairs to Tier 3** — `AI_CLASSIFY` should only see Tier 2 `probable_match` rows. `match` and `no_match` are already decided. |
| 165 | +- **Hardcoded thresholds** — 0.92 / 0.80 are starting points. Tune in 0.02–0.03 steps against a labeled sample. |
| 166 | +- **No transitive resolution** — A=B and B=C implies A=C; missing this fragments entity groups. |
| 167 | +- **Cosine without name check** — high cosine on short text can match unrelated entities. Supplement with `JAROWINKLER_SIMILARITY` on names and streets. |
| 168 | +- **Unbudgeted agents** — Cortex Agents can recurse expensively. Enforce tool-call, token, and wall-clock limits per entity. |
| 169 | +- **Contrastive without ground truth** — fewer than ~500 labeled entities across ~200 clusters yields unstable embeddings; fall back to `AI_EMBED`. |
| 170 | +- **Reusing thresholds across domains** — pharma name matching is not retail product matching; recalibrate per domain profile. |
| 171 | + |
| 172 | +## Stopping Points |
| 173 | + |
| 174 | +- Step 0 — confirm discovery summary |
| 175 | +- Step 1 — confirm profiling and match-readiness |
| 176 | +- Step 1b — accept cost estimate |
| 177 | +- Step 2 — validate sample normalization |
| 178 | +- Step 3 — confirm blocking statistics |
| 179 | +- Step 4 — review match results and thresholds (or benchmark comparison) |
| 180 | +- Step 4b — review per-tier agentic results |
| 181 | +- Step 4c — review contrastive threshold sweep |
| 182 | +- Step 5 — wait for HITL review completion |
| 183 | +- Step 6 — approve pipeline design before operationalizing |
| 184 | + |
| 185 | +## Output |
| 186 | + |
| 187 | +- Discovery summary, profiling report, normalized entities table |
| 188 | +- Candidate pairs with blocking diagnostics (Path A) or top-N candidates (Path B) |
| 189 | +- Match results with confidence, tier attribution, `entity_group_id` |
| 190 | +- Crosswalk and entity discoveries (Path B) |
| 191 | +- Contrastive embeddings table and threshold sweep (Step 4c) |
| 192 | +- Benchmark comparison report (benchmark mode) |
| 193 | +- Streamlit review app (if HITL selected) |
| 194 | +- Dynamic tables pipeline and entity master table (if ongoing) |
| 195 | + |
| 196 | +## References |
| 197 | + |
| 198 | +See `references/profiles/` for domain profiles and `references/templates/` for SQL patterns: `normalization.md`, `blocking.md`, `matching.md`, `agentic-matching.md`, `search-service.md`, `agent-definition.md`, `orchestration.md`, `contrastive-embeddings.md`, `cost-estimator.md`, `incremental.md`. App spec: `references/hitl-app.md`. Operationalization: `references/operationalize.md`. |