Commit 0e2273a

Promote entity-resolution with v1.1.1 audit pass
Re-staged with v1.1.1 holistic prompt that adds stopping-point markers, correct INSTRUCTIONS.md sub-flow cross-refs, and drops invalid tool snowflake_object_search.
1 parent 49dd772 commit 0e2273a

46 files changed

Lines changed: 8816 additions & 0 deletions

skills/entity-resolution/INTEGRATION-TESTING.md

Lines changed: 838 additions & 0 deletions
Large diffs are not rendered by default.

skills/entity-resolution/LICENSE

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
Snowflake Skills License
2+
3+
© 2026 Snowflake Inc. All rights reserved.
4+
5+
LICENSE: Use of these materials (including all code, prompts, assets, files, and other components of these skills (collectively, “Skills”)) is governed by your agreement with Snowflake for the Service. If no separate agreement exists, use is governed by Snowflake’s Terms of Service (available at: https://www.snowflake.com/en/legal/terms-of-service/).
6+
7+
Your applicable agreement is referred to as the "Agreement." "Service" is as defined in the Agreement.
8+
9+
ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the contrary, you may not:
10+
11+
* Extract from the Service or retain copies of the Skills outside use with the Service;
12+
* Reproduce or copy the Skills, except for temporary copies created automatically during authorized use of the Service;
13+
* Create derivative works based on the Skills;
14+
* Distribute, sublicense, or transfer the Skills to any third party;
15+
* Make, offer to sell, sell, or import any inventions embodied in the Skills; nor,
16+
* Reverse engineer, decompile, or disassemble the Skills.
17+
18+
The receipt, viewing, or possession of the Skills does not convey or imply any license or right beyond those expressly granted above.
19+
20+
Snowflake retains all rights, title, and interest in the Skills, including all copyrights, trademarks, patents, and all other applicable intellectual property rights.
21+
22+
THE SKILLS ARE PROVIDED “AS IS,” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SKILLS OR THE USE OR OTHER DEALINGS IN THE SKILLS.

skills/entity-resolution/SKILL.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
1+
---
2+
name: entity-resolution
3+
title: Resolve Entities
4+
summary: End-to-end entity resolution pipeline using Snowflake Cortex AI Functions to match, link, and dedupe records.
5+
description: Use when you need to match records across datasets, deduplicate within a dataset, build a golden record, or link source records to a reference corpus. Orchestrates profiling, normalization, blocking, multi-tier matching (deterministic, fuzzy, AI-judged, agentic, contrastive), human review, and operationalization via dynamic tables. Industry-agnostic with optional domain profiles for pharma, financial services, retail/CPG, healthcare, and insurance. Triggers: entity resolution, record matching, deduplication, record linkage, fuzzy matching, golden record, master data, MDM, merge records, match entities, link records, dedupe, duplicate detection, identity resolution.
6+
tools:
7+
- snowflake_sql_execute
8+
- Bash
9+
- Read
10+
- Write
11+
- Edit
12+
- Glob
13+
- Grep
14+
prompt: Help me resolve entities across my customer and prospect tables.
15+
language: en
16+
status: Published
17+
author: Snowflake Solutions Team
18+
type: snowflake
19+
---
20+
21+
# Resolve Entities
22+
23+
## Overview
24+
25+
Determine whether records refer to the same real-world entity. This skill orchestrates Cortex AI Functions, dynamic tables, Streamlit, and optionally Cortex Agents and Snowflake ML through a structured pipeline: profile → normalize → block → match → score → review → operationalize.
26+
27+
Two workflow paths:
28+
29+
- **Path A — Pair-Based Matching (Deduplication / Cross-Match):** generate candidate pairs via blocking, score through 3 tiers.
30+
- **Path B — Agentic Matching (Entity Linking):** resolve source records against a reference corpus via tiered escalation.
31+
32+
Optional add-on:
33+
34+
- **Contrastive Embeddings:** train a domain-adapted encoder via SupConLoss when labeled data and GPU compute are available.
35+
36+
## When to Use
37+
38+
- Match or link records across datasets
39+
- Deduplicate records within a single dataset
40+
- Build a golden record from multiple sources
41+
- Assess match-readiness of source data
42+
- Operationalize an ongoing matching pipeline
43+
44+
For unstructured inputs (PDFs, scans), call `cortex-ai-functions` (`AI_EXTRACT`, `AI_PARSE_DOCUMENT`) first, then feed structured output here.
45+
46+
## Domain Profiles
47+
48+
Load the matching profile from `references/profiles/`:
49+
50+
| Keywords | Profile | Tier 1 IDs |
51+
|---|---|---|
52+
| pharma, NPI, DEA, NCPDP | `pharma.md` | NPI, DEA, NCPDP |
53+
| bank, KYC, AML, LEI | `financial-services.md` | LEI, DUNS, Tax ID, CRD |
54+
| retail, CPG, GTIN, UPC | `retail-cpg.md` | GTIN/UPC, GLN, Supplier ID |
55+
| provider, hospital, taxonomy | `healthcare-provider.md` | NPI, Taxonomy Code |
56+
| insurance, payer, NAIC | `insurance-payer.md` | NAIC, CMS Payer ID, Plan ID |
57+
| *(none)* | `generic.md` | Name + address only |
58+
59+
## Workflow
60+
61+
### Step 0: Discovery
62+
63+
Use `ask_user_question` to collect: goal (dedupe / cross-match / link to reference / profile-only), domain, source and reference tables, volume, output schema, HITL preference, pipeline cadence (one-time vs ongoing), data language, and approach (recommended / specific / benchmark multiple). Present a discovery summary.
64+
65+
⚠️ STOPPING POINT: User confirms the summary before any work begins.
66+
67+
### Step 1: Profiling
68+
69+
Delegate to `profiling-tables`. ER-specific checks:
70+
71+
1. Identifier detection from the loaded domain profile (Tier 1 candidates)
72+
2. Name/address column detection via `AI_CLASSIFY`
73+
3. Completeness, format consistency, duplicate density, volume
74+
75+
⚠️ STOPPING POINT: Present profiling report and match-readiness checklist.
76+
77+
### Step 1b: Cost Estimation
78+
79+
Load `references/templates/cost-estimator.md`. Estimate based on volume, path, expected tier distribution, and warehouse sizing.
80+
81+
⚠️ STOPPING POINT: User acknowledges and accepts the estimate.
82+
83+
### Step 2: Normalization
84+
85+
Delegate to `cortex-ai-functions` for `AI_EXTRACT` (address parsing) and `AI_COMPLETE` (edge-case names). Load `references/templates/normalization.md`. Materialize `normalized_entities` with `source_id`, normalized fields, raw originals, and `blocking_key`.
86+
87+
⚠️ STOPPING POINT: Validate 10–20 sample rows, NULL counts per normalized field, and identifier format compliance.
88+
89+
### Step 3: Blocking
90+
91+
Load `references/templates/blocking.md`. Reduce O(n²) pair space via blocking keys (geographic, category, phonetic, or ID prefix). Self-join within blocks: `a.source_id < b.source_id`.
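The blocking step can be illustrated with a minimal Python sketch. It is not the SQL in `references/templates/blocking.md`; it only mirrors the logic described above, assuming each record carries a unique `source_id` and a precomputed `blocking_key`. The function name `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Pair records only within the same block.

    `records` is an iterable of dicts with `source_id` and `blocking_key`
    (assumed unique ids; hypothetical in-memory stand-in for the table).
    """
    blocks = defaultdict(list)
    for rec in records:
        if rec["blocking_key"] is not None:  # records without a key are not paired
            blocks[rec["blocking_key"]].append(rec["source_id"])

    pairs = []
    for ids in blocks.values():
        # Sorted ids + combinations emit each unordered pair exactly once,
        # the same effect as the a.source_id < b.source_id join condition.
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


records = [
    {"source_id": 1, "blocking_key": "94107"},
    {"source_id": 2, "blocking_key": "94107"},
    {"source_id": 3, "blocking_key": "94107"},
    {"source_id": 4, "blocking_key": "10001"},
]
pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs, f"{len(pairs)} of {all_pairs} possible pairs")
```

Comparing `len(pairs)` with the unblocked pair count gives a quick sense of how much of the O(n²) space the chosen key removes.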
92+
93+
⚠️ STOPPING POINT: Review blocking statistics. Red flags: any block > 100K pairs, reduction ratio > 10%, or very few pairs.
94+
95+
### Step 4: Multi-Tier Matching (Path A)
96+
97+
Load `references/templates/matching.md`.
98+
99+
- **Tier 1 — Deterministic:** exact match on authoritative IDs. Pure SQL, no AI cost.
100+
- **Tier 2 — Fuzzy:** `AI_EMBED` (`snowflake-arctic-embed-l-v2.0`) + `VECTOR_COSINE_SIMILARITY`, supplemented by `JAROWINKLER_SIMILARITY` on name/street. Starting thresholds: ≥ 0.92 match, ≥ 0.80 probable_match, < 0.80 no_match.
101+
- **Tier 3 — AI-Judged:** `AI_CLASSIFY` on Tier 2 `probable_match` rows only (cost control).
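The tier cascade above can be sketched in Python as a routing function. This is an illustration of the decision logic only, not the contents of `references/templates/matching.md`; the field names `id_a`, `id_b`, and `similarity` are assumptions, and the thresholds are the starting values quoted in the text.

```python
MATCH_THRESHOLD = 0.92     # starting point from the text; tune against labels
PROBABLE_THRESHOLD = 0.80  # starting point from the text; tune against labels


def tier2_decision(score, match_t=MATCH_THRESHOLD, probable_t=PROBABLE_THRESHOLD):
    """Map a similarity score to the three Tier 2 decisions."""
    if score >= match_t:
        return "match"
    if score >= probable_t:
        return "probable_match"
    return "no_match"


def route(pair):
    """Apply Tier 1 first, then Tier 2; flag only probable_match for Tier 3.

    `pair` is a dict with hypothetical keys: `id_a`, `id_b` (authoritative
    identifiers, possibly None) and `similarity` (a Tier 2 score).
    """
    if pair["id_a"] is not None and pair["id_a"] == pair["id_b"]:
        return {"tier": 1, "decision": "match", "send_to_tier3": False}
    decision = tier2_decision(pair["similarity"])
    return {
        "tier": 2,
        "decision": decision,
        "send_to_tier3": decision == "probable_match",
    }


print(route({"id_a": "1234567893", "id_b": "1234567893", "similarity": 0.41}))
print(route({"id_a": None, "id_b": None, "similarity": 0.86}))
```

Keeping the thresholds as parameters makes it straightforward to re-run the routing while tuning in small increments.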
102+
103+
Resolve transitive matches and assign `entity_group_id` to connected components.
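One common way to resolve transitive matches is a union-find (disjoint-set) pass over the matched pairs. The Python sketch below is an illustration under that assumption, not the skill's SQL; using the smallest `source_id` in each component as the `entity_group_id` is an arbitrary choice made here for determinism.

```python
def assign_entity_groups(matched_pairs):
    """Return {source_id: entity_group_id} for every id that appears in a pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so group ids are deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for a, b in matched_pairs:
        union(a, b)

    return {node: find(node) for node in parent}


# A=B and B=C implies A=C: ids 1, 2, 3 end up in one group.
groups = assign_entity_groups([(1, 2), (2, 3), (7, 8)])
print(groups)
```

Records that never appear in a matched pair are not in the result; they would each form a singleton group.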
104+
105+
⚠️ STOPPING POINT: Review counts by tier and decision, sample rows, and threshold tuning (adjust in 0.02–0.03 increments against a labeled sample).
106+
107+
### Step 4b: Agentic Matching (Path B)
108+
109+
Load `references/templates/agentic-matching.md`. Prerequisites: normalized entities, embeddings on source and reference, top-N candidates, Cortex Search Service over the reference corpus (`references/templates/search-service.md`), and a semantic model YAML.
110+
111+
- **Tier 1 — High-Confidence Triage:** cosine + Jaro-Winkler with domain guards (chain stores, multi-tenant buildings, address floor).
112+
- **Tier 1.5 — Search + Classify:** Cortex Search top-N then `AI_COMPLETE` (cost-effective model). Confidence ≥ 0.80 + name/address alignment.
113+
- **Tier 2 — Cortex Agent:** 3 tools — `cortex_search` (primary), `cortex_analyst_text_to_sql` (fallback), `web_search` (last resort). Budget: 6 tool calls, 90s, 16K tokens per entity. See `references/templates/agent-definition.md` and `references/templates/orchestration.md`. Delegate to `cortex-agent`.
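The per-entity budget for the agent tier can be sketched as a small guard object. This is a hypothetical Python illustration, not the Cortex Agent API or the contents of `references/templates/orchestration.md`: the class and method names are invented here, and "16K tokens" is assumed to mean 16,000.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when an entity's resolution exceeds one of its limits."""


@dataclass
class EntityBudget:
    max_tool_calls: int = 6     # limit stated in the text
    max_seconds: float = 90.0   # limit stated in the text
    max_tokens: int = 16_000    # assumption: "16K" read as 16,000
    tool_calls: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used):
        """Record one tool call and raise if any limit is now exceeded."""
        self.tool_calls += 1
        self.tokens += tokens_used
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock limit")


budget = EntityBudget()
outcome = "ok"
try:
    for _ in range(7):  # the seventh call exceeds the 6-call limit
        budget.charge(tokens_used=500)
except BudgetExceeded as exc:
    outcome = str(exc)
print(outcome)
```

An orchestration loop would call `charge` after each tool invocation and stop working on the entity once `BudgetExceeded` is raised.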
114+
115+
UNION ALL all tiers into a crosswalk with tier attribution. Record entities confirmed active but missing from the reference corpus as discoveries.
116+
117+
⚠️ STOPPING POINT: Review per-tier match/closure/discovery distributions and web search usage.
118+
119+
### Step 4c: Contrastive Embeddings (Standalone or Tier 2 Replacement)
120+
121+
Load `references/templates/contrastive-embeddings.md`. Prerequisites: ≥ 500 labeled entities across ≥ 200 clusters, GPU pool (`GPU_NV_S`), `PYPI_EAI` and `HF_EAI` external access integrations.
122+
123+
1. **Model selection:** English-only → `roberta-base` (NER off); multilingual → `xlm-roberta-base` (NER on); resource-constrained → `all-MiniLM-L6-v2`.
124+
2. **Serialize** entities with `[COL]/[VAL]` tokens; derive cluster IDs via Union-Find.
125+
3. **Train** via stored procedure on the GPU pool — delegate to `machine-learning` for setup and monitoring.
126+
4. **Block** by cosine ≥ 0.50, run threshold sweep against ground truth, materialize matches at optimal F1.
127+
5. **Add-on mode:** replace `AI_EMBED` in Tier 2 of Path A; use match/no-match thresholds with escalation to Tier 3 in between.
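The threshold sweep in step 4 can be sketched as follows. This is a minimal Python illustration assuming pairs have already been scored and labeled; the sample data and the function name `sweep_thresholds` are made up for the example and do not come from `references/templates/contrastive-embeddings.md`.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Compute precision, recall, and F1 at each candidate threshold.

    `scored_pairs` is a list of (similarity, is_true_match) tuples.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= t and y)
        fp = sum(1 for s, y in scored_pairs if s >= t and not y)
        fn = sum(1 for s, y in scored_pairs if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


# Illustrative labeled sample, not real results.
labeled = [(0.95, True), (0.90, True), (0.85, False),
           (0.70, True), (0.60, False), (0.55, False)]
rows = sweep_thresholds(labeled, [0.50, 0.60, 0.70, 0.80, 0.90])
best = max(rows, key=lambda r: r["f1"])
print(best)
```

On this toy sample the best F1 falls at 0.70; with real data the sweep grid and the ground-truth set determine where the optimum lands.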
128+
129+
⚠️ STOPPING POINT: Review threshold sweep, optimal F1/precision/recall, and (if add-on) comparison with `AI_EMBED` Tier 2.
130+
131+
### Step 5: Human-in-the-Loop Review
132+
133+
Delegate to `developing-with-streamlit`. Load `references/hitl-app.md`. App requirements:
134+
135+
- Side-by-side source vs. resolved entity with field-level match indicators
136+
- Accept / reject / flag with optional comment
137+
- Sequential nav, progress tracking
138+
- Decisions persisted to `REVIEW_DECISIONS`
139+
- Material Design CSS, no emojis, no third-party MUI
140+
141+
⚠️ STOPPING POINT: Deploy app and wait for the review cycle to complete.
142+
143+
### Step 6: Operationalize
144+
145+
Load `references/operationalize.md`.
146+
147+
1. **Dynamic tables pipeline** — delegate to `dynamic-tables` (normalize → block → match cascade)
148+
2. **Entity master table** — golden record view aggregating best values per `entity_group_id`
149+
3. **Source quality monitoring** — delegate to `data-quality`
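The entity master table in item 2 can be illustrated with a short Python sketch. The text does not define "best values", so this example assumes one possible rule: the most frequent non-empty value per field within each `entity_group_id`, with ties going to the value seen first. The function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def golden_records(rows, fields):
    """Build one record per entity_group_id from its member rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["entity_group_id"]].append(row)

    master = {}
    for group_id, members in grouped.items():
        record = {}
        for name in fields:
            values = [m.get(name) for m in members
                      if m.get(name) not in (None, "")]
            # most_common orders equal counts by first occurrence.
            record[name] = Counter(values).most_common(1)[0][0] if values else None
        master[group_id] = record
    return master


rows = [
    {"entity_group_id": 1, "name": "Acme Corp", "phone": None},
    {"entity_group_id": 1, "name": "Acme Corp", "phone": "555-0100"},
    {"entity_group_id": 1, "name": "ACME", "phone": "555-0100"},
    {"entity_group_id": 7, "name": "Globex", "phone": ""},
]
master = golden_records(rows, ["name", "phone"])
print(master)
```

Other survivorship rules, such as preferring the most recent or the most trusted source, would replace the `Counter` step.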
150+
151+
Extensions: `cortex-agent` for NL queries over match results; `machine-learning` for a custom classifier replacing or augmenting Tier 2/3.
152+
153+
⚠️ STOPPING POINT: User approves pipeline design before creating dynamic tables.
154+
155+
## Benchmark Mode
156+
157+
When the user selects benchmark in Step 0, run each chosen approach on the same labeled sample (200–500 stratified pairs; if absent, deploy a lightweight HITL labeling app). Produce a comparison table of precision, recall, F1, cost ($), and latency (sec). Then ask which approach to use for the full run.
158+
159+
## Common Mistakes
160+
161+
- **Skipping profiling** — jumping straight to matching without identifying authoritative IDs over-relies on fuzzy tiers and inflates cost.
162+
- **Coarse blocking** — any block > 100K pairs explodes the comparison space; tighten keys.
163+
- **Skipping Tier 1** — deterministic ID matches are free and high-precision; always run them first.
164+
- **Sending all pairs to Tier 3:** `AI_CLASSIFY` should only see Tier 2 `probable_match` rows. `match` and `no_match` are already decided.
165+
- **Hardcoded thresholds** — 0.92 / 0.80 are starting points. Tune in 0.02–0.03 steps against a labeled sample.
166+
- **No transitive resolution** — A=B and B=C implies A=C; missing this fragments entity groups.
167+
- **Cosine without name check** — high cosine on short text can match unrelated entities. Supplement with `JAROWINKLER_SIMILARITY` on names and streets.
168+
- **Unbudgeted agents** — Cortex Agents can recurse expensively. Enforce tool-call, token, and wall-clock limits per entity.
169+
- **Contrastive without ground truth** — fewer than ~500 labeled entities across ~200 clusters yields unstable embeddings; fall back to `AI_EMBED`.
170+
- **Reusing thresholds across domains** — pharma name matching is not retail product matching; recalibrate per domain profile.
171+
172+
## Stopping Points
173+
174+
- Step 0 — confirm discovery summary
175+
- Step 1 — confirm profiling and match-readiness
176+
- Step 1b — accept cost estimate
177+
- Step 2 — validate sample normalization
178+
- Step 3 — confirm blocking statistics
179+
- Step 4 — review match results and thresholds (or benchmark comparison)
180+
- Step 4b — review per-tier agentic results
181+
- Step 4c — review contrastive threshold sweep
182+
- Step 5 — wait for HITL review completion
183+
- Step 6 — approve pipeline design before operationalizing
184+
185+
## Output
186+
187+
- Discovery summary, profiling report, normalized entities table
188+
- Candidate pairs with blocking diagnostics (Path A) or top-N candidates (Path B)
189+
- Match results with confidence, tier attribution, `entity_group_id`
190+
- Crosswalk and entity discoveries (Path B)
191+
- Contrastive embeddings table and threshold sweep (Step 4c)
192+
- Benchmark comparison report (benchmark mode)
193+
- Streamlit review app (if HITL selected)
194+
- Dynamic tables pipeline and entity master table (if ongoing)
195+
196+
## References
197+
198+
See `references/profiles/` for domain profiles and `references/templates/` for SQL patterns: `normalization.md`, `blocking.md`, `matching.md`, `agentic-matching.md`, `search-service.md`, `agent-definition.md`, `orchestration.md`, `contrastive-embeddings.md`, `cost-estimator.md`, `incremental.md`. App spec: `references/hitl-app.md`. Operationalization: `references/operationalize.md`.
