Skip to content

Commit 2a4a49b

Browse files
nikosbosseclaude
andauthored
Add strategy and strategy_prompt params to dedupe (#110)
* Add strategy and strategy_prompt params to dedupe SDK - Add `strategy` parameter to `dedupe()` / `dedupe_async()`: `"identify"`, `"select"` (default), or `"combine"` - Add `strategy_prompt` parameter for guiding LLM selection/combining - Update generated `DedupeOperation` model with new fields - Convert strategy string to `DedupeOperationStrategy` enum before passing to generated model (prevents AttributeError on serialization) - Update docs with strategy examples - Add integration tests for each strategy mode Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Expand strategy and strategy_prompt docstrings for CC readability Address PR review feedback: make parameter descriptions more verbose so that Claude Code and similar tools can understand the full behavior of each strategy mode and how strategy_prompt interacts with them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 71ddee9 commit 2a4a49b

File tree

6 files changed

+246
-6
lines changed

6 files changed

+246
-6
lines changed

docs/reference/DEDUPE.md

Lines changed: 54 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -35,9 +35,60 @@ result = await dedupe(
3535
print(result.data.head())
3636
```
3737

38+
## Strategies
39+
40+
Control what happens after clusters are identified using the `strategy` parameter:
41+
42+
### `select` (default)
43+
44+
Picks the best representative from each cluster. Three columns are added:
45+
46+
- `equivalence_class_id` — rows with the same ID are duplicates of each other
47+
- `equivalence_class_name` — human-readable label for the cluster
48+
- `selected` — True for the canonical record in each cluster
49+
50+
```python
51+
result = await dedupe(
52+
input=crm_data,
53+
equivalence_relation="Same legal entity",
54+
strategy="select",
55+
strategy_prompt="Prefer the record with the most complete contact information",
56+
)
57+
deduped = result.data[result.data["selected"] == True]
58+
```
59+
60+
### `identify`
61+
62+
Cluster only — no selection or combining. Useful when you want to review clusters before deciding what to do.
63+
64+
- `equivalence_class_id` — rows with the same ID are duplicates of each other
65+
- `equivalence_class_name` — human-readable label for the cluster
66+
67+
```python
68+
result = await dedupe(
69+
input=crm_data,
70+
equivalence_relation="Same legal entity",
71+
strategy="identify",
72+
)
73+
```
74+
75+
### `combine`
76+
77+
Synthesizes a single combined row per cluster, merging the best information from all duplicates. Original rows are marked `selected=False`, and new combined rows are added with `selected=True`.
78+
79+
```python
80+
result = await dedupe(
81+
input=crm_data,
82+
equivalence_relation="Same legal entity",
83+
strategy="combine",
84+
strategy_prompt="For each field, keep the most recent and complete value",
85+
)
86+
combined = result.data[result.data["selected"] == True]
87+
```
88+
3889
## What you get back
3990

40-
Three columns added to your data:
91+
Three columns added to your data (when using `select` or `combine` strategy):
4192

4293
- `equivalence_class_id` — rows with the same ID are duplicates of each other
4394
- `equivalence_class_name` — human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
@@ -73,6 +124,8 @@ Output (selected rows only):
73124
|------|------|-------------|
74125
| `input` | DataFrame | Data with potential duplicates |
75126
| `equivalence_relation` | str | What makes two rows duplicates |
127+
| `strategy` | str | `"identify"`, `"select"` (default), or `"combine"` |
128+
| `strategy_prompt` | str | Optional instructions for selection or combining |
76129
| `session` | Session | Optional, auto-created if omitted |
77130

78131
## Performance

src/everyrow/generated/models/__init__.py

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,9 @@
33
from .agent_map_operation import AgentMapOperation
44
from .agent_map_operation_input_type_1_item import AgentMapOperationInputType1Item
55
from .agent_map_operation_input_type_2 import AgentMapOperationInputType2
6-
from .agent_map_operation_response_schema_type_0 import AgentMapOperationResponseSchemaType0
6+
from .agent_map_operation_response_schema_type_0 import (
7+
AgentMapOperationResponseSchemaType0,
8+
)
79
from .billing_response import BillingResponse
810
from .create_artifact_request import CreateArtifactRequest
911
from .create_artifact_request_data_type_0_item import CreateArtifactRequestDataType0Item
@@ -13,6 +15,7 @@
1315
from .dedupe_operation import DedupeOperation
1416
from .dedupe_operation_input_type_1_item import DedupeOperationInputType1Item
1517
from .dedupe_operation_input_type_2 import DedupeOperationInputType2
18+
from .dedupe_operation_strategy import DedupeOperationStrategy
1619
from .error_response import ErrorResponse
1720
from .error_response_details_type_0 import ErrorResponseDetailsType0
1821
from .http_validation_error import HTTPValidationError
@@ -39,7 +42,9 @@
3942
from .single_agent_operation import SingleAgentOperation
4043
from .single_agent_operation_input_type_1_item import SingleAgentOperationInputType1Item
4144
from .single_agent_operation_input_type_2 import SingleAgentOperationInputType2
42-
from .single_agent_operation_response_schema_type_0 import SingleAgentOperationResponseSchemaType0
45+
from .single_agent_operation_response_schema_type_0 import (
46+
SingleAgentOperationResponseSchemaType0,
47+
)
4348
from .task_result_response import TaskResultResponse
4449
from .task_result_response_data_type_0_item import TaskResultResponseDataType0Item
4550
from .task_result_response_data_type_1 import TaskResultResponseDataType1
@@ -61,6 +66,7 @@
6166
"DedupeOperation",
6267
"DedupeOperationInputType1Item",
6368
"DedupeOperationInputType2",
69+
"DedupeOperationStrategy",
6470
"ErrorResponse",
6571
"ErrorResponseDetailsType0",
6672
"HTTPValidationError",

src/everyrow/generated/models/dedupe_operation.py

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
from attrs import define as _attrs_define
88
from attrs import field as _attrs_field
99

10+
from ..models.dedupe_operation_strategy import DedupeOperationStrategy
1011
from ..types import UNSET, Unset
1112

1213
if TYPE_CHECKING:
@@ -26,11 +27,18 @@ class DedupeOperation:
2627
list of JSON objects
2728
equivalence_relation (str): Description of what makes two rows equivalent/duplicates
2829
session_id (None | Unset | UUID): Session ID. If not provided, a new session is auto-created for this task.
30+
strategy (DedupeOperationStrategy | Unset): Controls what happens after duplicate clusters are identified.
31+
IDENTIFY = cluster only (no selection/combining), SELECT = pick best representative per cluster (default),
32+
COMBINE = synthesize a merged row per cluster from all duplicates. Default: DedupeOperationStrategy.SELECT.
33+
strategy_prompt (None | str | Unset): Optional natural-language instructions guiding how the LLM selects or
34+
combines rows (only used with SELECT and COMBINE strategies). Example: "Prefer the most complete record".
2935
"""
3036

3137
input_: DedupeOperationInputType2 | list[DedupeOperationInputType1Item] | UUID
3238
equivalence_relation: str
3339
session_id: None | Unset | UUID = UNSET
40+
strategy: DedupeOperationStrategy | Unset = DedupeOperationStrategy.SELECT
41+
strategy_prompt: None | str | Unset = UNSET
3442
additional_properties: dict[str, Any] = _attrs_field(init=False, factory=dict)
3543

3644
def to_dict(self) -> dict[str, Any]:
@@ -56,6 +64,16 @@ def to_dict(self) -> dict[str, Any]:
5664
else:
5765
session_id = self.session_id
5866

67+
strategy: str | Unset = UNSET
68+
if not isinstance(self.strategy, Unset):
69+
strategy = self.strategy.value
70+
71+
strategy_prompt: None | str | Unset
72+
if isinstance(self.strategy_prompt, Unset):
73+
strategy_prompt = UNSET
74+
else:
75+
strategy_prompt = self.strategy_prompt
76+
5977
field_dict: dict[str, Any] = {}
6078
field_dict.update(self.additional_properties)
6179
field_dict.update(
@@ -66,6 +84,10 @@ def to_dict(self) -> dict[str, Any]:
6684
)
6785
if session_id is not UNSET:
6886
field_dict["session_id"] = session_id
87+
if strategy is not UNSET:
88+
field_dict["strategy"] = strategy
89+
if strategy_prompt is not UNSET:
90+
field_dict["strategy_prompt"] = strategy_prompt
6991

7092
return field_dict
7193

@@ -125,10 +147,28 @@ def _parse_session_id(data: object) -> None | Unset | UUID:
125147

126148
session_id = _parse_session_id(d.pop("session_id", UNSET))
127149

150+
_strategy = d.pop("strategy", UNSET)
151+
strategy: DedupeOperationStrategy | Unset
152+
if isinstance(_strategy, Unset):
153+
strategy = UNSET
154+
else:
155+
strategy = DedupeOperationStrategy(_strategy)
156+
157+
def _parse_strategy_prompt(data: object) -> None | str | Unset:
158+
if data is None:
159+
return data
160+
if isinstance(data, Unset):
161+
return data
162+
return cast(None | str | Unset, data)
163+
164+
strategy_prompt = _parse_strategy_prompt(d.pop("strategy_prompt", UNSET))
165+
128166
dedupe_operation = cls(
129167
input_=input_,
130168
equivalence_relation=equivalence_relation,
131169
session_id=session_id,
170+
strategy=strategy,
171+
strategy_prompt=strategy_prompt,
132172
)
133173

134174
dedupe_operation.additional_properties = d
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
from enum import Enum
2+
3+
4+
class DedupeOperationStrategy(str, Enum):
5+
COMBINE = "combine"
6+
IDENTIFY = "identify"
7+
SELECT = "select"
8+
9+
def __str__(self) -> str:
10+
return str(self.value)

src/everyrow/ops.py

Lines changed: 35 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@
2525
CreateArtifactRequestDataType1,
2626
DedupeOperation,
2727
DedupeOperationInputType1Item,
28+
DedupeOperationStrategy,
2829
LLMEnumPublic,
2930
MergeOperation,
3031
MergeOperationLeftInputType1Item,
@@ -633,16 +634,39 @@ async def dedupe(
633634
equivalence_relation: str,
634635
session: Session | None = None,
635636
input: DataFrame | UUID | TableResult | None = None,
637+
strategy: Literal["identify", "select", "combine"] | None = None,
638+
strategy_prompt: str | None = None,
636639
) -> TableResult:
637640
"""Dedupe a table by removing duplicates using AI.
638641
639642
Args:
640-
equivalence_relation: Description of what makes items equivalent
643+
equivalence_relation: Natural-language description of what makes two rows
644+
equivalent/duplicates. Be as specific as needed — the LLM uses this to
645+
reason about equivalence, handling abbreviations, typos, name variations,
646+
and entity relationships that string matching cannot capture.
641647
session: Optional session. If not provided, one will be created automatically.
642-
input: The input table (DataFrame, UUID, or TableResult)
648+
input: The input table (DataFrame, UUID, or TableResult).
649+
strategy: Controls what happens after duplicate clusters are identified.
650+
- "identify": Cluster only. Adds `equivalence_class_id` and
651+
`equivalence_class_name` columns but does NOT select or remove any rows.
652+
Use this when you want to review clusters before deciding what to do.
653+
- "select" (default): Picks the best representative row from each cluster.
654+
Adds `equivalence_class_id`, `equivalence_class_name`, and `selected`
655+
columns. Rows with `selected=True` are the canonical records. To get the
656+
deduplicated table: `result.data[result.data["selected"] == True]`.
657+
- "combine": Synthesizes a single combined row per cluster by merging the
658+
best information from all duplicates. Original rows are kept with
659+
`selected=False`, and new combined rows are appended with `selected=True`.
660+
Useful when no single row has all the information (e.g., one row has the
661+
email, another has the phone number).
662+
strategy_prompt: Optional natural-language instructions that guide how the LLM
663+
selects or combines rows. Only used with "select" and "combine" strategies.
664+
Examples: "Prefer the record with the most complete contact information",
665+
"For each field, keep the most recent and complete value",
666+
"Prefer records from the CRM system over spreadsheet imports".
643667
644668
Returns:
645-
TableResult containing the deduped table
669+
TableResult containing the deduped table with cluster metadata columns.
646670
"""
647671
if input is None:
648672
raise EveryrowError("input is required for dedupe")
@@ -652,6 +676,8 @@ async def dedupe(
652676
session=internal_session,
653677
input=input,
654678
equivalence_relation=equivalence_relation,
679+
strategy=strategy,
680+
strategy_prompt=strategy_prompt,
655681
)
656682
result = await cohort_task.await_result()
657683
if isinstance(result, TableResult):
@@ -661,6 +687,8 @@ async def dedupe(
661687
session=session,
662688
input=input,
663689
equivalence_relation=equivalence_relation,
690+
strategy=strategy,
691+
strategy_prompt=strategy_prompt,
664692
)
665693
result = await cohort_task.await_result()
666694
if isinstance(result, TableResult):
@@ -672,6 +700,8 @@ async def dedupe_async(
672700
session: Session,
673701
input: DataFrame | UUID | TableResult,
674702
equivalence_relation: str,
703+
strategy: Literal["identify", "select", "combine"] | None = None,
704+
strategy_prompt: str | None = None,
675705
) -> EveryrowTask[BaseModel]:
676706
"""Submit a dedupe task asynchronously."""
677707
input_data = _prepare_table_input(input, DedupeOperationInputType1Item)
@@ -680,6 +710,8 @@ async def dedupe_async(
680710
input_=input_data, # type: ignore
681711
equivalence_relation=equivalence_relation,
682712
session_id=session.session_id,
713+
strategy=DedupeOperationStrategy(strategy) if strategy is not None else UNSET,
714+
strategy_prompt=strategy_prompt if strategy_prompt is not None else UNSET,
683715
)
684716

685717
response = await dedupe_operations_dedupe_post.asyncio(client=session.client, body=body)

tests/integration/test_dedupe.py

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -102,3 +102,102 @@ async def test_dedupe_unique_items_all_selected():
102102
assert result.data["equivalence_class_id"].nunique() == 3
103103
# All should be selected since they're unique
104104
assert result.data["selected"].all() # pyright: ignore[reportGeneralTypeIssues]
105+
106+
107+
async def test_dedupe_identify_strategy_no_selection():
108+
"""Test that identify strategy clusters but does not add a 'selected' column."""
109+
input_df = pd.DataFrame(
110+
[
111+
{"title": "Paper A - Preprint", "version": "v1"},
112+
{"title": "Paper A - Published", "version": "v2"},
113+
{"title": "Paper B", "version": "v1"},
114+
]
115+
)
116+
117+
result = await dedupe(
118+
equivalence_relation="""
119+
Two entries are duplicates if they are versions of the same paper.
120+
"Paper A - Preprint" and "Paper A - Published" are the same paper.
121+
""",
122+
input=input_df,
123+
strategy="identify",
124+
)
125+
126+
assert isinstance(result, TableResult)
127+
assert "equivalence_class_id" in result.data.columns
128+
# identify strategy should NOT add a selected column
129+
assert "selected" not in result.data.columns
130+
131+
132+
async def test_dedupe_combine_strategy_creates_combined_rows():
133+
"""Test that combine strategy produces combined rows marked as selected."""
134+
input_df = pd.DataFrame(
135+
[
136+
{"title": "Paper A - Preprint", "year": "2023"},
137+
{"title": "Paper A - Published", "year": "2024"},
138+
{"title": "Paper B", "year": "2023"},
139+
]
140+
)
141+
142+
result = await dedupe(
143+
equivalence_relation="""
144+
Two entries are duplicates if they are versions of the same paper.
145+
"Paper A - Preprint" and "Paper A - Published" are the same paper.
146+
""",
147+
input=input_df,
148+
strategy="combine",
149+
)
150+
151+
assert isinstance(result, TableResult)
152+
assert "equivalence_class_id" in result.data.columns
153+
assert "selected" in result.data.columns
154+
155+
# There should be at least one selected row
156+
selected_rows = result.data[result.data["selected"] == True] # noqa: E712
157+
assert len(selected_rows) >= 1
158+
159+
160+
async def test_dedupe_select_strategy_explicit():
161+
"""Test that explicitly passing strategy='select' works the same as the default."""
162+
input_df = pd.DataFrame(
163+
[
164+
{"item": "Apple"},
165+
{"item": "Banana"},
166+
]
167+
)
168+
169+
result = await dedupe(
170+
equivalence_relation="Items are duplicates only if they are the exact same fruit name.",
171+
input=input_df,
172+
strategy="select",
173+
)
174+
175+
assert isinstance(result, TableResult)
176+
assert "equivalence_class_id" in result.data.columns
177+
assert "selected" in result.data.columns
178+
# All unique items should be selected
179+
assert result.data["selected"].all() # pyright: ignore[reportGeneralTypeIssues]
180+
181+
182+
async def test_dedupe_with_strategy_prompt():
183+
"""Test that strategy_prompt parameter is accepted."""
184+
input_df = pd.DataFrame(
185+
[
186+
{"title": "Paper A - Preprint", "version": "v1"},
187+
{"title": "Paper A - Published", "version": "v2"},
188+
{"title": "Paper B", "version": "v1"},
189+
]
190+
)
191+
192+
result = await dedupe(
193+
equivalence_relation="""
194+
Two entries are duplicates if they are versions of the same paper.
195+
"Paper A - Preprint" and "Paper A - Published" are the same paper.
196+
""",
197+
input=input_df,
198+
strategy="select",
199+
strategy_prompt="Always prefer the published version over the preprint.",
200+
)
201+
202+
assert isinstance(result, TableResult)
203+
assert "selected" in result.data.columns

0 commit comments

Comments
 (0)