README.md (23 additions & 0 deletions)
Merge output behavior with multiple datasets:
- Default (`run` with `execution_params.merge: true`, or `merge` without `--output-root`): each dataset is merged to its own `<dataset.output_dir>/merged`.
- Shared root (`merge --output-root ...`): one merged subdirectory is created per dataset under the root.

### Fuzzy deduplication (optional)

After merging, MMIRAGE can drop near-duplicate rows using character n-gram MinHash + LSH. This is CPU-only and uses the lightweight `datasketch` package.

Install the optional extra:
```bash
pip install -e '.[dedup]'
```
Enable in your YAML config:

```yaml
deduplication_params:
  enabled: true
  text_field: text
  threshold: 0.85      # Jaccard similarity threshold
  num_perm: 128        # MinHash signature size
  shingle_size: 5      # character n-gram size
```
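To make the three tuning parameters concrete, here is a minimal standard-library sketch of character n-gram MinHash. It is not MMIRAGE's implementation: it swaps the `datasketch` LSH index for a brute-force pairwise comparison, and the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `drop_near_duplicates`) are hypothetical, chosen only for illustration.

```python
import hashlib


def shingles(text, shingle_size=5):
    """Return the set of character n-grams (shingles) in `text`."""
    if len(text) <= shingle_size:
        return {text}
    return {text[i:i + shingle_size] for i in range(len(text) - shingle_size + 1)}


def minhash_signature(text, num_perm=128, shingle_size=5):
    """Build a MinHash signature: the minimum hash value under each of
    `num_perm` differently salted hash functions."""
    grams = shingles(text, shingle_size)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated permutation
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for g in grams
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def drop_near_duplicates(rows, text_field="text", threshold=0.85,
                         num_perm=128, shingle_size=5):
    """Keep the first row of each near-duplicate group (brute force, O(n^2))."""
    kept, kept_sigs = [], []
    for row in rows:
        sig = minhash_signature(row[text_field], num_perm, shingle_size)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # too similar to a row we already kept
        kept.append(row)
        kept_sigs.append(sig)
    return kept
```

In this sketch a larger `num_perm` tightens the similarity estimate at the cost of more hashing, a smaller `shingle_size` makes short texts look more alike, and a lower `threshold` drops rows more aggressively. The pairwise loop is only practical for small inputs; an LSH index such as `datasketch`'s `MinHashLSH` exists to avoid comparing every pair.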
Dedup runs as part of `mmirage merge --config <cfg>` and as part of `mmirage run` when `execution_params.merge: true`. With `enabled: false` (the default), the dedup module is not imported, so it adds no overhead.
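The "not imported when disabled" behavior is commonly achieved with a deferred import. The sketch below shows that general pattern only; the function `maybe_deduplicate`, the module name `mmirage_dedup_hypothetical`, and its `deduplicate` entry point are assumptions for illustration and are not taken from the MMIRAGE code base.

```python
import importlib
import sys


def maybe_deduplicate(rows, params, module_name="mmirage_dedup_hypothetical"):
    """Run deduplication only when enabled, importing the module on demand."""
    if not params.get("enabled", False):
        return rows  # fast path: the dedup module is never imported
    # Deferred import: the optional dependency is loaded only on this branch.
    dedup = importlib.import_module(module_name)
    options = {key: value for key, value in params.items() if key != "enabled"}
    return dedup.deduplicate(rows, **options)  # hypothetical entry point


if __name__ == "__main__":
    data = [{"text": "hello"}]
    result = maybe_deduplicate(data, {"enabled": False})
    print(result is data)
    print("mmirage_dedup_hypothetical" in sys.modules)
```

With this structure, users who never enable deduplication do not need the optional extra installed, because the import statement is never reached.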
### Multimodal: Processing images with VLMs

MMIRAGE supports multimodal processing with vision-language models: