feat: add fuzzy deduplication post-processing #37
|
Hi @HarshaSatyavardhan, in the future it would be great to wait for approval of the issues you submit before spending time implementing them. You can ping me again if several days pass with no sign of life from me. I'm busy with several different projects and can't monitor MMIRAGE every day, but I should check at least once a week.
|
After some thought, I think the embedding-based similarity is overkill. Character-based similarity should be more accurate, faster, and less computationally intensive.
84201b8 to 9f4cb0b
|
@fabnemEPFL I switched to character-based fuzzy dedup as you suggested. It's now char n-gram MinHash + LSH via […]. I also rebased onto current […]. One thing worth flagging: this branch is rebased on […]
|
Hi Harsha, thanks for the work on this. After trying this on the test dataset you provided and realizing it was still quite slow, I think I found an idea to make the algorithm faster: |
|
Addressed the slowness as you suggested
|
Closes #33
Summary
- […] (datasketch)
- […] DeduplicationParams in YAML (deduplication_params.enabled: true)
- […] threshold, num_perm, shingle_size, text_field
- […] pip install -e '.[dedup]'
Approach
For each row in the merged dataset I build the set of character n-grams, compute a MinHash signature, and query an LSH index built so far. If a near-duplicate is already indexed (Jaccard above threshold), the row is dropped; otherwise it's inserted and kept.
Files changed
- pyproject.toml: [dedup] extras = ["datasketch>=1.6.0"]
- src/mmirage/config/config.py: DeduplicationParams (enabled, text_field, threshold, num_perm, shingle_size)
- src/mmirage/core/postprocess/fuzzy_dedup.py: MinHash-LSH dedup logic
- src/mmirage/merge_shards.py: dedup hook in merge_dataset_dir / merge_input_dir / merge_from_config, plus a --config flag on the CLI
- tests/test_dedup.py: TinyStories smoke test with --limit / --threshold / --num-perm / --shingle-size
- configs/config_comprehensive.yaml: deduplication_params example block
- README.md: "Fuzzy deduplication (optional)" section
YAML config
Tested
- […] datasketch: clean ImportError with install hint.
- enabled: false (default): datasketch is not imported, zero overhead.
- […] main; integrates with the merge API from Feature/merge #35.
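
For reviewers who want to see the flow from the Approach section in one place (character n-grams, a MinHash signature per row, and an LSH index that grows as rows are kept), here is a minimal sketch. It is not the code in fuzzy_dedup.py: it swaps datasketch for a hand-rolled MinHash and banded LSH built on the standard library so it runs without extra dependencies. The function names, the `bands` parameter, the default values, and the final check on the estimated Jaccard similarity are all assumptions made for illustration.

```python
import hashlib


def char_shingles(text, shingle_size=5):
    """Return the set of character n-grams of a string."""
    if len(text) <= shingle_size:
        return {text}
    return {text[i:i + shingle_size] for i in range(len(text) - shingle_size + 1)}


def minhash_signature(shingles, num_perm=128):
    """Stand-in for a MinHash: one salted 64-bit hash per 'permutation',
    keeping the minimum digest over the shingle set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingles
        ))
    return tuple(signature)


def fuzzy_dedup(rows, text_field="text", threshold=0.8, num_perm=128,
                shingle_size=5, bands=32):
    """Keep the first row of every group of near-duplicates (hypothetical helper)."""
    rows_per_band = num_perm // bands
    buckets = {}  # (band index, band slice) -> signatures of rows kept so far
    kept = []
    for row in rows:
        sig = minhash_signature(char_shingles(row[text_field], shingle_size), num_perm)
        keys = [(b, sig[b * rows_per_band:(b + 1) * rows_per_band])
                for b in range(bands)]
        # Rows sharing at least one band with this signature are candidates.
        candidates = {c for key in keys for c in buckets.get(key, ())}
        # The fraction of matching positions estimates the Jaccard similarity.
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, c)) / num_perm >= threshold
            for c in candidates
        )
        if is_duplicate:
            continue  # a near-duplicate is already indexed: drop the row
        for key in keys:
            buckets.setdefault(key, []).append(sig)
        kept.append(row)
    return kept


rows = [
    {"text": "Once upon a time there was a little red fox who lived in the woods."},
    {"text": "Once upon a time there was a little red fox who lived in the woods."},
    {"text": "A completely different story about trains and rainy afternoons."},
]
kept = fuzzy_dedup(rows)
```

Because rows are only compared against rows that were already kept, the result depends on input order: the first occurrence wins and later near-duplicates are dropped.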