feat: add fuzzy deduplication post-processing by HarshaSatyavardhan · Pull Request #37 · EPFLiGHT/MMIRAGE

HarshaSatyavardhan · 2026-04-13T19:37:13Z

Closes #33

Summary

Optional fuzzy deduplication as a post-merge step, using character n-gram MinHash + LSH (datasketch)
CPU-only, no GPU/torch dependency
Configurable via DeduplicationParams in YAML (deduplication_params.enabled: true)
Tunable knobs: threshold, num_perm, shingle_size, text_field
Optional dependency group: pip install -e '.[dedup]'
Streaming "first-seen wins" pass — linear time, lazy-imported (no overhead when disabled)

Approach

For each row in the merged dataset, I build the set of character n-grams, compute a MinHash signature, and query the LSH index built from the rows kept so far. If a near-duplicate is already indexed (estimated Jaccard similarity above threshold), the row is dropped; otherwise it is inserted into the index and kept.
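
The pass described above can be sketched with datasketch's MinHash and MinHashLSH classes. This is a minimal illustration under stated assumptions, not the code in fuzzy_dedup.py: the names fuzzy_dedup and char_shingles are hypothetical, the handling of texts shorter than the shingle size is a guess, and the defaults simply mirror the YAML example in this PR.

```python
from typing import Dict, Iterable, Iterator, Set


def char_shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of character k-grams (the whole text if it is shorter than k)."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def fuzzy_dedup(
    rows: Iterable[Dict],
    text_field: str = "text",
    threshold: float = 0.85,
    num_perm: int = 128,
    shingle_size: int = 5,
) -> Iterator[Dict]:
    """Yield rows whose text has no near-duplicate among the rows kept so far."""
    # Imported lazily so the optional dependency is only needed when dedup runs.
    from datasketch import MinHash, MinHashLSH

    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for idx, row in enumerate(rows):
        sig = MinHash(num_perm=num_perm)
        for gram in char_shingles(row[text_field], shingle_size):
            sig.update(gram.encode("utf-8"))
        # query() returns the keys of indexed candidates; any hit means an
        # earlier row is probably above the threshold ("first-seen wins").
        if lsh.query(sig):
            continue
        lsh.insert(str(idx), sig)
        yield row
```

Because LSH lookups are approximate, a candidate returned by query() is a likely near-duplicate rather than a verified one; a stricter variant could re-check candidates with an exact Jaccard computation before dropping the row.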

Files changed

pyproject.toml — [dedup] extras = ["datasketch>=1.6.0"]
src/mmirage/config/config.py — DeduplicationParams (enabled, text_field, threshold, num_perm, shingle_size)
src/mmirage/core/postprocess/fuzzy_dedup.py — MinHash-LSH dedup logic
src/mmirage/merge_shards.py — dedup hook in merge_dataset_dir / merge_input_dir / merge_from_config, plus a --config flag on the CLI
tests/test_dedup.py — TinyStories smoke test with --limit / --threshold / --num-perm / --shingle-size
configs/config_comprehensive.yaml — deduplication_params example block
README.md — "Fuzzy deduplication (optional)" section

YAML config

deduplication_params:
  enabled: true
  text_field: text
  threshold: 0.85       # Jaccard similarity threshold
  num_perm: 128         # MinHash signature size
  shingle_size: 5       # character n-gram size
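
As an illustration of how a block like this might map onto a typed config object, here is a small sketch. DedupParamsSketch and params_from_config are hypothetical names, not the DeduplicationParams class added in config.py; the field names come from this PR, while the defaults (other than enabled: false) and the validation ranges are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DedupParamsSketch:
    """Hypothetical stand-in for DeduplicationParams (not the class in config.py)."""

    enabled: bool = False        # dedup is off unless explicitly enabled
    text_field: str = "text"     # assumed default, mirrors the YAML example
    threshold: float = 0.85      # Jaccard similarity threshold
    num_perm: int = 128          # MinHash signature size
    shingle_size: int = 5        # character n-gram size

    def __post_init__(self) -> None:
        # Validation ranges are assumptions made for this sketch.
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be in [0, 1]")
        if self.num_perm < 2:
            raise ValueError("num_perm must be at least 2")
        if self.shingle_size < 1:
            raise ValueError("shingle_size must be at least 1")


def params_from_config(cfg: Dict[str, Any]) -> DedupParamsSketch:
    """Build params from an already-parsed YAML mapping; a missing block means disabled."""
    return DedupParamsSketch(**cfg.get("deduplication_params", {}))
```

Passing the parsed YAML mapping to params_from_config yields an object with enabled set to True for the block above, and a disabled default when the block is absent.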

Tested

Synthetic near-duplicate dataset: detects and drops the near-duplicate, keeps distinct rows.
Missing datasketch: clean ImportError with install hint.
enabled: false (default): datasketch is not imported, zero overhead.
Rebased on current main; integrates with the merge API from Feature/merge #35.
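
The lazy-import behaviour checked above (no import when disabled, a clear ImportError with an install hint when the package is missing) can be sketched as follows. The helper name require_optional is hypothetical and the message wording is an assumption; only the pip extras name comes from this PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on demand, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install -e '.[{extra}]'"
        ) from exc


# Intended call site: only inside the branch that runs when dedup is enabled,
# e.g. datasketch = require_optional("datasketch", "dedup"), so a disabled
# config never triggers the import.
```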

fabnemEPFL · 2026-04-13T20:34:26Z

Hi @HarshaSatyavardhan, in the future it would be great to wait for approval of the issues you submit before spending time implementing them. You can ping me again if several days have passed with no sign of life from me; I'm busy with several different projects and can't monitor MMIRAGE every day, but I should check at least once a week.
That said, this feature sounds pretty promising; I'm reviewing it now.

fabnemEPFL · 2026-04-13T20:39:01Z

After some thought, I think embedding-based similarity is overkill here. Character-based similarity should be more accurate, faster, and less computationally intensive.

HarshaSatyavardhan · 2026-04-26T02:29:48Z

@fabnemEPFL switched to character-based fuzzy dedup as you suggested. It's now char n-gram MinHash + LSH via datasketch — single small dep, CPU-only, lazy-imported when disabled.

I also rebased onto current main so the dedup hook lives in the new merge_dataset_dir / merge_from_config paths from #35, and updated the PR title and body to match.

One thing worth flagging: this branch is rebased on main, so it temporarily reincludes the typing.override issue in jsonl.py on Python 3.10/3.11. The fix is in #30; once that lands this branch picks it up automatically.

fabnemEPFL · 2026-04-29T13:26:13Z

Hi Harsha, thanks for the work on this. After trying this on the test dataset you provided and realizing it was still quite slow, I think I found a way to make the algorithm faster.
Assuming duplicates can also be exact, the script could first compute a hash of each sample, which makes identical samples easy to find. Once this is done, it can perform the character-based fuzzy dedup (which could be enabled or disabled depending on whether the user cares only about exact matches).

HarshaSatyavardhan · 2026-05-05T06:18:25Z

Addressed the slowness as you suggested:

Added an exact-hash pre-pass before the fuzzy pass (hashlib.blake2b over normalized text), and switched the MinHash loop to update_batch for an additional speedup. Both stages are independently toggleable via exact / fuzzy flags; with enabled: true and no sub-flags set
Added an orchestrator-level test in tests/test_dedup.py covering all four (exact, fuzzy) combinations plus the no-op guard.
Tests pass on Python 3.12+; on 3.10/3.11 the test file currently can't import the package because of the unrelated typing.override issue tracked in #30.
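
A rough sketch of the two changes mentioned in this comment, for readers following along. It is not the code in this PR: the function names are hypothetical, and the normalization (lowercasing plus whitespace collapsing) and the 16-byte digest size are assumptions, since the comment does not specify them.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_dedup(rows: Iterable[Dict], text_field: str = "text") -> Iterator[Dict]:
    """Yield only the first row seen for each normalized-text digest."""
    seen = set()
    for row in rows:
        digest = hashlib.blake2b(
            normalize(row[text_field]).encode("utf-8"), digest_size=16
        ).digest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield row


def batch_signature(grams: Iterable[str], num_perm: int = 128):
    """Build a MinHash signature with one update_batch call instead of a Python loop."""
    from datasketch import MinHash  # lazy: only needed for the fuzzy stage

    sig = MinHash(num_perm=num_perm)
    sig.update_batch([g.encode("utf-8") for g in grams])
    return sig
```

Running exact_dedup first shrinks the input to the fuzzy stage, so fewer MinHash signatures need to be computed; the set of digests costs a fixed number of bytes per distinct text.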

feat: add fuzzy deduplication post-processing

9f4cb0b

HarshaSatyavardhan force-pushed the issue-33/semantic-dedup branch from 84201b8 to 9f4cb0b on April 26, 2026 02:29

HarshaSatyavardhan changed the title from "feat: add semantic deduplication post-processing" to "feat: add fuzzy deduplication post-processing" on Apr 26, 2026

HarshaSatyavardhan had a problem deploying to docker April 28, 2026 15:27 — with GitHub Actions Failure

HarshaSatyavardhan had a problem deploying to docker April 28, 2026 15:27 — with GitHub Actions Error

HarshaSatyavardhan added 2 commits May 5, 2026 09:30

feat: add exact-hash dedup pass before fuzzy

e0938f9

fix: address review on exact-pass dedup

ff85484


HarshaSatyavardhan wants to merge 3 commits into EPFLiGHT:main from HarshaSatyavardhan:issue-33/semantic-dedup
