feat: add fuzzy deduplication post-processing #37

Open
HarshaSatyavardhan wants to merge 3 commits into EPFLiGHT:main from HarshaSatyavardhan:issue-33/semantic-dedup

Conversation

@HarshaSatyavardhan
Contributor

HarshaSatyavardhan commented Apr 13, 2026

Closes #33

Summary

  • Optional fuzzy deduplication as a post-merge step, using character n-gram MinHash + LSH (datasketch)
  • CPU-only, no GPU/torch dependency
  • Configurable via DeduplicationParams in YAML (deduplication_params.enabled: true)
  • Tunable knobs: threshold, num_perm, shingle_size, text_field
  • Optional dependency group: pip install -e '.[dedup]'
  • Streaming "first-seen wins" pass — linear time, lazy-imported (no overhead when disabled)

Approach

For each row in the merged dataset, I build the set of character n-grams, compute a MinHash signature, and query the LSH index built so far. If a near-duplicate is already indexed (estimated Jaccard similarity above the threshold), the row is dropped; otherwise it is inserted into the index and kept.
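
The streaming pass described above can be sketched without any dependency. This is a toy, pure-Python stand-in for datasketch and not the code in this PR: the class `TinyMinHashLSH`, the function `dedup_first_seen`, the band count, and the hashing scheme are all hypothetical choices made for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, size=5):
    """Return the set of character n-grams of the given size."""
    if len(text) < size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def base_hash(shingle):
    """Stable 64-bit hash of one shingle (the built-in hash() is salted per run)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class TinyMinHashLSH:
    """Toy MinHash + banded LSH index, for illustration only."""

    def __init__(self, threshold=0.85, num_perm=128, bands=16, seed=1):
        assert num_perm % bands == 0, "num_perm must be divisible by bands"
        rng = random.Random(seed)
        # One (a, b) pair per simulated permutation: h_i(x) = (a*x + b) mod p.
        self.coeffs = [
            (rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(num_perm)
        ]
        self.threshold = threshold
        self.bands = bands
        self.rows = num_perm // bands
        # One bucket table per band: {band slice of signature: [signatures]}.
        self.buckets = [{} for _ in range(bands)]

    def signature(self, text, size=5):
        """MinHash signature: the minimum of each hash function over all shingles."""
        hashes = [base_hash(s) for s in shingles(text, size)]
        return tuple(
            min((a * h + b) % MERSENNE_PRIME for h in hashes)
            for a, b in self.coeffs
        )

    def _band_keys(self, sig):
        for i in range(self.bands):
            yield i, sig[i * self.rows:(i + 1) * self.rows]

    def has_near_duplicate(self, sig):
        """Check candidates sharing a band, then compare estimated Jaccard."""
        for i, key in self._band_keys(sig):
            for other in self.buckets[i].get(key, ()):
                matches = sum(x == y for x, y in zip(sig, other))
                if matches / len(sig) >= self.threshold:
                    return True
        return False

    def insert(self, sig):
        for i, key in self._band_keys(sig):
            self.buckets[i].setdefault(key, []).append(sig)


def dedup_first_seen(rows, text_field="text", threshold=0.85, shingle_size=5):
    """Single streaming pass: the first row of each near-duplicate group is kept."""
    index = TinyMinHashLSH(threshold=threshold)
    kept = []
    for row in rows:
        sig = index.signature(row[text_field], shingle_size)
        if index.has_near_duplicate(sig):
            continue  # a similar row was indexed earlier, so drop this one
        index.insert(sig)
        kept.append(row)
    return kept
```

Each row is queried against, and inserted into, the index at most once, so the pass makes a single sweep over the data. The result depends on row order, since only the first row of each near-duplicate group survives.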

Files changed

  • pyproject.toml: [dedup] extras = ["datasketch>=1.6.0"]
  • src/mmirage/config/config.py: DeduplicationParams (enabled, text_field, threshold, num_perm, shingle_size)
  • src/mmirage/core/postprocess/fuzzy_dedup.py: MinHash-LSH dedup logic
  • src/mmirage/merge_shards.py: dedup hook in merge_dataset_dir / merge_input_dir / merge_from_config, plus a --config flag on the CLI
  • tests/test_dedup.py: TinyStories smoke test with --limit / --threshold / --num-perm / --shingle-size
  • configs/config_comprehensive.yaml: deduplication_params example block
  • README.md: "Fuzzy deduplication (optional)" section

YAML config

deduplication_params:
  enabled: true
  text_field: text
  threshold: 0.85       # Jaccard similarity threshold
  num_perm: 128         # MinHash signature size
  shingle_size: 5       # character n-gram size
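
As a rough illustration of how a block like this might map onto a config object, here is a minimal sketch. It is not the `DeduplicationParams` class from `config.py`: the field names follow the YAML keys, `enabled: false` is the stated default, and every other default, the validation rules, and the `from_mapping` helper are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DeduplicationParams:
    # Defaults other than enabled=False are copied from the example
    # block above and are assumptions, not the PR's actual defaults.
    enabled: bool = False
    text_field: str = "text"
    threshold: float = 0.85   # Jaccard similarity threshold
    num_perm: int = 128       # MinHash signature size
    shingle_size: int = 5     # character n-gram size

    def __post_init__(self):
        # Illustrative sanity checks only.
        if not 0.0 < self.threshold <= 1.0:
            raise ValueError(f"threshold must be in (0, 1], got {self.threshold}")
        if self.num_perm < 2:
            raise ValueError(f"num_perm must be >= 2, got {self.num_perm}")
        if self.shingle_size < 1:
            raise ValueError(f"shingle_size must be >= 1, got {self.shingle_size}")

    @classmethod
    def from_mapping(cls, raw):
        """Build params from an already-parsed YAML mapping (hypothetical helper)."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown deduplication_params keys: {sorted(unknown)}")
        return cls(**raw)
```

Rejecting unknown keys makes a misspelled knob fail loudly instead of silently falling back to a default.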

Tested

  • Synthetic near-duplicate dataset: detects and drops the near-duplicate, keeps distinct rows.
  • Missing datasketch: clean ImportError with install hint.
  • enabled: false (default): datasketch is not imported, zero overhead.
  • Rebased on current main; integrates with the merge API from Feature/merge #35.
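
The lazy-import behaviour checked above (no import when disabled, a clean ImportError with an install hint when the package is missing) could look roughly like the following. The helper names `load_optional` and `maybe_load_dedup_backend` are hypothetical and the message wording is an assumption; only the `pip install -e '.[dedup]'` hint comes from this PR.

```python
import importlib


def load_optional(module_name, extra):
    """Import an optional dependency on demand, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install -e '.[{extra}]'"
        ) from exc


def maybe_load_dedup_backend(enabled):
    """Return the backend module when enabled, otherwise None without importing."""
    if not enabled:
        return None  # the optional dependency is never touched on this path
    return load_optional("datasketch", "dedup")
```

Keeping the import inside the function, rather than at module top level, is what lets the disabled path run on an installation without the extra.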

@fabnemEPFL
Collaborator

Hi @HarshaSatyavardhan, in the future it would be great to wait for approval of the issues you submit before spending time implementing them. You can ping me again if several days pass with no sign of life from me; I'm busy with several different projects and can't monitor MMIRAGE every day, but I should check at least once a week.
That said, this feature sounds pretty promising, and I'm reviewing it now.

@fabnemEPFL
Collaborator

After some thought, I think embedding-based similarity is overkill here. Character-based similarity should be more accurate, faster, and less computationally intensive.

@HarshaSatyavardhan changed the title from "feat: add semantic deduplication post-processing" to "feat: add fuzzy deduplication post-processing" on Apr 26, 2026
@HarshaSatyavardhan
Contributor Author

HarshaSatyavardhan commented Apr 26, 2026

@fabnemEPFL switched to character-based fuzzy dedup as you suggested. It's now char n-gram MinHash + LSH via datasketch — single small dep, CPU-only, lazy-imported when disabled.

I also rebased onto current main so the dedup hook lives in the new merge_dataset_dir / merge_from_config paths from #35, and updated the PR title and body to match.

One thing worth flagging: this branch is rebased on main, so it temporarily reincludes the typing.override issue in jsonl.py on Python 3.10/3.11. The fix is in #30; once that lands this branch picks it up automatically.

@fabnemEPFL
Collaborator

Hi Harsha, thanks for the work on this. I tried it on the test dataset you provided and it was still quite slow, so here is an idea to make the algorithm faster:
since duplicates can also be exact, first compute a hash of each sample and use it to find identical samples cheaply. Once that is done, let the script run the character-based fuzzy dedup (which could be enabled or disabled, depending on whether the user cares only about exact matches).

@HarshaSatyavardhan
Contributor Author

Addressed the slowness as you suggested:

  • Added an exact-hash pre-pass before fuzzy (hashlib.blake2b over normalized text), and switched the MinHash loop to update_batch for an additional speedup. Both stages are independently toggleable via exact / fuzzy flags; with enabled: true and no sub-flags set

  • Added an orchestrator-level test in tests/test_dedup.py covering all four (exact, fuzzy) combos plus the no-op guard

  • Tests pass on Python 3.12+; on 3.10/3.11 the test file currently can't import the package due to the unrelated typing.override issue tracked in #30
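
A minimal sketch of what an exact-hash pre-pass of this kind might look like, assuming rows are dicts with a text field. The `normalize` rule (case-folding and whitespace collapsing), the digest size, and the function names are assumptions; the PR only states that `hashlib.blake2b` is applied to normalized text.

```python
import hashlib


def normalize(text):
    # Assumed normalization: case-fold and collapse runs of whitespace.
    return " ".join(text.casefold().split())


def exact_dedup(rows, text_field="text"):
    """Drop rows whose normalized text hashes to an already-seen digest."""
    seen = set()
    kept = []
    for row in rows:
        payload = normalize(row[text_field]).encode("utf-8")
        digest = hashlib.blake2b(payload, digest_size=16).digest()
        if digest in seen:
            continue  # identical (after normalization) to an earlier row
        seen.add(digest)
        kept.append(row)
    return kept
```

Running a pass like this first means the more expensive fuzzy stage only has to build signatures for rows that survive it.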

Development

Successfully merging this pull request may close these issues.

Feature: Add deduplication stage to the processing pipeline