datakit: optional exact dedup in normalize #4761
Conversation
🤖 FYI — investigation of exact-text duplicates in FineWeb-Edu sample/10BT, motivating why this flag is useful (so callers with already-deduped data can opt out):

FineWeb-Edu sample/10BT exact-text duplicates

Bottom line: 4.18% of rows are exact-text duplicates (403,945 extra rows / 9,672,101 total). Data source:

Group statistics

Group size distribution

Almost all dups are pairs. No long tail.

URL pattern (the most interesting finding)

CC dump pattern

So virtually every duplicate is across-dump — this is overwhelmingly the same content captured at different points in time, not within-dump duplication.

Text length distribution

Dups are mostly real articles (2k–10k chars), not short boilerplate. The short-text bucket is tiny.

Sample groups (all real, substantive content)

What this means

FineWeb-Edu sample/10BT's 4% exact dups are mostly the same web pages re-crawled in successive CommonCrawl dumps (2/3 of cases) and content syndicated/mirrored across sites (1/3). FineWeb's own dedup is per-dump, so cross-dump duplicates pass through. The dups are real long-form content, not template/boilerplate spam.

Output schema

Per row in the groups parquet:
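The 4.18% figure can be sanity-checked from the counts reported above (a quick arithmetic check, not part of the PR):

```python
# Quick arithmetic check of the reported duplicate rate
# (403,945 extra rows out of 9,672,101 total rows).
extra_dup_rows = 403_945
total_rows = 9_672_101

dup_rate = extra_dup_rows / total_rows
print(f"{dup_rate:.2%}")  # → 4.18%
```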
rjpower left a comment
LGTM
I'm a little wary of arguments accumulating à la default_tokenize/default_train, but perfect is the enemy of the good etc etc. More of a reaction to past horrors than to this change.
Long-term a fancy-pants version of this might have something more like a set of stages like [Normalize -> Dedup -> ... ], but that's far too much complexity for what we need now.
    num_shards: int,
    text_field: str,
    id_field: str | None,
    exact_dedup: bool,
nit: I have almost never created a boolean argument without realizing I want to change it to an enum:
enum DedupMode = { NONE, EXACT }
🤖 Good call — switched to a DedupMode StrEnum (NONE, EXACT) in ce16520.
Add an exact_dedup flag to normalize_to_parquet and normalize_step (default True, preserving current behavior) so callers that have already deduplicated upstream — or that want to keep duplicates — can skip the dedup pass while still benefiting from sharding and id-sorted output. The flag is included in hash_attrs so cached step outputs distinguish dedup vs non-dedup runs.
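Why the flag must participate in hash_attrs can be illustrated with a stand-in cache key (hash_attrs is the project's own mechanism; the hashing below is a hypothetical substitute, not datakit's implementation):

```python
import hashlib
import json

def cache_key(**attrs) -> str:
    """Stand-in for hash_attrs: key cached step outputs on a hash of
    the serialized attributes, so any attribute that changes the
    output (like exact_dedup) must be included."""
    blob = json.dumps(attrs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

# Dedup and non-dedup runs land in distinct cache entries:
with_dedup = cache_key(num_shards=64, exact_dedup=True)
without_dedup = cache_key(num_shards=64, exact_dedup=False)
assert with_dedup != without_dedup
```

If the flag were omitted from the key, a cached deduped output could be served to a caller that asked to keep duplicates.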
The reducer doesn't sort — sorting happens upstream via group_by's sort_by. Rename to reflect what the function actually does.
Per review, switch from a boolean flag to a StrEnum (NONE, EXACT) so future dedup variants (e.g. minhash, suffix-array) have a place to land without another boolean explosion.
Force-pushed from a5384f1 to ce16520
Add dedup_mode: DedupMode to normalize_to_parquet and normalize_step so the per-shard id-dedup pass can be skipped. DedupMode is a StrEnum with EXACT (current behavior, default) and NONE (skip dedup) — leaves room for future modes (e.g. minhash).[1] With DedupMode.NONE, the group_by reducer becomes a passthrough — records are still sharded and id-sorted, just not de-duplicated. dedup_mode is included in hash_attrs so cached step outputs distinguish runs by mode.

Footnotes
1. The default is EXACT, so existing call sites keep current behavior with no changes. ↩
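The two reducer behaviors described above can be sketched as follows (a hedged illustration; datakit's actual group_by reducer signature and row types are assumptions):

```python
from typing import Iterable

Row = dict  # assumed row shape: a mapping with an "id" field

def exact_reducer(rows: Iterable[Row]) -> list[Row]:
    # EXACT: rows arrive id-sorted via group_by's sort_by, so one
    # pass with a seen-set drops exact-id duplicates.
    seen: set[str] = set()
    out: list[Row] = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out

def passthrough_reducer(rows: Iterable[Row]) -> list[Row]:
    # NONE: emit rows unchanged; sharding and id-sorting already
    # happened upstream, so there is nothing to do here.
    return list(rows)

rows = [{"id": "a"}, {"id": "a"}, {"id": "b"}]
assert exact_reducer(rows) == [{"id": "a"}, {"id": "b"}]
assert passthrough_reducer(rows) == rows
```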