
datakit: optional exact dedup in normalize#4761

Merged
ravwojdyla merged 3 commits into main from worktree-rav-norm-exact-dedup-optional
Apr 15, 2026

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 14, 2026

  • add dedup_mode: DedupMode to normalize_to_parquet and normalize_step so the per-shard id-dedup pass can be skipped
  • DedupMode is a StrEnum with EXACT (current behavior, default) and NONE (skip dedup) — leaves room for future modes (e.g. minhash)[1]
  • when DedupMode.NONE, the group_by reducer becomes a passthrough — records are still sharded and id-sorted, just not de-duplicated
  • dedup_mode is included in hash_attrs so cached step outputs distinguish runs by mode

Footnotes

  [1] Default is EXACT, so existing call sites keep current behavior with no changes.

@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 14, 2026
@ravwojdyla-agent ravwojdyla-agent changed the title [datakit] Make normalize exact dedup optional datakit: optional exact dedup in normalize Apr 14, 2026
@ravwojdyla-agent
Contributor Author

🤖 FYI — investigation of exact-text duplicates in FineWeb-Edu sample/10BT, motivating why this flag is useful (so callers with already-deduped data can opt out):

FineWeb-Edu sample/10BT exact-text duplicates

Bottom line: 4.18% of rows are exact-text duplicates (403,945 extra rows / 9,672,101 total).

Data source: gs://marin-tmp-us-central1/ttl=1d/investigate-exact-dups/dups-v2-20260414-222226/groups/ (expires ~2026-04-15 22:38 UTC).

Group statistics

  • 381,972 unique groups with ≥2 occurrences
  • 785,917 rows belong to a duplicate group (= 381,972 unique texts × avg 2.06 copies each)
  • "Extras beyond first occurrence" = 403,945 (matches the normalize→download row count drop exactly)

Group size distribution

| Group size | # groups |
|---|---|
| 2 | 361,228 (94.6%) |
| 3 | 19,590 (5.1%) |
| 4 | 1,091 (0.3%) |
| 5 | 56 |
| 6 | 4 |
| 7 | 2 |
| 9 | 1 |

Almost all dups are pairs. No long tail.
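The reported counts are internally consistent; a quick sanity check over the distribution above (numbers copied from this comment):

```python
# Group-size distribution: size -> number of groups with that many copies.
sizes = {2: 361_228, 3: 19_590, 4: 1_091, 5: 56, 6: 4, 7: 2, 9: 1}

groups = sum(sizes.values())                       # unique duplicated texts
rows_in_dup_groups = sum(k * v for k, v in sizes.items())
extras = rows_in_dup_groups - groups               # copies beyond the first

assert groups == 381_972
assert rows_in_dup_groups == 785_917
assert extras == 403_945                           # matches the row-count drop
print(f"{extras / 9_672_101:.2%}")                 # prints 4.18%
```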

URL pattern (the most interesting finding)

  • Same URL across dups: 256,620 groups (67%) — the same URL was crawled multiple times across different CC dumps and the page content was byte-identical
  • Different URLs across dups: 125,352 groups (33%) — true cross-site duplication (mirroring, syndication, copy-paste)

CC dump pattern

  • Different CC dumps: 381,969 groups (99.999%)
  • Same dump: only 3 groups

So virtually every duplicate is across-dump — this is overwhelmingly the same content captured at different points in time, not within-dump duplication.

Text length distribution

| Length | # groups |
|---|---|
| <500 chars | 4,624 (1.2%) |
| 500–2k | 125,765 (32.9%) |
| 2k–10k | 221,771 (58.0%) |
| ≥10k | 29,812 (7.8%) |

Dups are mostly real articles (2k–10k chars), not short boilerplate. The short-text bucket is tiny.

Sample groups (all real, substantive content)

  • Same article on eurekalert.org via http and https crawls
  • NPR story syndicated from upr.org to hawaiipublicradio.org
  • Same blog post on argonelectronics.com re-crawled in two CC dumps
  • Two versions of FLASH code docs (4p22 vs 4p61) producing identical Logfile_create_F90.html
  • MIT African Tech article on Lake Malawi re-crawled
  • German Catholic history page that moved from lib.ndsu.nodak.edu to library.ndsu.edu

What this means

FineWeb-Edu sample/10BT's 4% exact dups are mostly the same web pages re-crawled in successive CommonCrawl dumps (2/3 of cases) and content syndicated/mirrored across sites (1/3). FineWeb's own dedup is per-dump, so cross-dump duplicates pass through. The dups are real long-form content, not template/boilerplate spam.

Output schema

Per row in the groups parquet:

  • text_hash: str — xxh3_128 hex of the text
  • text_sample: str — first 500 chars
  • text_len: int
  • count: int — number of occurrences
  • members: list<struct> — per-occurrence dict with id, url, dump, file_path, language, language_score, token_count, score
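One row of the groups parquet could be modeled as plain dataclasses (a sketch; field names are from the schema list above, the class names `DupGroup`/`DupMember` are invented for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class DupMember:
    # Per-occurrence metadata for one copy of the text.
    id: str
    url: str
    dump: str             # CommonCrawl dump the copy came from
    file_path: str
    language: str
    language_score: float
    token_count: int
    score: float


@dataclass
class DupGroup:
    text_hash: str        # xxh3_128 hex of the full text
    text_sample: str      # first 500 chars
    text_len: int
    count: int            # number of occurrences; equals len(members)
    members: list[DupMember] = field(default_factory=list)
```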

@ravwojdyla ravwojdyla requested a review from Helw150 April 14, 2026 23:54
Collaborator

@rjpower rjpower left a comment


LGTM

I'm a little wary of arguments accumulating a la default_tokenize/default_train, but perfect is the enemy of the good etc etc. More of a reaction to past horrors than this change.

Long-term a fancy-pants version of this might have something more like a set of stages like [Normalize -> Dedup -> ... ], but that's far too much complexity for what we need now.

num_shards: int,
text_field: str,
id_field: str | None,
exact_dedup: bool,
Collaborator


nit: I have almost never created a boolean argument without realizing I want to change it to an enum:

enum DedupMode = { NONE, EXACT }

Contributor Author


🤖 Good call — switched to a DedupMode StrEnum (NONE, EXACT) in ce16520.

  • Add an exact_dedup flag to normalize_to_parquet and normalize_step (default True, preserving current behavior) so callers that have already deduplicated upstream, or that want to keep duplicates, can skip the dedup pass while still benefiting from sharding and id-sorted output. The flag is included in hash_attrs so cached step outputs distinguish dedup vs non-dedup runs.
  • The reducer doesn't sort; sorting happens upstream via group_by's sort_by. Rename to reflect what the function actually does.
  • Per review, switch from a boolean flag to a StrEnum (NONE, EXACT) so future dedup variants (e.g. minhash, suffix-array) have a place to land without another boolean explosion.
@ravwojdyla ravwojdyla force-pushed the worktree-rav-norm-exact-dedup-optional branch from a5384f1 to ce16520 Compare April 15, 2026 19:14
@ravwojdyla ravwojdyla merged commit 6bd0603 into main Apr 15, 2026
43 checks passed
@ravwojdyla ravwojdyla deleted the worktree-rav-norm-exact-dedup-optional branch April 15, 2026 19:46
Labels

agent-generated Created by automation/agent

3 participants