
datakit: optional exact dedup in normalize#4761

Merged
ravwojdyla merged 3 commits into main from worktree-rav-norm-exact-dedup-optional
Apr 15, 2026

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 14, 2026

  • add dedup_mode: DedupMode to normalize_to_parquet and normalize_step so the per-shard id-dedup pass can be skipped
  • DedupMode is a StrEnum with EXACT (current behavior, default) and NONE (skip dedup) — leaves room for future modes (e.g. minhash)[1]
  • when DedupMode.NONE, the group_by reducer becomes a passthrough — records are still sharded and id-sorted, just not de-duplicated
  • dedup_mode is included in hash_attrs so cached step outputs distinguish runs by mode

Footnotes

  [1] Default is EXACT, so existing call sites keep current behavior with no changes.

@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 14, 2026
@ravwojdyla-agent ravwojdyla-agent changed the title [datakit] Make normalize exact dedup optional datakit: optional exact dedup in normalize Apr 14, 2026
@ravwojdyla-agent
Contributor Author

🤖 FYI — investigation of exact-text duplicates in FineWeb-Edu sample/10BT, motivating why this flag is useful (so callers with already-deduped data can opt out):

FineWeb-Edu sample/10BT exact-text duplicates

Bottom line: 4.18% of rows are exact-text duplicates (403,945 extra rows / 9,672,101 total).

Data source: gs://marin-tmp-us-central1/ttl=1d/investigate-exact-dups/dups-v2-20260414-222226/groups/ (expires ~2026-04-15 22:38 UTC).

Group statistics

  • 381,972 unique groups with ≥2 occurrences
  • 785,917 rows belong to a duplicate group (= 381,972 unique texts × avg 2.06 copies each)
  • "Extras beyond first occurrence" = 403,945 (matches the normalize→download row count drop exactly)

Group size distribution

| Group size | # groups |
|---|---|
| 2 | 361,228 (94.6%) |
| 3 | 19,590 (5.1%) |
| 4 | 1,091 (0.3%) |
| 5 | 56 |
| 6 | 4 |
| 7 | 2 |
| 9 | 1 |

Almost all dups are pairs. No long tail.
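The reported counts are internally consistent; a quick sanity check over the distribution above (numbers copied from this comment):

```python
# Group-size distribution: size -> number of groups with that many copies.
sizes = {2: 361_228, 3: 19_590, 4: 1_091, 5: 56, 6: 4, 7: 2, 9: 1}

groups = sum(sizes.values())                       # unique duplicated texts
rows_in_dup_groups = sum(k * v for k, v in sizes.items())
extras = rows_in_dup_groups - groups               # copies beyond the first

assert groups == 381_972
assert rows_in_dup_groups == 785_917
assert extras == 403_945                           # matches the row-count drop
print(f"{extras / 9_672_101:.2%}")                 # prints 4.18%
```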

URL pattern (the most interesting finding)

  • Same URL across dups: 256,620 groups (67%) — the same URL was crawled multiple times across different CC dumps and the page content was byte-identical
  • Different URLs across dups: 125,352 groups (33%) — true cross-site duplication (mirroring, syndication, copy-paste)

CC dump pattern

  • Different CC dumps: 381,969 groups (99.999%)
  • Same dump: only 3 groups

So virtually every duplicate is across-dump — this is overwhelmingly the same content captured at different points in time, not within-dump duplication.

Text length distribution

| Length | # groups |
|---|---|
| <500 chars | 4,624 (1.2%) |
| 500–2k | 125,765 (32.9%) |
| 2k–10k | 221,771 (58.0%) |
| ≥10k | 29,812 (7.8%) |

Dups are mostly real articles (2k–10k chars), not short boilerplate. The short-text bucket is tiny.

Sample groups (all real, substantive content)

  • Same article on eurekalert.org via http and https crawls
  • NPR story syndicated from upr.org to hawaiipublicradio.org
  • Same blog post on argonelectronics.com re-crawled in two CC dumps
  • Two versions of FLASH code docs (4p22 vs 4p61) producing identical Logfile_create_F90.html
  • MIT African Tech article on Lake Malawi re-crawled
  • German Catholic history page that moved from lib.ndsu.nodak.edu to library.ndsu.edu

What this means

FineWeb-Edu sample/10BT's 4% exact dups are mostly the same web pages re-crawled in successive CommonCrawl dumps (2/3 of cases) and content syndicated/mirrored across sites (1/3). FineWeb's own dedup is per-dump, so cross-dump duplicates pass through. The dups are real long-form content, not template/boilerplate spam.

Output schema

Per row in the groups parquet:

  • text_hash: str — xxh3_128 hex of the text
  • text_sample: str — first 500 chars
  • text_len: int
  • count: int — number of occurrences
  • members: list<struct> — per-occurrence dict with id, url, dump, file_path, language, language_score, token_count, score
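One row of the groups parquet could be modeled as plain dataclasses (a sketch; field names are from the schema list above, the class names `DupGroup`/`DupMember` are invented for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class DupMember:
    # Per-occurrence metadata for one copy of the text.
    id: str
    url: str
    dump: str             # CommonCrawl dump the copy came from
    file_path: str
    language: str
    language_score: float
    token_count: int
    score: float


@dataclass
class DupGroup:
    text_hash: str        # xxh3_128 hex of the full text
    text_sample: str      # first 500 chars
    text_len: int
    count: int            # number of occurrences; equals len(members)
    members: list[DupMember] = field(default_factory=list)
```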

@ravwojdyla ravwojdyla requested a review from Helw150 April 14, 2026 23:54
Collaborator

@rjpower rjpower left a comment


LGTM

I'm a little wary of arguments accumulating a la default_tokenize/default_train, but perfect is the enemy of the good etc etc. More of a reaction to past horrors than this change.

Long-term a fancy-pants version of this might have something more like a set of stages like [Normalize -> Dedup -> ... ], but that's far too much complexity for what we need now.

num_shards: int,
text_field: str,
id_field: str | None,
exact_dedup: bool,
Collaborator


nit: I have almost never created a boolean argument without realizing I want to change it to an enum:

enum DedupMode = { NONE, EXACT }

Contributor Author


🤖 Good call — switched to a DedupMode StrEnum (NONE, EXACT) in ce16520.

  • Add an exact_dedup flag to normalize_to_parquet and normalize_step (default True, preserving current behavior) so callers that have already deduplicated upstream, or that want to keep duplicates, can skip the dedup pass while still benefiting from sharding and id-sorted output. The flag is included in hash_attrs so cached step outputs distinguish dedup vs non-dedup runs.
  • The reducer doesn't sort; sorting happens upstream via group_by's sort_by. Rename to reflect what the function actually does.
  • Per review, switch from a boolean flag to a StrEnum (NONE, EXACT) so future dedup variants (e.g. minhash, suffix-array) have a place to land without another boolean explosion.
@ravwojdyla ravwojdyla force-pushed the worktree-rav-norm-exact-dedup-optional branch from a5384f1 to ce16520 Compare April 15, 2026 19:14
@ravwojdyla ravwojdyla merged commit 6bd0603 into main Apr 15, 2026
43 checks passed
@ravwojdyla ravwojdyla deleted the worktree-rav-norm-exact-dedup-optional branch April 15, 2026 19:46
Labels

agent-generated Created by automation/agent

3 participants