
[marin] datakit/normalize: compact pathological whitespace runs #4603

Open

ravwojdyla-agent wants to merge 1 commit into main from rav/datakit-normalize-filter-long-docs

Conversation

@ravwojdyla-agent
Contributor

ravwojdyla-agent commented Apr 9, 2026

Summary

  • Add a max_whitespace_run_chars option (default 128) to the datakit normalization step. Consecutive whitespace runs exceeding the limit are truncated to that length — preserving the surrounding content rather than dropping the entire document (see the sketch after this list).
  • Handles broken HTML→text extraction artifacts (e.g. multi-MB space runs, cf. datakit/tokenization - cleanup or skip bad records #4588) that can OOM downstream tokenization, while keeping the actual useful text.
  • Affected records are counted via a new Zephyr counter, datakit_normalize_compacted_whitespace. Document id is recomputed after compaction.
  • Follow-up to [levanter] Cap homogeneous-run length when calling Rust BPE tokenizers #4600 (which caps homogeneous runs inside the tokenizer). Pass max_whitespace_run_chars=None to disable.
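
A minimal sketch of the compaction, assuming a hypothetical helper name and signature (only the max_whitespace_run_chars option, the 128 default, and the None-disables behavior come from this PR):

```python
import re


def compact_whitespace_runs(text: str, max_whitespace_run_chars: int | None = 128) -> str:
    """Truncate any consecutive whitespace run longer than the limit.

    Illustrative only; the real normalization step also recomputes the
    document id and increments the datakit_normalize_compacted_whitespace
    Zephyr counter when a document is changed.
    """
    if max_whitespace_run_chars is None:
        return text  # compaction disabled
    # Match runs strictly longer than the limit and keep only their first
    # max_whitespace_run_chars characters, preserving all surrounding text.
    pattern = re.compile(r"\s{%d,}" % (max_whitespace_run_chars + 1))
    return pattern.sub(lambda m: m.group(0)[:max_whitespace_run_chars], text)
```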

Test plan

  • tests/datakit/test_normalize.py — new cases: verifies compaction preserves content and recomputes id; None disables compaction. Existing 10 tests still pass.
  • ./infra/pre-commit.py --fix clean.

ravwojdyla-agent added the agent-generated label (Created by automation/agent) on Apr 9, 2026
@Helw150
Member

Helw150 commented Apr 9, 2026

We definitely don't want to drop things at 1M chars. I think we want a super super conservative definition of pathologically long here.

@ravwojdyla
Contributor

ravwojdyla commented Apr 9, 2026

@Helw150 sounds good. Now looking at the other issue - I am not even sure where it got the 1M chars 🤔

Got my answer:

I made it up — it was a judgment call, not derived from any analysis. PR #4600 caps the tokenizer at 400k chars; I picked 1M as a conservative buffer above that. But I don't have data on actual document length distributions to back it up.
Want me to change it to something else, or would you prefer no default (require callers to specify explicitly)?

@Helw150
Member

Helw150 commented Apr 9, 2026

By gut, I'd say something absurd like 100 megabytes. Pathological here should be super pathological (logic being that models can have context lengths up to 1M nowadays... and the raw data for 1M tokens can be pretty large). The example we are looking at in that gist only ends up being 2M Llama tokens, despite being 250M chars of raw text.

In general, I'm still a little bit stressed about this living hardcoded in a tokenizer-independent way though, since the tokenizer itself is a compression algorithm and this seems like an easy footgun to hit for long-context and/or multimodal models.

@ravwojdyla
Contributor

@Helw150 should we just filter the specific case of unrealistically long sequences of whitespace, or should we filter in general?

ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 6b4e45c to 1a9ce68 on April 10, 2026 00:09
ravwojdyla-agent changed the title from "[marin] datakit/normalize: drop pathologically long documents" to "[marin] datakit/normalize: drop docs with pathological whitespace runs" on Apr 10, 2026
@Helw150
Member

Helw150 commented Apr 10, 2026

In that particular case, I'd be less worried about compacting whitespaces (e.g. convert anything above 128 spaces in a row to 128 spaces). Feels less footgunny and would actually turn that document into useful data (if you strip the repeated whitespace, it's not bad data!)
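
A quick illustration of that idea (illustrative snippet, not code from the PR): a single re.sub with a 128-space cap collapses a multi-MB run while keeping the text on either side.

```python
>>> import re
>>> text = "useful text" + " " * 1_000_000 + "more useful text"
>>> compacted = re.sub(r" {129,}", " " * 128, text)
>>> len(text), len(compacted)
(1000027, 155)
```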

@ravwojdyla-agent
Contributor Author

🤖 Note: the 100k default for max_whitespace_run_chars is not derived from any data analysis — it's an arbitrary round number above the tokenizer's 25k homogeneous-run cap from #4600. Happy to adjust if someone has data on actual whitespace run distributions in our corpora.

ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 1a9ce68 to 764e0d6 on April 10, 2026 00:17
ravwojdyla-agent changed the title from "[marin] datakit/normalize: drop docs with pathological whitespace runs" to "[marin] datakit/normalize: compact pathological whitespace runs" on Apr 10, 2026
Adds a max_whitespace_run_chars option (default 128) to the datakit
normalization step. Consecutive whitespace runs exceeding the limit are
truncated to that length — preserving the surrounding content rather
than dropping the entire document. This handles broken HTML→text
extraction artifacts (e.g. multi-MB space runs, cf. #4588) that can
OOM downstream tokenization, while keeping the actual useful text.

Affected records are counted via the
datakit_normalize_compacted_whitespace Zephyr counter. The document id
is recomputed after compaction to reflect the new text content.

Follow-up to #4600, which caps homogeneous runs inside the tokenizer.
ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 764e0d6 to fe0f9a5 on April 10, 2026 00:23
@ravwojdyla-agent
Contributor Author

🤖 Whitespace compaction benchmark results (3 runs on Iris, 2 CPUs / 2GB RAM each)

Compared three implementations: re.sub (current), re.split+join, and manual char-by-char scan.

| Implementation | Scenario | Time (s) | Peak MB | MB/s |
|---|---|---|---|---|
| re.sub (current) | 10KB, 5x 1K-space runs | 0.168 | 0.02 | ~85 |
| re.split+join | 10KB, 5x 1K-space runs | 0.179 | 0.02 | ~80 |
| manual scan | 10KB, 5x 1K-space runs | 6.34 | 0.02 | ~2.3 |
| re.sub (current) | 100KB, 10x 10K-space runs | 0.341 | 0.19 | ~112 |
| re.split+join | 100KB, 10x 10K-space runs | 0.370 | 0.19 | ~103 |
| manual scan | 100KB, 10x 10K-space runs | 20.4 | 0.19 | ~1.9 |
| re.sub (current) | 1MB, 10x 100K-space runs | 0.346 | 1.91 | ~110 |
| re.split+join | 1MB, 10x 100K-space runs | 0.365 | 1.91 | ~104 |
| re.sub (current) | 1MB, 5x 1M-space runs | 0.255 | 1.91 | ~225 |
| re.split+join | 1MB, 5x 1M-space runs | 0.286 | 5.72 | ~200 |
| manual scan | 1MB, 5x 1M-space runs | 37.7 | 1.91 | ~1.5 |
| re.sub (current) | 10MB, 10x 1M-space runs | 0.506 | 19.08 | ~113 |
| re.split+join | 10MB, 10x 1M-space runs | 0.547 | 19.08 | ~105 |
| re.sub (current) | 1MB, no pathological ws | 0.287 | 0.00 | ~66 |
| re.split+join | 1MB, no pathological ws | 0.315 | 0.00 | ~60 |

Takeaways:

  • re.sub (current) wins across the board — ~7% faster than re.split+join, less peak memory on the 1M-space-run case (1.91 vs 5.72 MB).
  • Manual char-by-char scan is ~50x slower — Python loop overhead dominates.
  • Peak memory scales with document size (19MB for 10MB docs); the string itself dominates, not the implementation.
  • Overhead on clean documents (no pathological whitespace) is modest: throughput stays at ~66 MB/s.

Sticking with re.sub.
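
For reference, a rough sketch of what the three compared variants might look like; the function names, the 128-char limit, and the timing harness are assumptions rather than the PR's actual benchmark code:

```python
import re
import time

LIMIT = 128  # assumed max_whitespace_run_chars for the benchmark
RUN_RE = re.compile(r"\s{%d,}" % (LIMIT + 1))


def compact_re_sub(text: str) -> str:
    # Single regex substitution; the scan happens inside the regex engine,
    # so Python-level overhead is paid once per pathological run.
    return RUN_RE.sub(lambda m: m.group(0)[:LIMIT], text)


def compact_split_join(text: str) -> str:
    # Split on over-long runs (captured so they survive the split),
    # truncate each captured run, then rejoin.
    parts = re.split(r"(\s{%d,})" % (LIMIT + 1), text)
    return "".join(p[:LIMIT] if len(p) > LIMIT and p.isspace() else p for p in parts)


def compact_manual(text: str) -> str:
    # Character-by-character scan; the pure-Python loop is what makes this
    # variant roughly 50x slower in the measurements above.
    out: list[str] = []
    run = 0
    for ch in text:
        if ch.isspace():
            run += 1
            if run > LIMIT:
                continue  # drop whitespace beyond the cap
        else:
            run = 0
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Roughly the "5x 1M-space runs" scenario from the table above.
    doc = ("word " * 400 + " " * 1_000_000) * 5
    for fn in (compact_re_sub, compact_split_join, compact_manual):
        start = time.perf_counter()
        fn(doc)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```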

ravwojdyla requested review from Helw150 and rjpower on April 10, 2026 01:42
rjpower removed their request for review on April 11, 2026 01:29