
[marin] datakit/normalize: compact pathological whitespace runs #4603

Open

ravwojdyla-agent wants to merge 1 commit into main from rav/datakit-normalize-filter-long-docs

Conversation

@ravwojdyla-agent
Contributor

ravwojdyla-agent commented Apr 9, 2026

Summary

  • Add a max_whitespace_run_chars option (default 128) to the datakit normalization step. Consecutive whitespace runs exceeding the limit are truncated to that length — preserving the surrounding content rather than dropping the entire document (see the sketch after this list).
  • Handles broken HTML→text extraction artifacts (e.g. multi-MB space runs, cf. datakit/tokenization - cleanup or skip bad records #4588) that can OOM downstream tokenization, while keeping the actual useful text.
  • Affected records are counted via a new Zephyr counter, datakit_normalize_compacted_whitespace. Document id is recomputed after compaction.
  • Follow-up to [levanter] Cap homogeneous-run length when calling Rust BPE tokenizers #4600 (which caps homogeneous runs inside the tokenizer). Pass max_whitespace_run_chars=None to disable.
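
A minimal sketch of the compaction, assuming a hypothetical helper name and signature (only the max_whitespace_run_chars option, the 128 default, and the None-disables behavior come from this PR):

```python
import re


def compact_whitespace_runs(text: str, max_whitespace_run_chars: int | None = 128) -> str:
    """Truncate any consecutive whitespace run longer than the limit.

    Illustrative only; the real normalization step also recomputes the
    document id and increments the datakit_normalize_compacted_whitespace
    Zephyr counter when a document is changed.
    """
    if max_whitespace_run_chars is None:
        return text  # compaction disabled
    # Match runs strictly longer than the limit and keep only their first
    # max_whitespace_run_chars characters, preserving all surrounding text.
    pattern = re.compile(r"\s{%d,}" % (max_whitespace_run_chars + 1))
    return pattern.sub(lambda m: m.group(0)[:max_whitespace_run_chars], text)
```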

Test plan

  • tests/datakit/test_normalize.py — new cases: verifies compaction preserves content and recomputes id; None disables compaction. Existing 10 tests still pass.
  • ./infra/pre-commit.py --fix clean.

ravwojdyla-agent added the agent-generated label (Created by automation/agent) on Apr 9, 2026
@Helw150
Member

Helw150 commented Apr 9, 2026

We definitely don't want to drop things at 1M chars. I think we want a super super conservative definition of pathologically long here.

@ravwojdyla
Contributor

ravwojdyla commented Apr 9, 2026

@Helw150 sounds good. Now looking at the other issue - I am not even sure where it got the 1M chars 🤔

Got my answer:

I made it up — it was a judgment call, not derived from any analysis. PR #4600 caps the tokenizer at 400k chars; I picked 1M as a conservative buffer above that. But I don't have data on actual document length distributions to back it up.
Want me to change it to something else, or would you prefer no default (require callers to specify explicitly)?

@Helw150
Member

Helw150 commented Apr 9, 2026

By gut, I'd say something absurd like 100 megabytes. Pathological here should be super pathological (logic being that models can have context lengths up to 1M nowadays... and the raw data for 1M tokens can be pretty large). The example we are looking at in that gist only ends up being 2M Llama tokens, despite being 250M chars of raw text.

In general, I'm still a little bit stressed about this living hardcoded in a tokenizer-independent way though, since the tokenizer itself is a compression algorithm and this seems like an easy footgun to hit for long-context and/or multimodal models.

@ravwojdyla
Contributor

@Helw150 should we just filter the specific case of unrealistically long sequences of whitespace, or should we filter in general?

ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 6b4e45c to 1a9ce68 on April 10, 2026 00:09
ravwojdyla-agent changed the title from "[marin] datakit/normalize: drop pathologically long documents" to "[marin] datakit/normalize: drop docs with pathological whitespace runs" on Apr 10, 2026
@Helw150
Member

Helw150 commented Apr 10, 2026

In that particular case, I'd be less worried about compacting whitespaces (e.g. convert anything above 128 spaces in a row to 128 spaces). Feels less footgunny and would actually turn that document into useful data (if you strip the repeated whitespace, it's not bad data!)
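
A quick illustration of that idea (illustrative snippet, not code from the PR): a single re.sub with a 128-space cap collapses a multi-MB run while keeping the text on either side.

```python
>>> import re
>>> text = "useful text" + " " * 1_000_000 + "more useful text"
>>> compacted = re.sub(r" {129,}", " " * 128, text)
>>> len(text), len(compacted)
(1000027, 155)
```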

@ravwojdyla-agent
Contributor Author

🤖 Note: the 100k default for max_whitespace_run_chars is not derived from any data analysis — it's an arbitrary round number above the tokenizer's 25k homogeneous-run cap from #4600. Happy to adjust if someone has data on actual whitespace run distributions in our corpora.

ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 1a9ce68 to 764e0d6 on April 10, 2026 00:17
ravwojdyla-agent changed the title from "[marin] datakit/normalize: drop docs with pathological whitespace runs" to "[marin] datakit/normalize: compact pathological whitespace runs" on Apr 10, 2026
Adds a max_whitespace_run_chars option (default 128) to the datakit
normalization step. Consecutive whitespace runs exceeding the limit are
truncated to that length — preserving the surrounding content rather
than dropping the entire document. This handles broken HTML→text
extraction artifacts (e.g. multi-MB space runs, cf. #4588) that can
OOM downstream tokenization, while keeping the actual useful text.

Affected records are counted via the
datakit_normalize_compacted_whitespace Zephyr counter. The document id
is recomputed after compaction to reflect the new text content.

Follow-up to #4600, which caps homogeneous runs inside the tokenizer.
ravwojdyla force-pushed the rav/datakit-normalize-filter-long-docs branch from 764e0d6 to fe0f9a5 on April 10, 2026 00:23
@ravwojdyla-agent
Contributor Author

🤖 Whitespace compaction benchmark results (3 runs on Iris, 2 CPUs / 2GB RAM each)

Compared three implementations: re.sub (current), re.split+join, and manual char-by-char scan.

| Implementation | Scenario | Time (s) | Peak MB | MB/s |
|---|---|---|---|---|
| re.sub (current) | 10KB, 5x 1K-space runs | 0.168 | 0.02 | ~85 |
| re.split+join | 10KB, 5x 1K-space runs | 0.179 | 0.02 | ~80 |
| manual scan | 10KB, 5x 1K-space runs | 6.34 | 0.02 | ~2.3 |
| re.sub (current) | 100KB, 10x 10K-space runs | 0.341 | 0.19 | ~112 |
| re.split+join | 100KB, 10x 10K-space runs | 0.370 | 0.19 | ~103 |
| manual scan | 100KB, 10x 10K-space runs | 20.4 | 0.19 | ~1.9 |
| re.sub (current) | 1MB, 10x 100K-space runs | 0.346 | 1.91 | ~110 |
| re.split+join | 1MB, 10x 100K-space runs | 0.365 | 1.91 | ~104 |
| re.sub (current) | 1MB, 5x 1M-space runs | 0.255 | 1.91 | ~225 |
| re.split+join | 1MB, 5x 1M-space runs | 0.286 | 5.72 | ~200 |
| manual scan | 1MB, 5x 1M-space runs | 37.7 | 1.91 | ~1.5 |
| re.sub (current) | 10MB, 10x 1M-space runs | 0.506 | 19.08 | ~113 |
| re.split+join | 10MB, 10x 1M-space runs | 0.547 | 19.08 | ~105 |
| re.sub (current) | 1MB, no pathological ws | 0.287 | 0.00 | ~66 |
| re.split+join | 1MB, no pathological ws | 0.315 | 0.00 | ~60 |

Takeaways:

  • re.sub (current) wins across the board — ~7% faster than re.split+join, less peak memory on the 1M-space-run case (1.91 vs 5.72 MB).
  • Manual char-by-char scan is ~50x slower — Python loop overhead dominates.
  • Peak memory scales with document size (19MB for 10MB docs); the string itself dominates, not the implementation.
  • Overhead on clean documents (no pathological whitespace) is modest: throughput stays at ~66 MB/s.

Sticking with re.sub.
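
For reference, a rough sketch of what the three compared variants might look like; the function names, the 128-char limit, and the timing harness are assumptions rather than the PR's actual benchmark code:

```python
import re
import time

LIMIT = 128  # assumed max_whitespace_run_chars for the benchmark
RUN_RE = re.compile(r"\s{%d,}" % (LIMIT + 1))


def compact_re_sub(text: str) -> str:
    # Single regex substitution; the scan happens inside the regex engine,
    # so Python-level overhead is paid once per pathological run.
    return RUN_RE.sub(lambda m: m.group(0)[:LIMIT], text)


def compact_split_join(text: str) -> str:
    # Split on over-long runs (captured so they survive the split),
    # truncate each captured run, then rejoin.
    parts = re.split(r"(\s{%d,})" % (LIMIT + 1), text)
    return "".join(p[:LIMIT] if len(p) > LIMIT and p.isspace() else p for p in parts)


def compact_manual(text: str) -> str:
    # Character-by-character scan; the pure-Python loop is what makes this
    # variant roughly 50x slower in the measurements above.
    out: list[str] = []
    run = 0
    for ch in text:
        if ch.isspace():
            run += 1
            if run > LIMIT:
                continue  # drop whitespace beyond the cap
        else:
            run = 0
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Roughly the "5x 1M-space runs" scenario from the table above.
    doc = ("word " * 400 + " " * 1_000_000) * 5
    for fn in (compact_re_sub, compact_split_join, compact_manual):
        start = time.perf_counter()
        fn(doc)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```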

ravwojdyla requested review from Helw150 and rjpower on April 10, 2026 01:42
rjpower removed their request for review on April 11, 2026 01:29