[marin] datakit/normalize: compact pathological whitespace runs #4603
ravwojdyla-agent wants to merge 1 commit into main
Conversation
We definitely don't want to drop things at 1M chars. I think we want a super super conservative definition of pathologically long here.

@Helw150 sounds good. Now looking at the other issue, I am not even sure where it got the 1M chars 🤔 Got my answer:
By gut, I'd say something absurd like 100 megabytes. Pathological here should be super pathological (the logic being that models can have context lengths up to 1M nowadays... and the raw data for 1M tokens can be pretty large). The example we are looking at in that gist only ends up being 2M Llama tokens, despite being 250M chars of raw text. In general, I'm still a little bit stressed about this living hardcoded in a tokenizer-independent way though, since the tokenizer itself is a compression algorithm and this seems like an easy footgun to hit for long-context and/or multimodal models.

@Helw150 should we just filter the specific case of unrealistically long whitespace runs? Should we filter in general?
Force-pushed from 6b4e45c to 1a9ce68
In that particular case, I'd be less worried about compacting whitespace (e.g. converting anything above 128 spaces in a row to 128 spaces). Feels less footgunny, and it would actually turn that document into useful data (if you strip the repeated whitespace, it's not bad data!).
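The suggestion above can be sketched with a small regex, assuming the cap applies to runs of a single repeated whitespace character (the function name and signature here are illustrative, not the actual marin/datakit API):

```python
import re

def compact_whitespace(text: str, max_run: int = 128) -> str:
    """Truncate any run of a repeated whitespace char to max_run chars.

    Illustrative sketch only; the real normalization step may differ.
    """
    # (\s)\1+ matches two or more copies of the same whitespace character;
    # the replacement keeps at most max_run of them.
    return re.sub(
        r"(\s)\1+",
        lambda m: m.group(1) * min(len(m.group(0)), max_run),
        text,
    )
```

Short runs (ordinary double spaces, blank lines) fall under the cap and pass through untouched; only runs beyond the cap get trimmed.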
🤖 Note: the 100k default for
Force-pushed from 1a9ce68 to 764e0d6
Adds a max_whitespace_run_chars option (default 128) to the datakit normalization step. Consecutive whitespace runs exceeding the limit are truncated to that length — preserving the surrounding content rather than dropping the entire document. This handles broken HTML→text extraction artifacts (e.g. multi-MB space runs, cf. #4588) that can OOM downstream tokenization, while keeping the actual useful text. Affected records are counted via the datakit_normalize_compacted_whitespace Zephyr counter. The document id is recomputed after compaction to reflect the new text content. Follow-up to #4600, which caps homogeneous runs inside the tokenizer.
Force-pushed from 764e0d6 to fe0f9a5
🤖 Whitespace compaction benchmark results (3 runs on Iris, 2 CPUs / 2GB RAM each). Compared three implementations:

Takeaways:

Sticking with
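The three implementations compared in the benchmark are not reproduced above; as a shape for how such a comparison might be run, here are two hypothetical stand-in variants (not the PR's actual candidates) checked for equivalence and timed against a pathological input:

```python
import re
import timeit

MAX = 128
run_re = re.compile(r"(\s)\1{%d,}" % MAX)

def regex_cap(s: str) -> str:
    # Regex-based: replace over-long same-char whitespace runs in one pass.
    return run_re.sub(lambda m: m.group(1) * MAX, s)

def manual_cap(s: str) -> str:
    # Manual scan: walk runs of identical characters, capping whitespace runs.
    out, i, n = [], 0, len(s)
    while i < n:
        c, j = s[i], i + 1
        while j < n and s[j] == c:
            j += 1
        out.append(c * min(j - i, MAX) if c.isspace() else s[i:j])
        i = j
    return "".join(out)

pathological = "word" + " " * 100_000 + "word"
assert regex_cap(pathological) == manual_cap(pathological)
for fn in (regex_cap, manual_cap):
    print(fn.__name__, timeit.timeit(lambda: fn(pathological), number=3))
```

Asserting the variants agree on the pathological input before timing them guards against benchmarking two functions that compute different things.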
Summary
Adds a `max_whitespace_run_chars` option (default 128) to the datakit normalization step. Consecutive whitespace runs exceeding the limit are truncated to that length, preserving the surrounding content rather than dropping the entire document. Affected records are counted via `datakit_normalize_compacted_whitespace`. Document `id` is recomputed after compaction. Set `max_whitespace_run_chars=None` to disable.
Test plan
`tests/datakit/test_normalize.py`: new cases verify that compaction preserves content and recomputes the id, and that `None` disables compaction. Existing 10 tests still pass. `./infra/pre-commit.py --fix` clean.