Fix get_wikitext2 tokenization bug causing sequence length warning #2407
Open
Mr-Neutr0n wants to merge 1 commit into huggingface:main from
Conversation
The function was concatenating up to 1000 dataset entries into a single string and tokenizing it all at once, producing a 73K+ token sequence that exceeds the model's maximum sequence length. The fix tokenizes individual samples instead, consistent with how `get_c4` and `get_c4_new` already work. Each sample is tokenized separately and retried with a different random sample if it is too short for the requested `seqlen`.

Fixes huggingface#2020
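The rendered diff is not shown here, so as a rough illustration only: a minimal sketch of the per-sample approach the description outlines, assuming a Hugging Face `datasets`/`transformers` setup and the return format that `get_c4`-style loaders typically use. The `wikitext-2-raw-v1` config name and the returned dict layout are assumptions, not taken from the PR's actual diff.

```python
import random

import torch
from datasets import load_dataset


def get_wikitext2(tokenizer, nsamples, seqlen, split="train"):
    # Config name is an assumption based on common wikitext-2 usage,
    # not copied from the PR's diff.
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split=split)

    dataset = []
    for _ in range(nsamples):
        while True:
            # Tokenize one randomly chosen entry at a time instead of
            # concatenating ~1000 entries into a single 73K+ token string.
            i = random.randint(0, len(data) - 1)
            enc = tokenizer(data[i]["text"], return_tensors="pt")
            if enc.input_ids.shape[1] >= seqlen:
                break  # long enough; otherwise retry with another sample
        # Slice a random window of exactly `seqlen` tokens.
        start = random.randint(0, enc.input_ids.shape[1] - seqlen)
        input_ids = enc.input_ids[:, start : start + seqlen]
        attention_mask = torch.ones_like(input_ids)
        dataset.append({"input_ids": input_ids, "attention_mask": attention_mask})
    return dataset
```

Because no single `tokenizer(...)` call ever sees more than one dataset entry, the encoded length stays far below the model maximum, which is what silences the warning.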
Author
Friendly bump! Let me know if there's anything I should update or improve to help move this forward.
Summary
Fixes #2020
`get_wikitext2` was concatenating up to 1000 dataset entries into a single string and tokenizing it all at once. This produced a 73K+ token sequence that exceeds the model's maximum sequence length, triggering a sequence length warning.

This fix changes the function to tokenize individual samples instead, consistent with how `get_c4` and `get_c4_new` already work. Each sample is tokenized separately and retried with a different random sample if it is shorter than the requested `seqlen`.

Changes
- Tokenize each sample individually instead of one concatenated string (matching the `get_c4`/`get_c4_new` pattern)

Test plan
- `get_wikitext2(tokenizer, nsamples=128, seqlen=32, split="train")` no longer produces the sequence length warning
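As a rough reproduction of that test plan, assuming the sketch above; the `gpt2` tokenizer here is an arbitrary illustrative choice, since the PR does not name a model:

```python
from transformers import AutoTokenizer

# "gpt2" is an arbitrary example tokenizer, not one specified in the PR.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = get_wikitext2(tokenizer, nsamples=128, seqlen=32, split="train")

# Each returned sample is exactly seqlen tokens, and no single encode call
# exceeds the model maximum, so the "Token indices sequence length is
# longer than ..." warning no longer fires.
assert len(samples) == 128
assert all(s["input_ids"].shape[1] == 32 for s in samples)
```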