
Fix get_wikitext2 tokenization bug causing sequence length warning#2407

Open
Mr-Neutr0n wants to merge 1 commit into huggingface:main from Mr-Neutr0n:fix/get-wikitext2-tokenization-bug

Conversation

@Mr-Neutr0n

Summary

Fixes #2020

get_wikitext2 was concatenating up to 1000 dataset entries into a single string and tokenizing it all at once. This produced a 73K+ token sequence that exceeds the model's maximum sequence length, triggering the warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
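Roughly the pattern the old code followed, sketched with a toy whitespace tokenizer (the names here are illustrative stand-ins, not the actual repository code):

```python
# Toy whitespace "tokenizer" standing in for a real HF tokenizer (hypothetical).
def toy_tokenize(text):
    return text.split()

def bulk_tokenize(texts, limit=1000):
    # Old pattern: join up to `limit` dataset entries, then tokenize once.
    joined = " ".join(texts[:limit])
    # Result is one long sequence whose length grows with `limit`,
    # e.g. 1000 entries of ~70 tokens each -> ~70K tokens,
    # well past a 2048-token model maximum.
    return toy_tokenize(joined)
```

Tokenizing the whole concatenation in a single call is what trips the length check inside the tokenizer, even though the function only ever slices seqlen-sized windows out of the result afterwards.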

This fix changes the function to tokenize individual samples instead, consistent with how get_c4 and get_c4_new already work. Each sample is tokenized separately and retried with a different random sample if it is shorter than the requested seqlen.

Changes

  • Removed concatenation of 1000 text entries into a single string
  • Removed bulk tokenization of the entire concatenated text
  • Added per-sample tokenization with retry loop (matching get_c4 / get_c4_new pattern)
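The per-sample retry pattern described above can be sketched as follows. `StubTokenizer` and `sample_wikitext2` are hypothetical stand-ins for illustration; the real change mirrors the existing get_c4 / get_c4_new code and uses the caller-supplied tokenizer:

```python
import random

# Minimal stand-in for a HF-style tokenizer (hypothetical): returns a dict
# with "input_ids", one id per whitespace-separated word.
class StubTokenizer:
    def __call__(self, text):
        return {"input_ids": [hash(w) % 1000 for w in text.split()]}

def sample_wikitext2(texts, tokenizer, nsamples, seqlen, seed=0):
    """Per-sample tokenization with a retry loop (the get_c4-style pattern):
    pick a random entry, tokenize it alone, and retry if it is too short.
    Assumes at least one entry tokenizes to >= seqlen tokens."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < nsamples:
        text = texts[rng.randrange(len(texts))]
        ids = tokenizer(text)["input_ids"]
        if len(ids) < seqlen:
            continue  # too short for the requested seqlen; draw another sample
        start = rng.randrange(len(ids) - seqlen + 1)
        samples.append(ids[start : start + seqlen])
    return samples
```

Because each tokenizer call only ever sees a single dataset entry, no call produces a sequence anywhere near the model's maximum length, so the warning never fires.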

Test plan

  • Verify get_wikitext2(tokenizer, nsamples=128, seqlen=32, split="train") no longer produces the sequence length warning
  • Verify the returned dataset has the correct number of samples and shape
  • Verify quantization workflows using wikitext2 dataset still work correctly

The function was concatenating up to 1000 dataset entries into a single
string and tokenizing it all at once, producing a 73K+ token sequence
that exceeds model maximum sequence length limits.

Fix by tokenizing individual samples instead, consistent with how
get_c4 and get_c4_new already work. Each sample is tokenized separately
and retried if too short for the requested seqlen.

Fixes huggingface#2020
@Mr-Neutr0n
Author

Friendly bump! Let me know if there's anything I should update or improve to help move this forward.



Development

Successfully merging this pull request may close these issues.

get_wikitext2 has bug

1 participant