Describe the problem
I'm currently using GLM4 with the latest version of HuggingFace's transformers library in a P-Tuning experiment. While preparing input batches, I encountered the following error:
This happens when I try to use .pad(padding_side="right") — a common approach in HuggingFace to pad a batch of tokenized inputs.
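For reference, this is roughly how I'm building the batch (a minimal sketch; the checkpoint name and example texts are placeholders for my actual setup):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name for the GLM4 variant I'm using
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

# Tokenize a small batch without padding, as in the usual HuggingFace workflow
features = [tokenizer(t) for t in ["Hello GLM4", "A somewhat longer example sentence"]]

# This is the call that raises the error for me
batch = tokenizer.pad(features, padding_side="right", return_tensors="pt")
```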
My use case
I'm following the HuggingFace-style batching process for fine-tuning, where .pad() is typically used to ensure consistent input shapes. But the GLM4 tokenizer appears to lack support for padding_side, and perhaps even .pad() behavior in general.
What I’ve tried
- Looked into the tokenizer code — it seems that GLMTokenizer does not inherit the usual pad() method behavior from PreTrainedTokenizerFast.
- Tried manually padding the input sequences (rough sketch below), but I'm concerned about whether that matches GLM4's expected behavior, particularly for attention_mask and position_ids.
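For concreteness, this is roughly what my manual padding looks like (a sketch assuming right padding with the tokenizer's pad_token_id; the position_ids handling is my own guess and is exactly the part I'm unsure about):

```python
import torch

def manually_pad(features, pad_token_id):
    """Right-pad a list of {'input_ids': [...]} features to the longest length.

    attention_mask is 1 for real tokens and 0 for padding; position_ids count
    0..len-1 over the real tokens (this part is my assumption, not documented
    GLM4 behavior).
    """
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask, position_ids = [], [], []
    for f in features:
        ids = f["input_ids"]
        pad_len = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad_len)
        attention_mask.append([1] * len(ids) + [0] * pad_len)
        position_ids.append(list(range(len(ids))) + [0] * pad_len)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "position_ids": torch.tensor(position_ids),
    }
```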
Questions
- What's the recommended way to apply padding when using GLM4 tokenizer?
- Is there a compatible data collator or tokenizer wrapper that supports HuggingFace-style padding?
- Would manually implementing padding + masks be sufficient, or is there a better way to ensure compatibility?
Environment:
- GLM model: GLM4
- Transformers version: latest
- OS: Ubuntu
Thanks a lot for your help!