[Tokenizer] GLM4 tokenizer does not support padding_side in .pad() (transformers latest version) #210

Description

@GitHub119ymj

Describe the problem

I'm currently using GLM4 with the latest version of HuggingFace's transformers library in a P-Tuning experiment. While preparing input batches, I ran into an error.

It happens when I call .pad(padding_side="right"), which is a common way in HuggingFace to pad a batch of tokenized inputs to a uniform length.
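
For reference, this is roughly the call that fails (a minimal sketch; the checkpoint name and the trust_remote_code flag are assumptions, adjust them to your setup):

```python
# Minimal sketch of the failing call. The checkpoint name is an assumption;
# substitute whichever GLM4 checkpoint you are actually loading.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

# Two encodings of different lengths, as produced during dataset preprocessing.
batch = [tokenizer("short example"), tokenizer("a noticeably longer second example")]

# Recent transformers releases forward padding_side from pad() down to the
# tokenizer's _pad(); a remote-code tokenizer whose _pad() lacks that keyword
# raises a TypeError at this point.
padded = tokenizer.pad(batch, padding=True, padding_side="right", return_tensors="pt")
```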

My use case

I'm following the HuggingFace-style batching process for fine-tuning, where .pad() is typically used to ensure consistent input shapes. But the GLM4 tokenizer appears to lack support for padding_side, and perhaps even .pad() behavior in general.

What I’ve tried

  • Looked into the tokenizer code; it seems that GLMTokenizer does not inherit the usual pad() behavior from PreTrainedTokenizerFast.
  • Tried manually padding the input sequences (see the sketch after this list), but I'm concerned about whether that matches GLM4's expected behavior, particularly for attention_mask and position_ids.
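
For the second point, this is roughly the manual right-padding I experimented with (a sketch that assumes the model derives position_ids from attention_mask internally, and that the tokenizer exposes a usable pad_token_id; neither assumption is confirmed for GLM4):

```python
import torch

def pad_batch_right(encodings, pad_token_id):
    """Right-pad a list of {'input_ids': [...]} encodings to a common length."""
    max_len = max(len(e["input_ids"]) for e in encodings)
    input_ids, attention_mask = [], []
    for e in encodings:
        ids = list(e["input_ids"])
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)        # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # mask out the padding
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }
```

My worry is whether leaving position_ids to be built inside the model is correct for GLM4, or whether they need to be constructed and padded explicitly as well.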

Questions

  1. What's the recommended way to apply padding when using GLM4 tokenizer?
  2. Is there a compatible data collator or tokenizer wrapper that supports HuggingFace-style padding? (A rough sketch of what I mean follows below.)
  3. Would manually implementing padding + masks be sufficient, or is there a better way to ensure compatibility?
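
To make question 2 concrete, this is the kind of thin wrapper I have in mind. It is an untested sketch and assumes the remote-code _pad() respects the tokenizer.padding_side attribute:

```python
import functools

def accept_padding_side(tokenizer):
    """Wrap tokenizer._pad so the padding_side keyword forwarded by newer
    transformers releases is absorbed instead of raising a TypeError."""
    original_pad = tokenizer._pad

    @functools.wraps(original_pad)
    def _pad(*args, padding_side=None, **kwargs):
        if padding_side is not None:
            # Assumption: the remote-code _pad() reads self.padding_side.
            tokenizer.padding_side = padding_side
        return original_pad(*args, **kwargs)

    tokenizer._pad = _pad
    return tokenizer
```

With something like this in place, tokenizer.pad(batch, padding=True, padding_side="right") should at least not crash, but whether the resulting attention_mask matches what GLM4 expects is exactly what I would like confirmed.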

Environment:

  • GLM model: GLM4
  • Transformers version: latest
  • OS: Ubuntu

Thanks a lot for your help!
