🧸 Fix unset tokenizer pad_token #3290


Merged
qgallouedec merged 9 commits into huggingface:main on Apr 22, 2025

Conversation

@LeonEricsson (Contributor) commented Apr 14, 2025

What does this PR do?

Fixes #3287

This quietly solves the issue where an auto-initialized tokenizer lacks a padding token, by setting it to the EOS token. I considered adding a `pad_token` input parameter to `GRPOTrainer`, but at that point you may as well initialize the tokenizer yourself and pass it to `GRPOTrainer` via `processing_class=`.
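
For context, here is a minimal sketch of that workaround: passing a pre-configured tokenizer as `processing_class`. The model name, toy dataset, and reward function are illustrative placeholders, not part of this PR:

```python
from datasets import Dataset
from transformers import AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

# Pre-configure the tokenizer so its pad_token is set explicitly.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Toy dataset and reward function, only to make the sketch self-contained.
dataset = Dataset.from_dict({"prompt": ["Write a haiku.", "Say hello."]})

def reward_len(completions, **kwargs):
    # Reward shorter completions (illustrative only).
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-pad-token-demo"),
    train_dataset=dataset,
    processing_class=tokenizer,  # explicit tokenizer, so the auto-init path is skipped
)
```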

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@@ -358,6 +358,8 @@ def __init__(
        # Processing class
        if processing_class is None:
            processing_class = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left")
            if processing_class.pad_token is None:
                processing_class.pad_token = processing_class.eos_token

@LeonEricsson (Contributor, Author) commented Apr 14, 2025

Should we warn/inform the user of this default behaviour?
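
If we did want to surface it, a hedged sketch of what such a notice could look like, in the context of the snippet above (the discussion below ultimately favoured documenting the fallback rather than warning):

```python
import warnings

if processing_class.pad_token is None:
    # Hypothetical: inform the user about the fallback before applying it.
    warnings.warn(
        "The processing class has no pad_token; falling back to its eos_token.",
        UserWarning,
    )
    processing_class.pad_token = processing_class.eos_token
```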

@LeonEricsson changed the title from "Tokenizer pad token" to "Fix unset tokenizer pad_token" on Apr 14, 2025
@qgallouedec (Member)

For the record, setting the pad_token when it's not set doesn't seem like good practice to me. For more context, see #3200.
Nevertheless, I can't think of a better solution here. But as you point out, the user must be informed. The best thing to do is to add a line to the processing_class argument doc, noting that if the processing class has no pad_token, it will be set to the eos token.
Let's avoid a warning, as this can be the desired behaviour.

@LeonEricsson (Contributor, Author) commented Apr 16, 2025

On second thought, how about matching the approach of SFTTrainer and defining it as a config parameter?

pad_token (`int` or `None`, *optional*, defaults to `None`):
    Token used for padding. If `None`, it defaults to `processing_class.pad_token`, or if that is also `None`,
    it falls back to `processing_class.eos_token`.

This makes the default behaviour very explicit; I fear it would get drowned out in the processing_class doc.

My only gripe with the above is stuffing the config with too many parameters; how common is it for a user to want to use a custom padding token?
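
For illustration, a rough sketch of how a config-level pad_token could be resolved inside the trainer, mirroring the docstring above. The names, the resolution order, and treating the value as a token string are assumptions for the sketch, not the actual SFTTrainer or GRPOTrainer code:

```python
# Hypothetical resolution of a config-level pad_token inside the trainer,
# treating the value as a token string for simplicity.
pad_token = args.pad_token or processing_class.pad_token or processing_class.eos_token
if pad_token is None:
    raise ValueError(
        "No padding token found: set `pad_token` in the config or on the tokenizer."
    )
processing_class.pad_token = pad_token
```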

@qgallouedec (Member)

> On second thought, how about matching the approach of SFTTrainer and defining it as a config parameter?

Yes, that's an option as well. But this one would require changing the way we pad the inputs. Currently we rely on the tokenizer for padding.

> My only gripe with the above is stuffing the config with too many parameters; how common is it for a user to want to use a custom padding token?

Hardly ever, I would guess.

I'm good with both options.
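
To illustrate that reliance on the tokenizer for padding, a small sketch of the failure mode from #3287 and the fallback this PR applies. `gpt2` is just an example of a tokenizer that ships without a pad_token:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # no pad_token by default

try:
    tok(["short", "a somewhat longer prompt"], padding=True, return_tensors="pt")
except ValueError as err:
    print(err)  # "Asking to pad but the tokenizer does not have a padding token. ..."

tok.pad_token = tok.eos_token  # the fallback applied by this PR
batch = tok(["short", "a somewhat longer prompt"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```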

@LeonEricsson (Contributor, Author) commented Apr 17, 2025

Chose to document the default behavior of pad_token = eos_token in the processing_class docstring. As you mentioned, the use case for a custom padding token doesn't seem common enough to justify adding a dedicated parameter.

Comment on lines 423 to 424
            if processing_class.pad_token is None:
                processing_class.pad_token = processing_class.eos_token

Member:
Suggested change
-            if processing_class.pad_token is None:
-                processing_class.pad_token = processing_class.eos_token
+        if processing_class.pad_token is None:
+            processing_class.pad_token = processing_class.eos_token

Member:

We also want to run this when the processing class is passed, right?

Contributor Author:

Yes, agreed. But the more I think about it, the less I like the idea of silently setting the padding token... even if it's documented.

@qgallouedec (Member) left a comment:

LGTM!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "Fix unset tokenizer pad_token" to "🧸 Fix unset tokenizer pad_token" on Apr 21, 2025
@qgallouedec merged commit 1faa7f9 into huggingface:main on Apr 22, 2025. 9 checks passed.

Successfully merging this pull request may close these issues.

[GRPOTrainer] Asking to pad but the tokenizer does not have a padding token