🧸 Fix unset tokenizer pad_token #3290


Merged
qgallouedec merged 9 commits into huggingface:main on Apr 22, 2025

Conversation

@LeonEricsson (Contributor) commented Apr 14, 2025

What does this PR do?

Fixes #3287

This quietly solves the issue where an auto-initialized tokenizer lacks a padding token, by setting it to the EOS token. I considered adding a `pad_token` input parameter to `GRPOTrainer`, but at that point you may as well initialize the tokenizer yourself and pass it to `GRPOTrainer` via `processing_class=`.
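
For context, here is a minimal sketch of that workaround: passing a pre-configured tokenizer as `processing_class`. The model name, toy dataset, and reward function are illustrative placeholders, not part of this PR:

```python
from datasets import Dataset
from transformers import AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

# Pre-configure the tokenizer so its pad_token is set explicitly.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Toy dataset and reward function, only to make the sketch self-contained.
dataset = Dataset.from_dict({"prompt": ["Write a haiku.", "Say hello."]})

def reward_len(completions, **kwargs):
    # Reward shorter completions (illustrative only).
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-pad-token-demo"),
    train_dataset=dataset,
    processing_class=tokenizer,  # explicit tokenizer, so the auto-init path is skipped
)
```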

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@@ -358,6 +358,8 @@ def __init__(
        # Processing class
        if processing_class is None:
            processing_class = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left")
            if processing_class.pad_token is None:
                processing_class.pad_token = processing_class.eos_token

@LeonEricsson (Contributor, Author) commented Apr 14, 2025

Should we warn/inform the user of this default behaviour?
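
If we did want to surface it, a hedged sketch of what such a notice could look like, in the context of the snippet above (the discussion below ultimately favoured documenting the fallback rather than warning):

```python
import warnings

if processing_class.pad_token is None:
    # Hypothetical: inform the user about the fallback before applying it.
    warnings.warn(
        "The processing class has no pad_token; falling back to its eos_token.",
        UserWarning,
    )
    processing_class.pad_token = processing_class.eos_token
```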

@LeonEricsson changed the title from "Tokenizer pad token" to "Fix unset tokenizer pad_token" on Apr 14, 2025
@qgallouedec (Member)

For the record, setting the pad_token when it's not set doesn't seem like good practice to me. For more context, see #3200.
Nevertheless, I can't think of a better solution here. But as you point out, the user must be informed. The best thing to do is to add a line to the processing_class argument doc, noting that if the processing class has no pad_token, it will be set to the eos token.
Let's avoid a warning, as this can be the desired behaviour.

@LeonEricsson (Contributor, Author) commented Apr 16, 2025

On second thought, how about matching the approach of SFTTrainer and defining it as a config parameter?

pad_token (`int` or `None`, *optional*, defaults to `None`):
    Token used for padding. If `None`, it defaults to `processing_class.pad_token`, or if that is also `None`,
    it falls back to `processing_class.eos_token`.

This makes the default behaviour very explicit; I fear it would get drowned out in the processing_class doc.

My only gripe with the above is stuffing the config with too many parameters; how common is it for a user to want to use a custom padding token?
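
For illustration, a rough sketch of how a config-level pad_token could be resolved inside the trainer, mirroring the docstring above. The names, the resolution order, and treating the value as a token string are assumptions for the sketch, not the actual SFTTrainer or GRPOTrainer code:

```python
# Hypothetical resolution of a config-level pad_token inside the trainer,
# treating the value as a token string for simplicity.
pad_token = args.pad_token or processing_class.pad_token or processing_class.eos_token
if pad_token is None:
    raise ValueError(
        "No padding token found: set `pad_token` in the config or on the tokenizer."
    )
processing_class.pad_token = pad_token
```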

@qgallouedec (Member)

> On second thought, how about matching the approach of SFTTrainer and defining it as a config parameter?

Yes, that's an option as well. But this one would require changing the way we pad the inputs. Currently we rely on the tokenizer for padding.

> My only gripe with the above is stuffing the config with too many parameters; how common is it for a user to want to use a custom padding token?

Hardly ever, I would guess.

I'm good with both options.
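
To illustrate that reliance on the tokenizer for padding, a small sketch of the failure mode from #3287 and the fallback this PR applies. `gpt2` is just an example of a tokenizer that ships without a pad_token:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # no pad_token by default

try:
    tok(["short", "a somewhat longer prompt"], padding=True, return_tensors="pt")
except ValueError as err:
    print(err)  # "Asking to pad but the tokenizer does not have a padding token. ..."

tok.pad_token = tok.eos_token  # the fallback applied by this PR
batch = tok(["short", "a somewhat longer prompt"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```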

@LeonEricsson (Contributor, Author) commented Apr 17, 2025

Chose to document the default behavior of pad_token = eos_token in the processing_class docstring. As you mentioned, the use case for a custom padding token doesn't seem common enough to justify adding a dedicated parameter.

Comment on lines 423 to 424
            if processing_class.pad_token is None:
                processing_class.pad_token = processing_class.eos_token

Member:
Suggested change
-            if processing_class.pad_token is None:
-                processing_class.pad_token = processing_class.eos_token
+        if processing_class.pad_token is None:
+            processing_class.pad_token = processing_class.eos_token

Member:

We also want to run this when the processing class is passed, right?

Contributor Author:

Yes, agreed. But the more I think about it, the less I like the idea of silently setting the padding token... even if it's documented.

@qgallouedec (Member) left a comment:

LGTM!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "Fix unset tokenizer pad_token" to "🧸 Fix unset tokenizer pad_token" on Apr 21, 2025
@qgallouedec merged commit 1faa7f9 into huggingface:main on Apr 22, 2025. 9 checks passed.

Successfully merging this pull request may close these issues.

[GRPOTrainer] Asking to pad but the tokenizer does not have a padding token