🧸 Fix unset tokenizer pad_token #3290
Changes from 6 commits
```diff
@@ -312,7 +312,9 @@ def reward_func(completions, **kwargs):
             Dataset to use for evaluation. It must meet the same requirements as `train_dataset`.
         processing_class ([`~transformers.PreTrainedTokenizerBase`], *optional*, defaults to `None`):
             Processing class used to process the data. The padding side must be set to "left". If `None`, the
-            processing class is loaded from the model's name with [`~transformers.AutoTokenizer.from_pretrained`].
+            processing class is loaded from the model's name with [`~transformers.AutoTokenizer.from_pretrained`]. A padding
+            token, `processing_class.pad_token`, must be defined. If not explicitly set, `processing_class.eos_token` will
+            be used as the default padding token.
         reward_processing_classes (`Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]`, *optional*, defaults to `None`):
             Processing classes corresponding to the reward functions specified in `reward_funcs`. Can be either:
```
```diff
@@ -418,6 +420,8 @@ def __init__(
         # Processing class
         if processing_class is None:
             processing_class = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left")
+            if processing_class.pad_token is None:
+                processing_class.pad_token = processing_class.eos_token
```
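The fallback above is small enough to sketch in isolation. The snippet below uses a hypothetical `ensure_pad_token` helper and a plain stand-in object in place of a real Hugging Face tokenizer, purely to illustrate the logic the diff adds:

```python
from types import SimpleNamespace

def ensure_pad_token(processing_class):
    # Mirror of the PR's fallback: if no pad token is set, default it
    # to the EOS token so batched padding works out of the box.
    if processing_class.pad_token is None:
        processing_class.pad_token = processing_class.eos_token
    return processing_class

# Stand-in for a tokenizer that ships without a pad token (e.g. GPT-2).
tok = SimpleNamespace(eos_token="</s>", pad_token=None)
ensure_pad_token(tok)
print(tok.pad_token)  # -> </s>
```

An explicitly set `pad_token` is left untouched, so the fallback only fires when the attribute is `None`.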
> **Review comment:** We also want to run this when the processing class is passed, right?

> **Reply:** Yes, agreed. But the more I think about it, the less I like the idea of silently setting the padding token... even if it's documented.
```diff

         # Reward functions
         if not isinstance(reward_funcs, list):
```
> **Review comment:** Should we warn/inform the user of this default behaviour?
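One way to address the question above: emit a warning when the default kicks in, rather than setting the pad token silently. This is a hedged sketch of that alternative, again using a hypothetical helper and a stand-in object rather than the trainer's actual code:

```python
import warnings
from types import SimpleNamespace

def ensure_pad_token(processing_class):
    # Variant of the fallback that informs the user, per the review
    # question, instead of silently mutating the tokenizer.
    if processing_class.pad_token is None:
        warnings.warn(
            "processing_class.pad_token is None; defaulting to eos_token. "
            "Set pad_token explicitly to silence this warning.",
            UserWarning,
        )
        processing_class.pad_token = processing_class.eos_token
    return processing_class

tok = SimpleNamespace(eos_token="</s>", pad_token=None)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ensure_pad_token(tok)
print(tok.pad_token, len(caught))  # pad token set, one warning recorded
```

The warning fires once per unset tokenizer, which keeps the convenience of the default while making the behaviour visible in logs.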