Online DPO new changes #2794
base: main
Conversation
unsloth/models/rl.py
Outdated
else:
    loss = None
with self.compute_loss_context_manager():
    tokenized_output = self.processing_class(inputs["prompt"], padding=True, truncation=True, return_tensors="pt").to(model.device)
tokenized_output? Should it be tokenized_input?
That is correct, I just corrected it to inputs.
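For reference, a minimal, self-contained sketch of that tokenization step with the corrected name. It assumes processing_class is a Hugging Face tokenizer, as in the rl.py snippet; the model name and prompt batch here are illustrative stand-ins, not code from this PR.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for self.processing_class
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

inputs = {"prompt": ["What is the capital of France?", "Name a prime number."]}
device = "cuda" if torch.cuda.is_available() else "cpu"

# The renamed variable: tokenized prompts (previously called "tokenized_output").
tokenized_input = tokenizer(
    inputs["prompt"], padding=True, truncation=True, return_tensors="pt"
).to(device)
print(tokenized_input["input_ids"].shape)
```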
@pluesclues Whenever it's done, ping me!
Datta0 left a comment
LGTM
unsloth/models/llama.py
Outdated
padding_mask = None
elif self.training:
# elif attention_mask is None:
elif self.training and os.environ.get("UNSLOTH_KEEP_PADDING", "0") != '1':
Re "UNSLOTH_KEEP_PADDING" where is this set exactly?
This would be set at the beginning of the script, before loading, so the attention calculations account for right-padded tokens. This is something the user would set when using LLM reward models for Online DPO.
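A minimal sketch of what that looks like in a user script, assuming the flag is read as in the llama.py snippet above; the model name is illustrative.

```python
import os

# Set before loading/patching so attention calculations keep right-padded tokens;
# needed when using LLM reward models for Online DPO, per the discussion above.
os.environ["UNSLOTH_KEEP_PADDING"] = "1"

from unsloth import FastLanguageModel  # import after the flag is set

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative model name
    max_seq_length=2048,
    load_in_4bit=True,
)
```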
Is there a way to make it automatic?
There is a way to make this automatic: check whether any samples in the batch have right-padded tokens and, if they do, enable the flag immediately.
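A sketch of one way that automatic check could look, assuming a standard 0/1 attention mask where right padding shows up as trailing zeros; the helper name is hypothetical and not part of this PR.

```python
import os
import torch

def maybe_enable_keep_padding(attention_mask: torch.Tensor) -> None:
    """Hypothetical helper: enable UNSLOTH_KEEP_PADDING if any sample in the
    batch is right padded (i.e. its attention mask ends in a 0)."""
    has_right_padding = (attention_mask[:, -1] == 0).any().item()
    if has_right_padding:
        os.environ["UNSLOTH_KEEP_PADDING"] = "1"

# Example: the second sample is right padded, so the flag gets set.
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])
maybe_enable_keep_padding(mask)
print(os.environ.get("UNSLOTH_KEEP_PADDING"))  # "1"
```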
Now that AutoSequenceForClassification / reward modeling support was added and I verified that the run works correctly, this PR is much simpler than the previous one for getting Online DPO integrated.
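A rough sketch of the reward-model path that comment refers to: a sequence-classification model used as the reward model for Online DPO. The trainer wiring follows TRL's OnlineDPOTrainer interface as I recall it (argument names such as reward_model and reward_processing_class should be checked against the installed TRL version), and the model/dataset names are illustrative, not code from this PR.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative
    max_seq_length=2048,
    load_in_4bit=True,
)

# Reward model loaded with a sequence-classification head (num_labels=1 -> scalar reward).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1
)
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")

train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=OnlineDPOConfig(output_dir="online-dpo-out", per_device_train_batch_size=2),
    train_dataset=train_dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
)
trainer.train()
```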