Online DPO new changes #2794
base: main
Conversation
unsloth/models/rl.py
Outdated
else:
    loss = None
with self.compute_loss_context_manager():
    tokenized_output = self.processing_class(inputs["prompt"], padding=True, truncation=True, return_tensors="pt").to(model.device)
tokenized_output? Should it be tokenized_input?
That is correct, I just corrected it to inputs.
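For reference, a minimal, self-contained sketch of that tokenization step with the corrected name. It assumes processing_class is a Hugging Face tokenizer, as in the rl.py snippet; the model name and prompt batch here are illustrative stand-ins, not code from this PR.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for self.processing_class
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

inputs = {"prompt": ["What is the capital of France?", "Name a prime number."]}
device = "cuda" if torch.cuda.is_available() else "cpu"

# The renamed variable: tokenized prompts (previously called "tokenized_output").
tokenized_input = tokenizer(
    inputs["prompt"], padding=True, truncation=True, return_tensors="pt"
).to(device)
print(tokenized_input["input_ids"].shape)
```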
@pluesclues Whenever it's done, ping me!
Datta0 left a comment
LGTM
unsloth/models/llama.py
Outdated
padding_mask = None
elif self.training:
# elif attention_mask is None:
elif self.training and os.environ.get("UNSLOTH_KEEP_PADDING", "0") != '1':
Re "UNSLOTH_KEEP_PADDING" where is this set exactly?
This would be set at the beginning of the script, before loading, so the attention calculations account for right-padded tokens. This is something the user would set when using LLM reward models for Online DPO.
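A minimal sketch of what that looks like in a user script, assuming the flag is read as in the llama.py snippet above; the model name is illustrative.

```python
import os

# Set before loading/patching so attention calculations keep right-padded tokens;
# needed when using LLM reward models for Online DPO, per the discussion above.
os.environ["UNSLOTH_KEEP_PADDING"] = "1"

from unsloth import FastLanguageModel  # import after the flag is set

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative model name
    max_seq_length=2048,
    load_in_4bit=True,
)
```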
Is there a way to make it automatic?
There is a way to make this automatic: check whether any samples in the batch have right-padded tokens and, if they do, enable the flag immediately.
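A sketch of one way that automatic check could look, assuming a standard 0/1 attention mask where right padding shows up as trailing zeros; the helper name is hypothetical and not part of this PR.

```python
import os
import torch

def maybe_enable_keep_padding(attention_mask: torch.Tensor) -> None:
    """Hypothetical helper: enable UNSLOTH_KEEP_PADDING if any sample in the
    batch is right padded (i.e. its attention mask ends in a 0)."""
    has_right_padding = (attention_mask[:, -1] == 0).any().item()
    if has_right_padding:
        os.environ["UNSLOTH_KEEP_PADDING"] = "1"

# Example: the second sample is right padded, so the flag gets set.
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])
maybe_enable_keep_padding(mask)
print(os.environ.get("UNSLOTH_KEEP_PADDING"))  # "1"
```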
Now that AutoSequenceForClassification / reward modeling support was added and I verified that the run works correctly, this PR is much simpler than the previous one for getting Online DPO integrated.
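A rough sketch of the reward-model path that comment refers to: a sequence-classification model used as the reward model for Online DPO. The trainer wiring follows TRL's OnlineDPOTrainer interface as I recall it (argument names such as reward_model and reward_processing_class should be checked against the installed TRL version), and the model/dataset names are illustrative, not code from this PR.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative
    max_seq_length=2048,
    load_in_4bit=True,
)

# Reward model loaded with a sequence-classification head (num_labels=1 -> scalar reward).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1
)
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")

train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=OnlineDPOConfig(output_dir="online-dpo-out", per_device_train_batch_size=2),
    train_dataset=train_dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
)
trainer.train()
```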