feat: DeepSeek V3.2 Off-policy sequence masking #4689

casinca · 2025-12-13T23:28:34Z

What does this PR do?

Fixes #4697

This PR aims to implement the off-policy sequence masking from the DeepSeek V3.2 paper.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Changes

added a helper _get_off_policy_mask staticmethod to compute the off-policy mask
~~added logic in _compute_loss to inject the off-policy mask into the loss mask~~
added logic in _compute_loss to apply the off-policy mask only to the surrogate loss
added __post_init__ checks and some tests (including disallowing use with the Liger loss for now)
added docs
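
As a rough illustration of the changes listed above, here is a framework-free sketch of how an off-policy sequence mask could be computed and applied. It is plain Python rather than PyTorch and is not the PR's actual `_get_off_policy_mask`. The function names, argument layout, and the masking rule itself are assumptions: the rule used here (drop a sequence when its advantage is negative and the mean of log π_old − log π_θ over its completion tokens exceeds the threshold) is my reading of the DeepSeek V3.2 paper, and the second term of the loss is assumed to be a KL penalty.

```python
def get_off_policy_mask(advantages, old_logps, new_logps, loss_mask, threshold):
    """Return one keep-flag (1.0 keep, 0.0 drop) per sequence. Hypothetical helper.

    advantages: list[float], one value per sequence
    old_logps, new_logps: list[list[float]], per-token log-probs
    loss_mask: list[list[int]], 1 for completion tokens, 0 for padding
    """
    keep = []
    for adv, old, new, mask in zip(advantages, old_logps, new_logps, loss_mask):
        n_tokens = max(sum(mask), 1)  # guard against empty completions
        # Sequence-level mean of log(pi_old / pi_theta) over valid tokens
        mean_log_ratio = sum((o - n) * m for o, n, m in zip(old, new, mask)) / n_tokens
        # Assumed rule: only negative-advantage sequences can be dropped
        drop = adv < 0 and mean_log_ratio > threshold
        keep.append(0.0 if drop else 1.0)
    return keep


def apply_to_surrogate_only(per_token_policy_loss, per_token_kl, keep, beta):
    """Scale only the policy (surrogate) term by the flag; the KL term is untouched."""
    return [
        [p * k + beta * kl for p, kl in zip(seq_p, seq_kl)]
        for seq_p, seq_kl, k in zip(per_token_policy_loss, per_token_kl, keep)
    ]


keep = get_off_policy_mask(
    advantages=[-1.0, -1.0, 1.0],
    old_logps=[[-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0]],
    new_logps=[[-3.0, -3.0], [-1.1, -1.1], [-3.0, -3.0]],
    loss_mask=[[1, 1], [1, 1], [1, 1]],
    threshold=0.5,
)
print(keep)  # only the first sequence (negative advantage, large divergence) is dropped
```

Keeping the flag at the sequence level, instead of folding it into the token-level loss mask, means normalization and any other terms that rely on the loss mask are unaffected.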

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


casinca · 2025-12-15T18:07:00Z

@qgallouedec 👋 I'd like to get your attention on this commit: 861b490

Initially, I made a mistake by injecting the off-policy mask into the loss mask early, so I changed the loss mask to be computed once, upstream.
With the corrected logic I no longer touch the loss mask, but I'd still like to understand why it is recomputed multiple times; I don't see it being modified or overwritten anywhere downstream. In any case, I can revert this specific commit; it won't change my implementation.
Additionally, if the PR looks good, I could add a logged metric, something like an off-policy sequence drop ratio, so that users aren't blind to how much is being masked and can get a better feel for tuning the delta/off_policy_mask_threshold hparam.
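
To make the proposed metric concrete, here is a minimal sketch of what a drop ratio could look like. The function name and the convention that a flag of 0 means "sequence dropped" are assumptions for illustration, not something the PR defines.

```python
def off_policy_drop_ratio(keep_flags):
    """Fraction of sequences in a batch that the off-policy mask dropped.

    keep_flags: list of per-sequence flags, 1 = kept, 0 = dropped (assumed convention).
    """
    if not keep_flags:
        return 0.0  # nothing to report for an empty batch
    dropped = sum(1 for flag in keep_flags if flag == 0)
    return dropped / len(keep_flags)


ratio = off_policy_drop_ratio([0.0, 1.0, 1.0, 0.0])
print(ratio)  # two of four sequences dropped
```

A ratio that stays near zero would suggest the threshold is rarely triggered, while a consistently high ratio would suggest it is too strict for the amount of policy drift in the run.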

casinca added 14 commits December 14, 2025 00:26

adding off_policy_mask_threshold training arg

c9426b9

updated off_policy_mask_threshold docstring and training arg

c783190

added check in __post_init__ for off_policy_mask_threshold

474b3c5

Ensures the threshold is >= 0 in GRPOConfig to prevent invalid configuration.

added _get_off_policy_mask helper method

6a3f0dc

added test test_invalid_off_policy_threshold

795a97a

added check in __post_init__ incompatibility with `self.use_liger_kernel`

9a5fe20

Since we need logprobs, compatibility should be handled on the Liger side. This check prevents users from thinking it would work with the Liger loss.
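
The two `__post_init__` checks described in these commits could look roughly like the sketch below. It uses a standalone dataclass rather than the real `GRPOConfig`; the class name, the `None` default (meaning masking is disabled), the exception type, and the error messages are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the relevant GRPOConfig fields
    off_policy_mask_threshold: Optional[float] = None  # None = masking disabled (assumed)
    use_liger_kernel: bool = False

    def __post_init__(self):
        if self.off_policy_mask_threshold is None:
            return  # feature off, nothing to validate
        if self.off_policy_mask_threshold < 0:
            raise ValueError("off_policy_mask_threshold must be >= 0")
        if self.use_liger_kernel:
            # The mask needs per-token logprobs, which the fused Liger loss does not expose here
            raise ValueError(
                "off_policy_mask_threshold is not supported together with use_liger_kernel"
            )


config = SketchConfig(off_policy_mask_threshold=0.5)
print(config.off_policy_mask_threshold)
```

Failing at construction time keeps the error close to the misconfiguration, instead of surfacing later inside the training loop.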

added test test_liger_kernel_compatibility_with_off_policy_masking

95581c0

feat(_compute_loss): optional off-policy sequence masking

909bee9

refactor(_compute_loss)! mask defined once up top?

861b490

fix test matching regex

2871fc6

docs: less conservative potential default value

1ccf494

added tests for off_policy_mask logic and training

2ab8842

docs(paper_index): added Off-Policy Masking subsection

81c6c3f

fix(_compute_loss): apply off_policy_mask only to the surrogate loss

10ddb03

casinca marked this pull request as ready for review December 15, 2025 17:57

Merge branch 'main' into off-policy-sequence-masking

502dcec
