feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) #5199
casinca wants to merge 8 commits into huggingface:main
Conversation
I owe some better explanations to facilitate the review concerning importing. From the original implementation below, the author is recomputing the […]. In order to avoid a 2nd log op in TRL, I'm directly clamping in log-space. This is solely to follow the original implementation; otherwise I'm not really sure if reducing from […]. If keeping the original logic and importing […].
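To illustrate the log-space clamping idea, here is a minimal scalar sketch (helper names and the `lo`/`hi` bounds are made up for illustration, not TRL's actual code): clamping the probability ratio and then taking a log is equivalent to clamping the log-ratio directly against the logs of the bounds, which saves the second `log` op.

```python
import math

# Hypothetical sketch; `lo`/`hi` bounds and function names are illustrative.
def clamped_log_ratio_two_ops(logp_new: float, logp_old: float,
                              lo: float = 0.5, hi: float = 2.0) -> float:
    # Original-style logic: exponentiate, clamp the ratio, then log again.
    ratio = math.exp(logp_new - logp_old)
    return math.log(min(max(ratio, lo), hi))

def clamped_log_ratio_logspace(logp_new: float, logp_old: float,
                               lo: float = 0.5, hi: float = 2.0) -> float:
    # Equivalent: clamp the log-ratio against log(lo)/log(hi) directly,
    # avoiding the second log op.
    return min(max(logp_new - logp_old, math.log(lo)), math.log(hi))
```

Both functions return identical values; the second just does less work per call.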
hey @FloyedShen, feel free to share any thoughts on this TRL implementation.

What does this PR do?
This PR implements the VESPO loss, resolves #5196
Official implementation: https://github.com/FloyedShen/VESPO/blob/main/recipe/vespo/code/core_algos.py
Paper: https://huggingface.co/papers/2602.10693
Note:
The paper and the official implementation sometimes use different variable names. To make things clearer, docstrings/comments are a mix of the official implementation and my own writing.
Alternative options:
- Hyperparameters: currently 4 floats (`k_pos`, `lambda_pos`, `k_neg`, `lambda_neg`), but I could reduce them to 2 tuples of 2 floats, e.g. `lambdas = (pos, neg)`, if that's better.
- `w_seq`: I can include it in metrics, but this would force me to return a tuple in `get_gamma_weights` or remove `@staticmethod`. Not sure here what the preference is.
- For efficiency, the TRL VESPO implementation is slightly different than the official one. It's ~25% faster per call on GPU, and tested for equivalence.
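As a concrete sketch of the two hyperparameter layouts discussed above (class names and defaults are illustrative only, not the actual `GRPOConfig` fields):

```python
from dataclasses import dataclass

@dataclass
class VespoArgsFlat:
    # Option 1: four separate floats (the layout currently in the PR).
    # Defaults here are placeholders, not VESPO's recommended values.
    vespo_k_pos: float = 1.0
    vespo_lambda_pos: float = 1.0
    vespo_k_neg: float = 1.0
    vespo_lambda_neg: float = 1.0

@dataclass
class VespoArgsTupled:
    # Option 2: two (pos, neg) tuples instead of four floats.
    vespo_k: tuple = (1.0, 1.0)       # (k_pos, k_neg)
    vespo_lambda: tuple = (1.0, 1.0)  # (lambda_pos, lambda_neg)
```

The tupled layout halves the number of config fields at the cost of slightly less self-documenting names.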
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk
Changes core GRPO training-loss logic and its interaction with vLLM importance sampling, which could affect training stability and correctness across configurations.
Overview
- Adds a new `loss_type="vespo"` to `GRPOTrainer`, implementing VESPO's detached sequence-level Gamma reweighting (`get_gamma_weights`) and integrating it into loss computation/normalization and metrics (`vespo/phi_seq_mean`).
- Extends `GRPOConfig` with four VESPO hyperparameters (`vespo_k_*`, `vespo_lambda_*`) and documents the new loss type; adds a vLLM guard requiring `vllm_importance_sampling_mode` to be `token_truncate` or `token_mask`, and skips the generic per-token IS correction path for VESPO.
- Updates unit tests to exercise the new loss type and adds a VESPO entry + example config to the paper index docs.
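The vLLM guard described above could look roughly like the following (a hypothetical sketch: the function name and error message are illustrative, only the mode names come from the PR description):

```python
# Modes the PR description says are compatible with VESPO.
ALLOWED_IS_MODES = {"token_truncate", "token_mask"}

def check_vespo_vllm_compat(loss_type: str, vllm_importance_sampling_mode: str) -> None:
    """Raise if VESPO is combined with an unsupported vLLM IS mode.

    Hypothetical helper; the real guard lives inside the trainer/config code.
    """
    if loss_type == "vespo" and vllm_importance_sampling_mode not in ALLOWED_IS_MODES:
        raise ValueError(
            "loss_type='vespo' requires vllm_importance_sampling_mode to be "
            "'token_truncate' or 'token_mask', got "
            f"{vllm_importance_sampling_mode!r}."
        )
```

Failing fast at config time like this avoids silently training with a mismatched importance-sampling correction.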
Written by Cursor Bugbot for commit bd7a4a1. This will update automatically on new commits.