Version requirement: ms-swift>=3.10
REINFORCE++ Baseline is a simplified version of the REINFORCE++ algorithm, designed for outcome rewards (response-level scalar rewards). Similar to GRPO, it samples multiple model outputs for each prompt and uses an intra-group baseline to estimate advantages. The key difference lies in the statistics used for normalization.
For clarity, we explain REINFORCE++ Baseline by contrasting it with GRPO (Group Relative Policy Optimization).
Both GRPO and REINFORCE++ Baseline estimate advantages via intra-group comparisons. Their main differences are:
## GRPO (Group Relative Policy Optimization)
For each prompt, GRPO generates $G$ responses and computes a scalar reward $R_i$ for each response. The advantage is estimated by normalizing each reward with the intra-group mean and std:

$$
\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}
$$

When `scale_rewards='batch'` is set, it uses the batch-level std of the original rewards:

$$
\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_k\}_{k=1}^{N}\right)}
$$

where $N$ is the total number of responses in the global batch.
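The group- and batch-level normalization described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the trainer's internal code; the `(num_prompts, G)` reward layout and the `eps` guard are assumptions:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, scale: str = "group") -> np.ndarray:
    """Sketch of GRPO advantage estimation.

    rewards: shape (num_prompts, G), one row of G rewards per prompt.
    scale:   'group' divides by each group's std, 'batch' by the std
             of all raw rewards in the batch.
    """
    eps = 1e-6  # guard against division by zero for constant-reward groups
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    if scale == "group":
        std = rewards.std(axis=1, keepdims=True)
    else:  # 'batch'
        std = rewards.std()
    return centered / (std + eps)

# Two prompts, G = 3 samples each (made-up rewards)
R = np.array([[1.0, 0.0, 0.5],
              [10.0, 8.0, 9.0]])
adv = grpo_advantages(R, scale="group")
```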
## REINFORCE++ Baseline
For each prompt, REINFORCE++ Baseline also generates $G$ responses. It first subtracts the intra-group mean from each reward:

$$
\tilde{A}_i = R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)
$$

and then normalizes these group-mean-subtracted rewards at the batch level:

$$
\hat{A}_i = \frac{\tilde{A}_i - \operatorname{mean}\left(\{\tilde{A}_k\}_{k=1}^{N}\right)}{\operatorname{std}\left(\{\tilde{A}_k\}_{k=1}^{N}\right)}
$$

where $N$ is the total number of responses in the global batch.
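The same sketch style applies to REINFORCE++ Baseline: subtract the group mean first, then standardize over the whole batch. Again, this is illustrative NumPy, not the trainer's actual implementation:

```python
import numpy as np

def reinforce_pp_baseline_advantages(rewards: np.ndarray) -> np.ndarray:
    """Sketch of REINFORCE++ Baseline advantage estimation.

    rewards: shape (num_prompts, G). Subtract each group's mean, then
    normalize by the mean/std of the mean-subtracted values over the batch.
    """
    eps = 1e-6  # guard against a degenerate all-equal batch
    a_tilde = rewards - rewards.mean(axis=1, keepdims=True)  # group baseline
    return (a_tilde - a_tilde.mean()) / (a_tilde.std() + eps)  # batch norm

# Same made-up rewards as before: two prompts, G = 3 samples each
R = np.array([[1.0, 0.0, 0.5],
              [10.0, 8.0, 9.0]])
adv = reinforce_pp_baseline_advantages(R)
```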
**Key Difference:**

- GRPO: uses the std of the original rewards $R$ for normalization
- REINFORCE++ Baseline: uses the std of the group-mean-subtracted rewards $\tilde{A}$ for normalization
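A tiny numeric example (with made-up rewards) makes this difference concrete: when prompt difficulty varies, the std of the raw rewards is dominated by the gap between group means, while the std of the group-mean-subtracted rewards reflects only within-group spread:

```python
import numpy as np

# Two prompts with very different reward scales, G = 2 samples each
R = np.array([[0.0, 1.0],
              [100.0, 101.0]])

centered = R - R.mean(axis=1, keepdims=True)  # [[-0.5, 0.5], [-0.5, 0.5]]

std_raw = R.std()              # ~50: inflated by the gap between prompt means
std_centered = centered.std()  # 0.5: within-group spread only

adv_grpo_batch = centered / std_raw                    # tiny advantages
adv_rpp = (centered - centered.mean()) / std_centered  # unit-scale advantages
```

Here GRPO's batch-scaled advantages come out roughly 100x smaller than REINFORCE++ Baseline's, even though the within-group preferences are identical.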
Similar to RLOO, REINFORCE++ Baseline integrates the KL divergence directly into the reward:

$$
R_i' = R_i - \beta \, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

where $\beta$ is the KL regularization coefficient (set via `--beta`), and $\pi_{\text{ref}}$ is the reference policy.
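A minimal sketch of folding the KL penalty into the scalar reward. The per-token `k1` estimator (log-prob difference summed over response tokens) and the `beta` value are illustrative assumptions; ms-swift's actual KL estimation may differ:

```python
import numpy as np

def reward_with_kl(reward, logprobs, ref_logprobs, beta=0.04):
    """Sketch of kl_in_reward=true: subtract beta * KL from the reward.

    Approximates KL(pi_theta || pi_ref) with the simple k1 estimator,
    sum_t (logp_t - ref_logp_t), over the response tokens.
    """
    kl = np.sum(np.asarray(logprobs) - np.asarray(ref_logprobs))
    return reward - beta * kl

# Made-up per-token log-probs for a 2-token response
r = reward_with_kl(1.0, [-0.5, -1.0], [-0.6, -1.3], beta=0.1)
```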
We can implement REINFORCE++ Baseline training with the `GRPOTrainer` by configuring the following parameters:

```bash
--advantage_estimator reinforce_plus_plus \
--scale_rewards batch \
--kl_in_reward true
```

For training examples, please refer to this script.
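A fuller launch command might look like the sketch below; the model, dataset, and hyperparameter values are illustrative placeholders, not taken from this document:

```bash
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --advantage_estimator reinforce_plus_plus \
    --scale_rewards batch \
    --kl_in_reward true \
    --num_generations 8 \
    --beta 0.04
```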
- `--advantage_estimator`: selects the advantage estimation method
  - `grpo` (default): uses the std of the original rewards for normalization
  - `reinforce_plus_plus`: uses the std of the group-mean-subtracted rewards for normalization
- `--kl_in_reward`: controls where the KL divergence regularization term is applied
  - `false` (GRPO default): KL divergence is an independent regularization term in the loss function
  - `true`: KL divergence is subtracted directly from the reward (REINFORCE++ original implementation)
- `--scale_rewards`: controls the normalization method
  - `group` (default): intra-group normalization
  - `batch`: global batch-level normalization (REINFORCE++ original implementation)
  - `none`: no normalization
- `--num_generations`: number of samples generated per prompt ($G$)
- `--beta`: KL divergence regularization coefficient ($\beta$)
For other parameters, please refer to GRPO Parameters.