Last updated: Sep 28, 2025
Author: Honghua DONG
REINFORCE Leave One-Out (RLOO), introduced by Ahmadian et al. (2024), is an RL method that removes the need for a value function (critic). Instead, it estimates the baseline by averaging rewards of other sampled responses for the same prompt within the group.
The core objective is:
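As a sketch of the estimator described in Ahmadian et al. (2024): given $k$ sampled responses $y_1, \dots, y_k$ for a prompt $x$ with rewards $R(y_i, x)$, the policy gradient is estimated as

```latex
\frac{1}{k} \sum_{i=1}^{k}
\left[ R(y_i, x) - \frac{1}{k-1} \sum_{j \neq i} R(y_j, x) \right]
\nabla_\theta \log \pi_\theta(y_i \mid x)
```

The bracketed term is the advantage: each response's reward minus the average reward of the other $k-1$ responses sampled for the same prompt, which serves as the critic-free baseline.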
For more details:

- AReaL: Paper of AReaL
- RLOO: Paper of RLOO
We only list the parameters that differ from GRPO here:
- `actor.adv_norm.mean_level`: The level at which the mean of the advantage is computed. Options: `group`, `batch`, or `none`. In RLOO, it is set to `group` by default.
- `actor.adv_norm.mean_leave1out`: Whether to use the leave-one-out average. In RLOO, it is set to `true` by default.
- `actor.adv_norm.std_level`: The level at which the std of the advantage is computed. Options: `group`, `batch`, or `none`. In RLOO, it is set to `none` by default.
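The effect of the default setting (`mean_level: group`, `mean_leave1out: true`, `std_level: none`) can be sketched as follows. This is an illustrative reimplementation, not the framework's actual code; the function name `rloo_advantages` is hypothetical:

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one group of k responses to the same prompt.

    Mirrors mean_level=group, mean_leave1out=true, std_level=none:
    each response's baseline is the mean reward of the *other* k-1
    responses in its group, and no std normalization is applied.
    (Illustrative sketch, not the framework's actual implementation.)
    """
    k = rewards.shape[0]
    if k < 2:
        raise ValueError("leave-one-out requires at least 2 responses per prompt")
    # Leave-one-out mean for response i: (sum of all rewards - r_i) / (k - 1)
    loo_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_mean

# Example: 4 sampled responses, two correct (reward 1) and two incorrect (reward 0)
adv = rloo_advantages(np.array([1.0, 0.0, 1.0, 0.0]))
# -> [ 2/3, -2/3,  2/3, -2/3]; the advantages always sum to zero
```

Because each baseline excludes the response it is applied to, the estimator stays unbiased while still using all $k$ samples.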
We recommend changing these parameters in the configuration file (i.e. `gsm8k_rloo.yaml`).
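For orientation, the three options above might appear in the YAML roughly as below. This is a hypothetical excerpt; only the leaf keys come from the parameter list, and the surrounding structure is an assumption, so check `gsm8k_rloo.yaml` in the repository for the authoritative layout:

```yaml
# Hypothetical excerpt -- verify against the real gsm8k_rloo.yaml
actor:
  adv_norm:
    mean_level: group      # RLOO default
    mean_leave1out: true   # RLOO default
    std_level: none        # RLOO default
```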
| Backend | CMD |
|---|---|
| local | python3 -m areal.launcher.local examples/math/gsm8k_ppo.py --config examples/math/gsm8k_rloo.yaml --<other_args_to_overwrite> |
| ray | python3 -m areal.launcher.ray examples/math/gsm8k_ppo.py --config examples/math/gsm8k_rloo.yaml --<other_args_to_overwrite> |
| slurm | python3 -m areal.launcher.slurm examples/math/gsm8k_ppo.py --config examples/math/gsm8k_rloo.yaml --<other_args_to_overwrite> |
We still lack baseline results for RLOO; contributions are welcome!
