⚠️ Please check that this feature request hasn't been suggested before.
🔖 Feature description
RLOO (REINFORCE Leave-One-Out) and REINFORCE++ are closely related online-RL algorithms. They share the same rollout-and-reward structure and differ mainly in the baseline used for advantage estimation.
✔️ Solution
- Add an advantage-estimator selector, e.g. by extending loss_type
- Implement the differing advantage computations in the GRPO strategy/trainer
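
To make the request concrete, here is a rough sketch of how the advantage computation might branch on a selector. The function name `compute_advantages`, the `estimator` argument, and its string values are hypothetical and not part of any existing config. The REINFORCE++ branch is simplified to batch-level normalization and leaves out the per-token KL reward shaping, and the GRPO branch uses the population standard deviation as an arbitrary choice.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # hypothetical small constant to avoid division by zero


def compute_advantages(rewards, estimator="grpo"):
    """Return per-sample advantages for grouped rewards.

    rewards: list of groups, one group per prompt, each a list of K scalar rewards.
    estimator: hypothetical selector, one of "grpo", "rloo", "reinforce_pp".
    """
    if estimator == "grpo":
        # Group-relative: subtract the group mean, divide by the group std.
        out = []
        for group in rewards:
            mu, sigma = fmean(group), pstdev(group)
            out.append([(r - mu) / (sigma + EPS) for r in group])
        return out

    if estimator == "rloo":
        # Leave-one-out: baseline for sample i is the mean of the other K-1 rewards.
        out = []
        for group in rewards:
            k = len(group)
            if k < 2:
                raise ValueError("RLOO needs at least 2 samples per prompt")
            total = sum(group)
            out.append([r - (total - r) / (k - 1) for r in group])
        return out

    if estimator == "reinforce_pp":
        # Simplified REINFORCE++: normalize over the whole batch, not per group.
        flat = [r for group in rewards for r in group]
        mu, sigma = fmean(flat), pstdev(flat)
        return [[(r - mu) / (sigma + EPS) for r in group] for group in rewards]

    raise ValueError(f"unknown estimator: {estimator}")


# Two prompts, four rollouts each.
rewards = [[1.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, 0.0]]
grpo = compute_advantages(rewards, "grpo")
rloo = compute_advantages(rewards, "rloo")
rpp = compute_advantages(rewards, "reinforce_pp")
```

Only the baseline changes between branches; rollout generation and reward scoring are untouched, which is why a single selector on top of the existing GRPO path seems sufficient.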
❓ Alternatives
NA
📝 Additional Context
NA
Acknowledgements