Add RLOO and REINFORCE++ Advantage Estimators as GRPO Extensions #3676

Description

@ved1beta

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

RLOO (REINFORCE Leave-One-Out) and REINFORCE++ are closely related online RL algorithms. They share the same rollout-and-reward structure as GRPO but use different baselines for advantage estimation.
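
For reference, here is a rough sketch of how the three advantage estimates differ, as they are commonly described. The notation is an assumption for illustration, not taken from this request: $r_i$ is the scalar reward of completion $i$, $K$ is the number of completions sampled per prompt, and $\mu_{\text{batch}}$, $\sigma_{\text{batch}}$ are the reward mean and standard deviation over every completion in the batch. The REINFORCE++ line is simplified and leaves out the token-level KL penalty that is usually folded into the reward.

```latex
% GRPO: normalize within the group of K completions for one prompt
A_i^{\mathrm{GRPO}} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)}

% RLOO: baseline is the mean reward of the other K - 1 completions
A_i^{\mathrm{RLOO}} = r_i - \frac{1}{K - 1} \sum_{j \neq i} r_j

% REINFORCE++ (simplified, KL term omitted): normalize over the whole batch
A_i^{\mathrm{R++}} = \frac{r_i - \mu_{\text{batch}}}{\sigma_{\text{batch}}}
```

Only the baseline and normalization change between the three; rollout generation and reward scoring are the same.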

✔️ Solution

  • Add an advantage-estimator selector, for example by extending loss_type.
  • Implement the differing advantage computation in the GRPO strategy/trainer.
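
A minimal sketch of what such a selector could look like, assuming rewards arrive as one list of scalars per prompt. All names here (`compute_advantages`, `ADVANTAGE_ESTIMATORS`, and the `"grpo"`, `"rloo"`, `"reinforce_pp"` keys) are hypothetical and do not correspond to an existing config option or trainer API. It uses plain Python lists for clarity where a real trainer would work on tensors, and the REINFORCE++ branch is simplified to batch-wide normalization without the KL term.

```python
from statistics import fmean, pstdev
from typing import Callable, Dict, List

# rewards[p][k] is the reward of the k-th completion sampled for prompt p.
Groups = List[List[float]]

EPS = 1e-8  # guards against division by zero when all rewards are equal


def grpo_advantages(rewards: Groups) -> Groups:
    """Per-prompt normalization: subtract the group mean, divide by the group std."""
    out = []
    for group in rewards:
        mean, std = fmean(group), pstdev(group)  # population std (an assumption)
        out.append([(r - mean) / (std + EPS) for r in group])
    return out


def rloo_advantages(rewards: Groups) -> Groups:
    """Leave-one-out: the baseline for a sample is the mean of the other K - 1 rewards."""
    out = []
    for group in rewards:
        k = len(group)
        if k < 2:
            raise ValueError("RLOO needs at least 2 completions per prompt")
        total = sum(group)
        out.append([r - (total - r) / (k - 1) for r in group])
    return out


def reinforce_pp_advantages(rewards: Groups) -> Groups:
    """Simplified REINFORCE++: normalize with whole-batch statistics (KL term omitted)."""
    flat = [r for group in rewards for r in group]
    mean, std = fmean(flat), pstdev(flat)
    return [[(r - mean) / (std + EPS) for r in group] for group in rewards]


# Hypothetical registry; the keys are illustrative names only.
ADVANTAGE_ESTIMATORS: Dict[str, Callable[[Groups], Groups]] = {
    "grpo": grpo_advantages,
    "rloo": rloo_advantages,
    "reinforce_pp": reinforce_pp_advantages,
}


def compute_advantages(rewards: Groups, estimator: str = "grpo") -> Groups:
    """Dispatch to the selected estimator, rejecting unknown names early."""
    try:
        fn = ADVANTAGE_ESTIMATORS[estimator]
    except KeyError:
        known = ", ".join(sorted(ADVANTAGE_ESTIMATORS))
        raise ValueError(f"unknown estimator {estimator!r}; expected one of: {known}") from None
    return fn(rewards)
```

For example, `compute_advantages([[2.0, 0.0]], "rloo")` gives each sample the other sample's reward as its baseline. Keeping the estimators behind one dispatch function means the rest of the training loop would not need to know which one is active.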

❓ Alternatives

NA

📝 Additional Context

NA

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.

Metadata

Assignees

No one assigned

Labels

enhancement: New feature or request
