Add RLOO and REINFORCE++ Advantage Estimators as GRPO Extensions

### ⚠️ Please check that this feature request hasn't been suggested before.

- [x] I searched previous [Ideas in Discussions](https://github.com/axolotl-ai-cloud/axolotl/discussions/categories/ideas) and didn't find any similar feature requests.
- [x] I searched previous [Issues](https://github.com/axolotl-ai-cloud/axolotl/labels/enhancement) and didn't find any similar feature requests.

### 🔖 Feature description

RLOO (REINFORCE Leave-One-Out) and REINFORCE++ are closely related online-RL algorithms. They share the same rollout-and-reward structure but use different baselines for advantage estimation, so both could be supported as advantage-estimator options on top of the existing GRPO path.
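
For reference, the baselines can be written side by side. This is only a sketch under assumptions not stated above: $K$ completions per prompt with scalar rewards $r_1, \dots, r_K$, a small $\epsilon$ for numerical stability, and a batch $B$ containing all completions for all prompts in the step. The REINFORCE++ line shows the commonly described batch-normalized form with its token-level KL penalty omitted; the exact variant to implement would be up to the maintainers.

```math
\begin{aligned}
\text{GRPO:} \quad & A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}, \qquad \bar{r} = \frac{1}{K} \sum_{j=1}^{K} r_j \\
\text{RLOO:} \quad & A_i = r_i - \frac{1}{K-1} \sum_{j \neq i} r_j = \frac{K}{K-1} \left( r_i - \bar{r} \right) \\
\text{REINFORCE++:} \quad & A_i = \frac{r_i - \mu_B}{\sigma_B + \epsilon}
\end{aligned}
```

Here $\sigma_r$ is the standard deviation of the rewards within one prompt's group, while $\mu_B$ and $\sigma_B$ are the mean and standard deviation over the whole batch.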

### ✔️ Solution

- Add an advantage-estimator selector, for example by extending `loss_type`.
- Implement the differing advantage computation in the GRPO strategy/trainer.
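
To make the two bullets above concrete, here is a minimal sketch of what a selector plus the per-estimator advantage computation might look like. Everything named here is hypothetical: the `compute_advantages` function, the `estimator` argument, and the values `"grpo"`, `"rloo"`, and `"reinforce_pp"` are not existing axolotl or TRL options. It uses plain Python lists so it runs without dependencies; a real implementation would operate on reward tensors inside the trainer. The REINFORCE++ branch assumes the batch-normalized form and leaves out the KL penalty.

```python
from statistics import fmean, pstdev

EPS = 1e-4  # assumed stabilizer to avoid division by zero


def compute_advantages(rewards, estimator="grpo"):
    """Compute per-completion advantages.

    rewards: list of groups, one list of K scalar rewards per prompt.
    estimator: hypothetical selector value, one of
        "grpo", "rloo", "reinforce_pp".
    Returns a nested list with the same shape as `rewards`.
    """
    if estimator == "grpo":
        # Group baseline: subtract the group mean, scale by the group std.
        # Population std is used here; implementations differ on this.
        out = []
        for group in rewards:
            mean, std = fmean(group), pstdev(group)
            out.append([(r - mean) / (std + EPS) for r in group])
        return out

    if estimator == "rloo":
        # Leave-one-out baseline: the mean of the other K-1 rewards.
        out = []
        for group in rewards:
            k = len(group)
            if k < 2:
                raise ValueError("RLOO needs at least 2 completions per prompt")
            total = sum(group)
            out.append([r - (total - r) / (k - 1) for r in group])
        return out

    if estimator == "reinforce_pp":
        # Batch-level normalization across all prompts and completions
        # (assumed variant; the KL penalty term is omitted in this sketch).
        flat = [r for group in rewards for r in group]
        mean, std = fmean(flat), pstdev(flat)
        return [[(r - mean) / (std + EPS) for r in group] for group in rewards]

    raise ValueError(f"unknown advantage estimator: {estimator!r}")


if __name__ == "__main__":
    batch = [[1.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, 0.0]]
    for name in ("grpo", "rloo", "reinforce_pp"):
        print(name, compute_advantages(batch, estimator=name))
```

With the example batch, the RLOO advantages for the first prompt come out to roughly `[0.667, -0.667, -0.667, 0.667]`: each reward is compared against the mean of the other three. Only the baseline changes between branches, which is why a single selector on the existing GRPO path seems sufficient.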


### ❓ Alternatives

NA

### 📝 Additional Context

NA

### Acknowledgements

- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this feature has not been requested yet.
- [x] I have provided enough information for the maintainers to understand and evaluate this request.

Add RLOO and REINFORCE++ Advantage Estimators as GRPO Extensions #3676
