RL Stack Refactoring – Call for Contributions
We have been working on a major RL stack refactoring in LeRobot. The goal is to build a solid RL foundation that makes reinforcement learning and fine-tuning VLAs with RL as easy as imitation learning already is, and that makes adding new RL algorithms straightforward for the community.
The first RL algorithm in LeRobot – HIL-SERL (RLPD/SAC with actor-learner architecture, reward classifier, and human interventions) – is working (#504, PR #644). Now we want to:
- Solidify the foundations and tighten the interfaces
- Make adding new RL algorithms easy – one algorithm, one file
- Support VLA fine-tuning with RL (ConRFT, QC-FQL, RECAP, etc.)
- Integrate pluggable reward models for both IL and RL
We welcome community feedback, contributions, and ideas. You don't need to write extensive code on your own – any input on a sub-component, however small, is appreciated.
Coordination: Discord #reinforcement-learning
Current Architecture
This architecture is introduced in the RL refactoring PR. The RL stack lives in src/lerobot/rl/ and follows LeRobot's existing patterns:
```
src/lerobot/rl/
├── algorithms/
│   ├── base.py            # RLAlgorithm ABC + RLAlgorithmConfig (draccus.ChoiceRegistry)
│   └── sac.py             # SAC training implementation
├── trainer.py             # RLTrainer (orchestrates training steps)
├── buffer.py              # ReplayBuffer
├── data_sources/
│   └── data_mixer.py      # DataMixer + OnlineOfflineMixer
├── learner.py             # Learner process (GPU training)
├── actor.py               # Actor process (environment interaction)
├── learner_service.py     # gRPC service for distributed training
└── ...
```
Key design decisions already in place:
- `RLAlgorithmConfig` uses `draccus.ChoiceRegistry` – same pattern as policies, cameras, motors
- `RLAlgorithm.update(batch_iterator)` – algorithms consume from an iterator, controlling batch consumption (e.g. UTD ratio)
- Actor-Learner architecture – distributed via gRPC: actors collect experience, the learner trains
- `DataMixer` – mixes heterogeneous data sources (`OnlineOfflineMixer` mixes offline and online demonstrations)
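To make the second design decision concrete, here is a minimal, self-contained sketch of the `update(batch_iterator)` pattern. The class and method names (`RLAlgorithm`, `RLAlgorithmConfig`, `update`) come from this issue, but the exact signatures, the `ToySAC` class, and the `utd_ratio` field are illustrative assumptions, not LeRobot's actual code:

```python
# Sketch of the "algorithm controls batch consumption" pattern.
# ToySAC and utd_ratio are hypothetical stand-ins for illustration.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class RLAlgorithmConfig:
    """Base config; in LeRobot this participates in draccus.ChoiceRegistry
    so each algorithm registers under a CLI-selectable name."""
    utd_ratio: int = 1  # updates-to-data ratio: gradient steps per call


class RLAlgorithm(ABC):
    @abstractmethod
    def update(self, batch_iterator: Iterator[dict[str, Any]]) -> dict[str, float]:
        """Consume as many batches as the algorithm needs; return metrics."""


class ToySAC(RLAlgorithm):
    """Illustrative stand-in for sac.py: pulls utd_ratio batches per call."""

    def __init__(self, cfg: RLAlgorithmConfig):
        self.cfg = cfg
        self.num_updates = 0

    def update(self, batch_iterator):
        for _ in range(self.cfg.utd_ratio):  # the algorithm, not the trainer,
            next(batch_iterator)             # decides how many batches to draw
            self.num_updates += 1
        return {"num_updates": self.num_updates}


# Endless stream of dummy batches standing in for the replay buffer.
batches = iter(lambda: {"obs": [0.0], "action": [0.0]}, None)
algo = ToySAC(RLAlgorithmConfig(utd_ratio=4))
metrics = algo.update(batches)
print(metrics)  # {'num_updates': 4}
```

The point of handing the algorithm an iterator rather than a single batch is that RLPD-style SAC with UTD > 1 can draw several batches per training step without the trainer needing to know.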
Roadmap
Phase 1: Solidify Foundations (SAC stays working, interfaces tighten)
- Make `RLAlgorithmConfig` a first-class explicit config (decouple from `policy_cfg`)
- Clean up the `RLAlgorithm` public API (`select_action`, `update`, `get_weights`/`load_weights`, `configure_data_iterator`)
- Improve documentation and add an RL training tutorial
- Add tests for the core RL components
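Of the public API above, `get_weights`/`load_weights` are what the actor-learner split hinges on. The sketch below shows how that pair might be used for weight sync; the method names come from this issue, while `ToyPolicy`, the dict-based state format, and the sync cadence are assumptions for illustration:

```python
# Hypothetical weight-sync flow between a learner and an actor process.
# In LeRobot the transport is gRPC; here we just pass the dict in-process.
class ToyPolicy:
    def __init__(self):
        self.params = {"w": 0.0}

    def get_weights(self):
        # Learner side: snapshot parameters to ship to actors.
        return dict(self.params)

    def load_weights(self, weights):
        # Actor side: swap in the latest learner parameters.
        self.params = dict(weights)


learner, actor = ToyPolicy(), ToyPolicy()
learner.params["w"] = 1.5                   # learner takes gradient steps
actor.load_weights(learner.get_weights())   # periodic sync to the actor
print(actor.params)  # {'w': 1.5}
```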
Phase 2: Pluggable Reward Models
- Create a `src/lerobot/rewards/` package – shared reward models for both IL and RL
- Migrate the existing reward classifier from `policies/sac/reward_model/`
- Migrate SARM from `policies/sarm/` (see the SampleWeighter refactor for the related IL abstraction)
- Support zero-shot reward models (TOPReward, VITA)
- Papers: SARM, TOPReward, VITA
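A pluggable reward model mainly needs one agreed-upon scoring interface that both IL and RL code can call. The sketch below is one possible shape; the `RewardModel` API, `score` method, and `ThresholdClassifierReward` class are hypothetical illustrations (only the `src/lerobot/rewards/` package path comes from this issue):

```python
# Hypothetical pluggable reward-model interface for src/lerobot/rewards/.
from abc import ABC, abstractmethod
from typing import Sequence


class RewardModel(ABC):
    """Shared interface usable by IL (e.g. sample weighting) and RL (returns)."""

    @abstractmethod
    def score(self, observations: Sequence[dict]) -> list[float]:
        """Map a batch of observations to scalar rewards."""


class ThresholdClassifierReward(RewardModel):
    """Toy stand-in for a binary success classifier."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def score(self, observations):
        # A real classifier would run a vision backbone on image frames;
        # here we read a precomputed success probability from the dict.
        return [1.0 if obs["success_prob"] >= self.threshold else 0.0
                for obs in observations]


rewards = ThresholdClassifierReward(0.5).score(
    [{"success_prob": 0.9}, {"success_prob": 0.2}]
)
print(rewards)  # [1.0, 0.0]
```

Keeping the interface this narrow is what would let learned classifiers, SARM-style models, and zero-shot models (TOPReward, VITA) be swapped without touching trainer code.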
Phase 3: VLA Fine-Tuning Algorithms
- RECAP – advantage-conditioned policies for heterogeneous data. Paper: π*0.6. Community: lerobot_recap, PR #2923
- QC-FQL – Flow Q-learning (FQL) agent with action chunking (QC). Community: PR #1818
- ConRFT – unified offline-online consistency policy. Community: PR #1823
- DSRL (low complexity) – steer a frozen diffusion policy via latent-noise SAC; builds on the existing SAC infra. Paper: arXiv:2506.15799
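DSRL is the lightest-weight item on this list because the pretrained policy stays frozen: SAC acts in the latent-noise space, and the diffusion policy decodes that latent into an environment action. The toy sketch below only illustrates that control flow; the function names and the `tanh` decoder are invented stand-ins, not anything from the paper or LeRobot:

```python
# Toy illustration of the DSRL control flow: RL picks a latent z, a frozen
# pretrained policy decodes z into the action the environment actually sees.
import math


def frozen_diffusion_policy(obs: float, z: float) -> float:
    # Stand-in for a pretrained denoiser: a fixed map from latent noise z
    # (which the SAC agent controls) to a bounded environment action.
    return math.tanh(0.1 * obs + z)


obs = 2.0
z = 0.3  # in DSRL this latent comes from the SAC actor network
env_action = frozen_diffusion_policy(obs, z)
print(round(env_action, 3))
```

Because only `z` is optimized, the existing SAC implementation can be reused almost unchanged, with the frozen policy acting as a fixed action decoder.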
Open for Ideas
We want to hear from the community:
- What would make your life easier when training RL policies in LeRobot?
- What RL algorithms are you most excited about for robot learning?
- What other RL-based VLA fine-tuning methods should we support?
- What reward model approaches have worked best for your tasks?
Please comment below or open a PR. Contributions are highly encouraged!
Your help will make LeRobot a powerful and accessible framework for robot RL 🤖