RL Stack Refactoring – Call for Contributions
We have been working on a major RL stack refactoring in LeRobot. The goal is to build a solid RL foundation that makes reinforcement learning and fine-tuning VLAs with RL as easy as imitation learning already is, and that makes adding new RL algorithms straightforward for the community.
The first RL algorithm in LeRobot – HIL-SERL (RLPD/SAC with actor-learner architecture, reward classifier, and human interventions) – is working (#504, PR #644). Now we want to:
- Solidify the foundations and tighten the interfaces
- Make adding new RL algorithms easy – one algorithm, one file
- Support VLA fine-tuning with RL (ConRFT, QC-FQL, RECAP, etc.)
- Integrate pluggable reward models for both IL and RL
We welcome community feedback, contributions, and ideas. You don't need to write extensive code on your own – any input on a sub-component, however small, is appreciated.
Coordination: Discord #reinforcement-learning
Current Architecture
This architecture is introduced in the RL refactoring PR. The RL stack lives in src/lerobot/rl/ and follows LeRobot's existing patterns:
```
src/lerobot/rl/
├── algorithms/
│   ├── base.py            # RLAlgorithm ABC + RLAlgorithmConfig (draccus.ChoiceRegistry)
│   └── sac.py             # SAC training implementation
├── trainer.py             # RLTrainer (orchestrates training steps)
├── buffer.py              # ReplayBuffer
├── data_sources/
│   └── data_mixer.py      # DataMixer + OnlineOfflineMixer
├── learner.py             # Learner process (GPU training)
├── actor.py               # Actor process (environment interaction)
├── learner_service.py     # gRPC service for distributed training
└── ...
```
Key design decisions already in place:
- `RLAlgorithmConfig` uses `draccus.ChoiceRegistry` – same pattern as policies, cameras, motors
- `RLAlgorithm.update(batch_iterator)` – algorithms consume from an iterator, controlling batch consumption (e.g. UTD ratio)
- Actor-Learner architecture – distributed via gRPC: actors collect experience, the learner trains
- `DataMixer` – mixes heterogeneous data sources (`OnlineOfflineMixer` mixes offline and online demonstrations)
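To make the second design decision concrete, here is a minimal, self-contained sketch of the `update(batch_iterator)` pattern. The class and method names (`RLAlgorithm`, `RLAlgorithmConfig`, `update`) come from this issue, but the exact signatures, the `ToySAC` class, and the `utd_ratio` field are illustrative assumptions, not LeRobot's actual code:

```python
# Sketch of the "algorithm controls batch consumption" pattern.
# ToySAC and utd_ratio are hypothetical stand-ins for illustration.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class RLAlgorithmConfig:
    """Base config; in LeRobot this participates in draccus.ChoiceRegistry
    so each algorithm registers under a CLI-selectable name."""
    utd_ratio: int = 1  # updates-to-data ratio: gradient steps per call


class RLAlgorithm(ABC):
    @abstractmethod
    def update(self, batch_iterator: Iterator[dict[str, Any]]) -> dict[str, float]:
        """Consume as many batches as the algorithm needs; return metrics."""


class ToySAC(RLAlgorithm):
    """Illustrative stand-in for sac.py: pulls utd_ratio batches per call."""

    def __init__(self, cfg: RLAlgorithmConfig):
        self.cfg = cfg
        self.num_updates = 0

    def update(self, batch_iterator):
        for _ in range(self.cfg.utd_ratio):  # the algorithm, not the trainer,
            next(batch_iterator)             # decides how many batches to draw
            self.num_updates += 1
        return {"num_updates": self.num_updates}


# Endless stream of dummy batches standing in for the replay buffer.
batches = iter(lambda: {"obs": [0.0], "action": [0.0]}, None)
algo = ToySAC(RLAlgorithmConfig(utd_ratio=4))
metrics = algo.update(batches)
print(metrics)  # {'num_updates': 4}
```

The point of handing the algorithm an iterator rather than a single batch is that RLPD-style SAC with UTD > 1 can draw several batches per training step without the trainer needing to know.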
Roadmap
Phase 1: Solidify Foundations (SAC stays working, interfaces tighten)
- Make `RLAlgorithmConfig` a first-class explicit config (decouple from `policy_cfg`)
- Clean up the `RLAlgorithm` public API (`select_action`, `update`, `get_weights`/`load_weights`, `configure_data_iterator`)
- Improve documentation and add an RL training tutorial
- Add tests for the core RL components
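Of the public API above, `get_weights`/`load_weights` are what the actor-learner split hinges on. The sketch below shows how that pair might be used for weight sync; the method names come from this issue, while `ToyPolicy`, the dict-based state format, and the sync cadence are assumptions for illustration:

```python
# Hypothetical weight-sync flow between a learner and an actor process.
# In LeRobot the transport is gRPC; here we just pass the dict in-process.
class ToyPolicy:
    def __init__(self):
        self.params = {"w": 0.0}

    def get_weights(self):
        # Learner side: snapshot parameters to ship to actors.
        return dict(self.params)

    def load_weights(self, weights):
        # Actor side: swap in the latest learner parameters.
        self.params = dict(weights)


learner, actor = ToyPolicy(), ToyPolicy()
learner.params["w"] = 1.5                   # learner takes gradient steps
actor.load_weights(learner.get_weights())   # periodic sync to the actor
print(actor.params)  # {'w': 1.5}
```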
Phase 2: Pluggable Reward Models
- Create a `src/lerobot/rewards/` package – shared reward models for both IL and RL
- Migrate the existing reward classifier from `policies/sac/reward_model/`
- Migrate SARM from `policies/sarm/` (see the SampleWeighter refactor for the related IL abstraction)
- Support zero-shot reward models (TOPReward, VITA)
- Papers: SARM, TOPReward, VITA
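A pluggable reward model mainly needs one agreed-upon scoring interface that both IL and RL code can call. The sketch below is one possible shape; the `RewardModel` API, `score` method, and `ThresholdClassifierReward` class are hypothetical illustrations (only the `src/lerobot/rewards/` package path comes from this issue):

```python
# Hypothetical pluggable reward-model interface for src/lerobot/rewards/.
from abc import ABC, abstractmethod
from typing import Sequence


class RewardModel(ABC):
    """Shared interface usable by IL (e.g. sample weighting) and RL (returns)."""

    @abstractmethod
    def score(self, observations: Sequence[dict]) -> list[float]:
        """Map a batch of observations to scalar rewards."""


class ThresholdClassifierReward(RewardModel):
    """Toy stand-in for a binary success classifier."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def score(self, observations):
        # A real classifier would run a vision backbone on image frames;
        # here we read a precomputed success probability from the dict.
        return [1.0 if obs["success_prob"] >= self.threshold else 0.0
                for obs in observations]


rewards = ThresholdClassifierReward(0.5).score(
    [{"success_prob": 0.9}, {"success_prob": 0.2}]
)
print(rewards)  # [1.0, 0.0]
```

Keeping the interface this narrow is what would let learned classifiers, SARM-style models, and zero-shot models (TOPReward, VITA) be swapped without touching trainer code.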
Phase 3: VLA Fine-Tuning Algorithms
- RECAP – advantage-conditioned policies for heterogeneous data. Paper: π*0.6. Community: lerobot_recap, PR #2923
- QC-FQL – Flow Q-learning (FQL) agent with action chunking (QC). Community: PR #1818
- ConRFT – unified offline-online consistency policy. Community: PR #1823
- DSRL (low complexity) – steer a frozen diffusion policy via latent-noise SAC; builds on the existing SAC infra. Paper: arXiv:2506.15799
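DSRL is the lightest-weight item on this list because the pretrained policy stays frozen: SAC acts in the latent-noise space, and the diffusion policy decodes that latent into an environment action. The toy sketch below only illustrates that control flow; the function names and the `tanh` decoder are invented stand-ins, not anything from the paper or LeRobot:

```python
# Toy illustration of the DSRL control flow: RL picks a latent z, a frozen
# pretrained policy decodes z into the action the environment actually sees.
import math


def frozen_diffusion_policy(obs: float, z: float) -> float:
    # Stand-in for a pretrained denoiser: a fixed map from latent noise z
    # (which the SAC agent controls) to a bounded environment action.
    return math.tanh(0.1 * obs + z)


obs = 2.0
z = 0.3  # in DSRL this latent comes from the SAC actor network
env_action = frozen_diffusion_policy(obs, z)
print(round(env_action, 3))
```

Because only `z` is optimized, the existing SAC implementation can be reused almost unchanged, with the frozen policy acting as a fixed action decoder.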
Open for Ideas
We want to hear from the community:
- What would make your life easier when training RL policies in LeRobot?
- What RL algorithms are you most excited about for robot learning?
- What other RL-based VLA fine-tuning methods should we support?
- What reward model approaches have worked best for your tasks?
Please comment below or open a PR. Contributions are highly encouraged!
Your help will make LeRobot a powerful and accessible framework for robot RL 🤖