
RL Stack Refactoring: Call for Contributions #3076

@s1lent4gnt


We have been working on a major refactoring of the RL stack in LeRobot. The goal is to build a solid RL foundation that makes reinforcement learning, and fine-tuning VLAs with RL, as easy as imitation learning already is, and that makes adding new RL algorithms straightforward for the community.

The first RL algorithm in LeRobot, HIL-SERL (RLPD/SAC with an actor-learner architecture, a reward classifier, and human interventions), is working (#504, PR #644). Now we want to:

  1. Solidify the foundations and tighten the interfaces
  2. Make adding new RL algorithms easy: one algorithm, one file
  3. Support VLA fine-tuning with RL (ConRFT, QC-FQL, RECAP, etc.)
  4. Integrate pluggable reward models for both IL and RL

We welcome community feedback, contributions, and ideas. You don't need to write extensive code on your own; any input on a sub-component, or any comment, however small, is appreciated.

Coordination: Discord #reinforcement-learning


Current Architecture

This architecture is introduced in the RL refactoring PR. The RL stack lives in src/lerobot/rl/ and follows LeRobot's existing patterns:

src/lerobot/rl/
├── algorithms/
│   ├── base.py          # RLAlgorithm ABC + RLAlgorithmConfig (draccus.ChoiceRegistry)
│   └── sac.py           # SAC training implementation
├── trainer.py           # RLTrainer (orchestrates training steps)
├── buffer.py            # ReplayBuffer
├── data_sources/
│   └── data_mixer.py    # DataMixer + OnlineOfflineMixer
├── learner.py           # Learner process (GPU training)
├── actor.py             # Actor process (environment interaction)
├── learner_service.py   # gRPC service for distributed training
└── ...
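The config registry in base.py follows LeRobot's choice-registry pattern: each algorithm registers its config class under a string key, so configs can be selected by name from the CLI. The sketch below mimics that pattern in plain Python rather than calling the actual draccus API; all class names, fields, and method names here are illustrative, not the LeRobot source.

```python
from dataclasses import dataclass


class RLAlgorithmConfig:
    """Base config; subclasses register under a string key, draccus-style (sketch)."""

    _registry = {}  # maps choice name -> config subclass

    @classmethod
    def register_subclass(cls, name: str):
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def get_choice_class(cls, name: str):
        return cls._registry[name]


@RLAlgorithmConfig.register_subclass("sac")
@dataclass
class SACConfig(RLAlgorithmConfig):
    # Hypothetical fields for illustration only.
    discount: float = 0.99
    utd_ratio: int = 1


# Look up the config class by name, as a CLI parser would.
cfg = RLAlgorithmConfig.get_choice_class("sac")(utd_ratio=2)
print(cfg.utd_ratio)  # 2
```

The payoff of this pattern is that adding a new algorithm only requires registering one new config class; no central dispatch table needs editing.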

Key design decisions already in place:

  • RLAlgorithmConfig uses draccus.ChoiceRegistry, the same pattern as policies, cameras, and motors
  • RLAlgorithm.update(batch_iterator): algorithms consume from an iterator, controlling their own batch consumption (e.g. the UTD ratio)
  • Actor-learner architecture: distributed via gRPC; actors collect experience, the learner trains
  • DataMixer: mixes heterogeneous data sources (OnlineOfflineMixer mixes offline and online demonstrations)
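The update(batch_iterator) contract is the key inversion of control: the trainer hands over an iterator and the algorithm decides how many batches to pull, which is how a UTD ratio greater than one is expressed. A minimal sketch of that contract, with a toy algorithm standing in for SAC (all names here are illustrative, not the exact LeRobot API):

```python
from itertools import islice


class ToyAlgorithm:
    """Stand-in for an RLAlgorithm subclass; shows the update() contract only."""

    def __init__(self, utd_ratio: int = 4):
        self.utd_ratio = utd_ratio
        self.gradient_steps = 0

    def update(self, batch_iterator):
        # The algorithm, not the trainer, decides how many batches to
        # consume per call: here, utd_ratio batches per environment step.
        for batch in islice(batch_iterator, self.utd_ratio):
            self.gradient_steps += 1  # stand-in for one gradient step on `batch`


# A replay buffer would normally yield batches; an integer stream suffices here.
batches = iter(range(100))
algo = ToyAlgorithm(utd_ratio=4)
algo.update(batches)
print(algo.gradient_steps)  # 4
```

Because the iterator is shared state, the trainer stays agnostic to each algorithm's batch appetite, which is what makes "one algorithm, one file" feasible.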

Roadmap

Phase 1: Solidify Foundations (SAC stays working, interfaces tighten)

  • Make RLAlgorithmConfig a first-class explicit config (decouple from policy_cfg)
  • Clean up RLAlgorithm public API (select_action, update, get_weights/load_weights, configure_data_iterator)
  • Improve documentation and add an RL training tutorial
  • Add tests for the core RL components

Phase 2: Pluggable Reward Models

  • Create src/lerobot/rewards/ package β€” shared reward models for both IL and RL
  • Migrate existing reward classifier from policies/sac/reward_model/
  • Migrate SARM from policies/sarm/ (see the SampleWeighter refactor for the related IL abstraction)
  • Support zero-shot reward models (TOPReward, VITA)
  • Papers: SARM, TOPReward, VITA
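A shared rewards package implies one interface that both IL (e.g. sample weighting) and RL (e.g. replay relabeling) can call. The protocol and function names below are assumptions for illustration; the eventual src/lerobot/rewards/ API may look different.

```python
from typing import Protocol


class RewardModel(Protocol):
    """Hypothetical shared interface: score an observation, return a reward."""

    def __call__(self, observation: dict) -> float: ...


class ThresholdClassifierReward:
    """Toy stand-in for a learned success classifier."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def __call__(self, observation: dict) -> float:
        # A real classifier would score camera frames; we read a scalar here.
        return 1.0 if observation["success_score"] > self.threshold else 0.0


def relabel(transitions, reward_model: RewardModel):
    """Relabel (observation, reward) pairs with model-predicted rewards.
    The same call shape would serve RL replay buffers or IL sample weighting."""
    return [(obs, reward_model(obs)) for obs, _ in transitions]


data = [({"success_score": 0.9}, 0.0), ({"success_score": 0.1}, 0.0)]
print(relabel(data, ThresholdClassifierReward()))
# [({'success_score': 0.9}, 1.0), ({'success_score': 0.1}, 0.0)]
```

Keeping the interface this small is what would let zero-shot models like TOPReward or VITA drop in without touching the training loop.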

Phase 3: VLA Fine-Tuning Algorithms

  • RECAP: advantage-conditioned policies for heterogeneous data. Paper: π*0.6. Community: lerobot_recap, PR #2923
  • QC-FQL: flow Q-learning (FQL) agent with action chunking (QC). Community: PR #1818
  • ConRFT: unified offline-online consistency policy. Community: PR #1823
  • DSRL (low complexity): steers a frozen diffusion policy via latent-noise SAC; builds on the existing SAC infrastructure. Paper: arXiv:2506.15799
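The DSRL idea in the last bullet is worth spelling out, since it reuses the existing SAC infrastructure: the diffusion policy stays frozen, and the RL actor acts in its latent-noise space, choosing which noise the policy denoises into an action. A toy sketch of that structure (the deterministic "frozen policy", the reward, and the grid-search "actor" are all stand-ins; a real implementation would train SAC over the latent space):

```python
import math


def frozen_diffusion_policy(obs: float, z: float) -> float:
    """Stand-in for a frozen generative policy: deterministic in (obs, latent z)."""
    return math.tanh(obs + z)


def reward(action: float) -> float:
    """Toy task: the best action is 0.5."""
    return -((action - 0.5) ** 2)


def latent_actor(obs: float, candidates) -> float:
    """Stand-in for a latent-space SAC actor: pick the best-scoring latent.
    The base policy's weights are never touched; only z is optimized."""
    return max(candidates, key=lambda z: reward(frozen_diffusion_policy(obs, z)))


obs = 0.0
z_star = latent_actor(obs, [i / 10 for i in range(-20, 21)])
action = frozen_diffusion_policy(obs, z_star)  # steered close to the optimum 0.5
```

The appeal for LeRobot is the "low complexity" note above: the actor, learner, and buffer from SAC carry over unchanged, with only the action space swapped for the latent space.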

Open for Ideas

We want to hear from the community:

  • What would make your life easier when training RL policies in LeRobot?
  • What RL algorithms are you most excited about for robot learning?
  • What other RL-based VLA fine-tuning methods should we support?
  • What reward model approaches have worked best for your tasks?

Please comment below or open a PR. Contributions are highly encouraged!

Your help will make LeRobot a powerful and accessible framework for robot RL 🤖
