
Conversation

@jpizarrom (Contributor) commented Dec 7, 2025

What this does

This PR is an experiment that adds a VLA as the behavior-cloning (BC) actor in ACFQL, built on top of #1818.
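
For context on the general idea (not taken from this PR's code): one common way to use an advantage signal inside behavior cloning is AWR/AWAC-style re-weighting of the BC loss by exp(beta * A), with A coming from a separate value net. The sketch below shows only that generic recipe; the function name and beta are hypothetical, and the PR's recap_style_advantages option may well work differently (for example by conditioning the VLA on the advantage together with CFG).

```python
# Illustrative only: AWR/AWAC-style advantage weighting of a chunked BC loss.
# `advantage_weighted_bc_loss` and `beta` are hypothetical names, not from this PR.
import torch


def advantage_weighted_bc_loss(pred_chunk, expert_chunk, advantages, beta=1.0):
    """pred_chunk, expert_chunk: (B, chunk, action_dim); advantages: (B,)."""
    per_sample = ((pred_chunk - expert_chunk) ** 2).mean(dim=(1, 2))         # per-sample BC error, (B,)
    weights = torch.clamp(torch.exp(beta * advantages.detach()), max=100.0)  # up-weight positive-advantage data
    return (weights * per_sample).mean()
```

Clamping the exponential weight is a standard trick to keep a few large advantages from dominating the batch.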

How it was tested

uv run python -m lerobot.rl.acfqlvla.learner \
    --config_path ../lerobot-configs-grocery-so100-fqlvla/train_gym_hil_env_fql_lilkm_pushfreq25x20251007.json \
    --wandb.enable=true \
    --policy.bc_policy=SmolVLA \
    --policy.vla_pretrained_name_or_path=lerobot/smolvla_base \
    --batch_size=64 \
    --policy.cache_observation_features_vla=true \
    --policy.cfg.enabled=true \
    --policy.recap_style_advantages=true \
    --save_freq=500

uv run python -m lerobot.rl.acfql.learner \
    --config_path ../lerobot-configs-grocery-so100-fqlvla/train_config_hilserl_so100_gamepad_ee_20fps.json \
    --job_name=7d-pre1m-25120222 \
    --wandb.project=hilserl-acfql-vla-recap-so100-grocery-butter-real-so100-20fps \
    --policy.offline_steps=1000000 \
    --policy.online_steps=0 \
    --policy.online_step_before_learning=5000 \
    --wandb.enable=true \
    --policy.actor_learner_config.policy_parameters_push_frequency=25 \
    --env.fps=20 \
    --save_replay_buffer_on_checkpoint=false \
    --save_offline_replay_buffer_on_checkpoint=false \
    --dataset.repo_id=jpizarrom/hilserl_so100_grocery_so100_2025112223_20 \
    --policy.offline_buffer_capacity=209290 \
    --policy.online_buffer_capacity=0 \
    --online_dataset=null \
    --policy.normalize_q_loss=null \
    --policy.storage_device_offline_replay_buffer=cuda \
    --policy.storage_device_replay_buffer=cpu \
    --policy.critic_grad_clip_norm=200.0 \
    --policy.actor_bc_grad_clip_norm=3.0 \
    --policy.actor_onestep_grad_clip_norm=800.0 \
    --policy.alpha=300.0 \
    --policy.shared_encoder=false \
    --policy.load_vlm_weights=false \
    --policy.chunk_size=10 \
    --policy.n_action_steps=10 \
    --policy.max_action_dim=32 \
    --policy.max_state_dim=32 \
    --policy.num_vlm_layers=16 \
    --policy.expert_width_multiplier=0.75 \
    --batch_size=64 \
    --policy.bc_policy=SmolVLA \
    --policy.cfg.enabled=true \
    --policy.recap_style_advantages=true \
    --save_freq=10000

TODO

  • use VLA + RECAP-style advantage signal as the Behavior Cloning actor (continuous value net, ResNet+MLP); see the value-net sketch after this list
  • offline RL learner
  • train on a task in sim with offline RL only
  • train on a task on the real so100 with offline RL only
  • online RL learner/actor
  • train on a task on the real so100 with offline RL + online RL
  • use VLA + RECAP-style advantage signal as Behavior Cloning actor (distributional value net)
  • share configs
  • test coverage

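A minimal sketch of what a "continuous value net, ResNet+MLP" could look like for RECAP-style advantages. The architecture below is an assumption for illustration; torchvision's resnet18 stands in for whatever image encoder the PR actually uses, and all dimensions are made up.

```python
# Hypothetical ResNet+MLP value head; not the implementation in this PR.
import torch
import torch.nn as nn
import torchvision


class ResNetMLPValueNet(nn.Module):
    def __init__(self, state_dim=32, hidden_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose the 512-dim pooled image features
        self.encoder = backbone
        self.mlp = nn.Sequential(
            nn.Linear(512 + state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, image, state):
        # image: (B, 3, H, W), state: (B, state_dim) -> V(s): (B,)
        feat = self.encoder(image)
        return self.mlp(torch.cat([feat, state], dim=-1)).squeeze(-1)
```

With such a net, a RECAP-style advantage for a transition could then be estimated as the (discounted) return minus V(s).
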
Also possibly worth exploring:

  • VLA as one-step actor
  • multi-task transformer-based critic (a rough sketch follows this list)
  • multi-task transformer-based value network

@jpizarrom changed the title to "[WIP] [HIL-SERL] Add Flow Q-learning (FQL) agent with action chunking + VLA" on Dec 7, 2025