Vision-Language-Action (VLA) policy fine-tuning for Unitree G1 humanoid in NVIDIA Isaac Lab.
End-to-end VLA pipeline: data collection in Isaac Sim → LeRobot dataset format → SmolVLA fine-tuning → evaluation back in Isaac Sim.

    Isaac Lab (G1 + UR5e)            SmolVLA (450 M)              Isaac Lab eval
     - rollouts → episodes     ┌─────────────────────────┐      ┌─────────────────┐
     - cameras + actions       │ HuggingFaceTB/SmolVLA   │      │ HTTP server +   │
          └──────┬───────────► │ Vision-Language-Action  │ ───► │ closed-loop     │
                 │             │ + LeRobot trainer       │      │ evaluation      │
                 v             └─────────────────────────┘      └─────────────────┘
       data/lerobot_dataset

| Folder | Purpose |
|---|---|
| `data/` | Episode collection + LeRobot conversion (`convert_to_lerobot.py`) |
| `train/` | SmolVLA fine-tune launcher |
| `models/` | `smolvla_wrapper.py`: inference wrapper around LeRobot SmolVLA |
| `eval/` | Isaac Sim evaluation: `eval_smolvla.py`, `eval_smolvla_http.py`, `smolvla_server.py` |
| `envs/` | UR5e pick-and-place env (Isaac Lab Direct workflow) |
| `vla_common/` | Shared utilities (camera config, action chunking, ...) |
| `configs/` | Experiment configs (`finetune_smolvla.yaml`, ...) |
| `checkpoints/` | Optional pre-trained checkpoints (see "Pre-trained Checkpoints" below) |

| Component | Version |
|---|---|
| OS | Windows 11 |
| GPU | NVIDIA RTX (Blackwell: driver 591.74) |
| Python | 3.11 |
| Isaac Sim | 5.1.0 |
| Isaac Lab | 0.48.0 (release/2.3.0) |
| LeRobot | latest (with SmolVLA extras) |
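
Once the training environment is set up, a quick sanity check against the table above can catch a wrong interpreter or a missing install early. The following is a minimal sketch, not part of this repository: `installed_versions` and `python_matches` are hypothetical helper names, and checking `torch` assumes it is pulled in as a LeRobot dependency.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_matches(major: int = 3, minor: int = 11) -> bool:
    """True if the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]} (3.11 expected: {python_matches()})")
    # "torch" is an assumption: it is checked here as a LeRobot dependency.
    for name, version in installed_versions(["lerobot", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the `vla_train` environment; a `not installed` line for `lerobot` suggests the `pip install` step did not target the active environment.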

Create the training environment and install LeRobot with the SmolVLA extras:

    conda create -n vla_train python=3.11 -y
    conda activate vla_train
    pip install lerobot[smolvla]

Collect demonstration episodes in Isaac Sim:

    .\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\g1_vla\scripts\collect_demos.py --num_episodes 200 --task pick_place

Episodes are saved under `data/raw_episodes/` (state, action, camera frames).

Convert the raw episodes to the LeRobot dataset format:

    python source\isaaclab_tasks\isaaclab_tasks\direct\g1_vla\data\convert_to_lerobot.py ^
        --input_dir data/raw_episodes ^
        --output_dir data/lerobot_dataset ^
        --fps 5

Fine-tune SmolVLA:

    python -m lerobot.scripts.lerobot_train ^
        --config_path source/isaaclab_tasks/isaaclab_tasks/direct/g1_vla/configs/experiments/finetune_smolvla.yaml ^
        --steps 20000

Output checkpoints land in `experiments/smolvla_finetune_*/checkpoints/`.
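
When several runs exist, it can help to pick the checkpoint to evaluate programmatically. The sketch below scans the output pattern mentioned above and returns the most recently modified `pretrained_model/` directory. It assumes the `experiments/smolvla_finetune_*/checkpoints/<name>/pretrained_model` layout implied by the evaluation commands; `find_latest_pretrained_model` is a hypothetical helper, not part of this repository or of LeRobot.

```python
from pathlib import Path
from typing import Optional


def find_latest_pretrained_model(root: str = "experiments") -> Optional[Path]:
    """Return the most recently modified pretrained_model/ directory, or None.

    Assumed layout (an assumption based on the paths used in this README):
        <root>/smolvla_finetune_*/checkpoints/<name>/pretrained_model
    """
    pattern = "smolvla_finetune_*/checkpoints/*/pretrained_model"
    candidates = [p for p in Path(root).glob(pattern) if p.is_dir()]
    if not candidates:
        return None
    # Newest modification time wins; this is a heuristic, not a step counter.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    latest = find_latest_pretrained_model()
    print(latest if latest else "No checkpoints found under experiments/")
```

The printed path can be passed as the `--checkpoint` argument of the evaluation scripts.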

Evaluate in Isaac Sim with the policy served over HTTP:

    # Start inference server (separate terminal)
    python source\isaaclab_tasks\isaaclab_tasks\direct\g1_vla\eval\smolvla_server.py ^
        --checkpoint experiments/smolvla_finetune_3000ep_seed456/checkpoints/last/pretrained_model ^
        --host 127.0.0.1 --port 8765 --device cuda:0

    # Run evaluator (in env_isaaclab env)
    .\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\g1_vla\eval\eval_smolvla_http.py ^
        --enable_cameras --num_envs 1 --num_episodes 20 --headless ^
        --server_url http://127.0.0.1:8765 --task "pick up the red cube" --seeds 42

Or single-process evaluation:

    .\isaaclab.bat -p source\isaaclab_tasks\isaaclab_tasks\direct\g1_vla\eval\eval_smolvla.py ^
        --enable_cameras --num_envs 1 --num_episodes 20 --headless ^
        --checkpoint experiments/smolvla_finetune_3000ep_seed456/checkpoints/last/pretrained_model

Key configuration settings:

    policy:
      type: smolvla
      pretrained_path: "HuggingFaceTB/SmolVLA-base"
      action_chunk_size: 10
      num_cameras: 1
      image_size: [224, 224]
      freeze_vision_encoder: true
    training:
      num_steps: 20000
      batch_size: 32
      lr: 1.0e-4

Reference hardware:

- GPU: NVIDIA RTX 5070 Ti Laptop, 12 GB VRAM
- CPU: Intel i9-13900HX (24 C / 32 T)
- RAM: 64 GB DDR5-5200 dual-channel
- Memory budget during fine-tune: ~6–8 GB VRAM (BF16), ~10–17 GB RAM
- Throughput on this hardware: ~1.25 step/s, 20 K steps in ~9 hours
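
The `action_chunk_size: 10` setting in the config means the policy predicts a short sequence of future actions per query rather than a single action. One common way to consume such chunks in a closed loop is to execute the queued actions one at a time and query the policy again only when the queue runs empty. The sketch below illustrates that pattern with a stub policy; `ChunkedActionQueue`, `predict_chunk`, and `stub_policy` are hypothetical names, and this is not necessarily how `vla_common/` or SmolVLA implement chunking.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Action = List[float]


class ChunkedActionQueue:
    """Serve one action per control step, refilling from the policy when empty."""

    def __init__(self, predict_chunk: Callable[[Sequence[float]], List[Action]]):
        self._predict_chunk = predict_chunk  # hypothetical policy callable
        self._queue: Deque[Action] = deque()
        self.num_queries = 0  # how many times the policy was actually called

    def reset(self) -> None:
        """Drop any queued actions, e.g. at the start of a new episode."""
        self._queue.clear()

    def next_action(self, observation: Sequence[float]) -> Action:
        if not self._queue:
            chunk = self._predict_chunk(observation)
            if not chunk:
                raise ValueError("policy returned an empty action chunk")
            self._queue.extend(chunk)
            self.num_queries += 1
        return self._queue.popleft()


def stub_policy(observation: Sequence[float], chunk_size: int = 10) -> List[Action]:
    """Stand-in for a real model: returns chunk_size copies of the observation."""
    return [list(observation) for _ in range(chunk_size)]


if __name__ == "__main__":
    queue = ChunkedActionQueue(stub_policy)
    for step in range(25):
        action = queue.next_action([float(step)])
    print(f"25 control steps used {queue.num_queries} policy queries")
```

With a chunk size of 10, the 25 control steps above trigger 3 policy queries. In an HTTP setup, the `predict_chunk` callable would wrap a request to the inference server, so chunking also reduces the number of round trips per episode.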

Pre-trained checkpoints: `checkpoints/` is reserved for shared SmolVLA `pretrained_model/` directories.
Fine-tuned models are typically too large to commit directly. Preferred hosting options:
- Hugging Face Hub model card (free, GPU-friendly download)
- Google Drive shared link (with `gdown` instructions)

When a stable checkpoint is published, this README will be updated with download instructions and direct usage commands.
MIT (see LICENSE).