Getting started with Vision-Language-Action (VLA) models? This guide takes you from the foundations to the frontier: diffusion and flow matching, state-of-the-art robot foundation model (RFM) architectures, data scaling, RL fine-tuning, and world models. Papers are listed in reading order.
- Basic probability & optimization (enough to follow ELBO, score matching derivations)
- Deep learning fundamentals (Transformers, attention, tokenization)
- 💡 Starting from scratch? MIT 6.S191 — Intro to Deep Learning covers CNNs, Transformers, and generative models in a 1-week intensive bootcamp. More courses below.
- Paper presentation: 1–2 participants per week, 30 min/paper — architecture, training, key results
- Discussion: Compare design choices across the week's papers, discuss limitations and open questions (15–20 min)
| Phase | Weeks | Topic | Readings |
|---|---|---|---|
| Phase 1 | W1–3 | Generative Model Foundations | MIT 6.S184 course |
| Phase 2 | W4–5 | Early Foundation RFMs & Robot Policy | RT-1, RT-2, Octo, OpenVLA, BeT, Diffusion Policy, ACT |
| Phase 3 | W6–7 | Current RFM Architectures | CogACT, GR00T N1, X-VLA, π0, InternVLA-M1 |
| Phase 4 | W8–9 | Data Scaling | OXE, AgiBot World, UMI, VITRA, Human to Robot Transfer |
| Phase 5 | W10–11 | Efficient Inference & Dual-System | RTC, SmolVLA, Helix, Fast-in-Slow |
| Phase 6 | W12–14 | RL Fine-tuning, Reasoning & World Model | HIL-SERL, SimpleVLA-RL, π*0.6, CoT-VLA, ThinkAct, Fast-ThinkAct, UniVLA, Cosmos Policy, DreamZero |
📚 Core Material: MIT 6.S184 — Introduction to Flow Matching and Diffusion Models (Holderrieth & Erives, MIT CSAIL, 2025) | Course notes paper
| Material | Topic |
|---|---|
| Lectures 1–2 | ODE/SDE basics, forward/reverse processes, conditional/marginal probability paths |
| Lab 1 | Hands-on SDE simulation |
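As a warm-up for the SDE simulation lab, here is a minimal Euler-Maruyama sketch for an Ornstein-Uhlenbeck process. The function name `simulate_ou`, the choice of process, and all parameter values are illustrative assumptions, not taken from the course materials.

```python
import math
import random

def simulate_ou(x0, theta=1.0, sigma=0.5, dt=0.02, n_steps=500, n_paths=500, seed=0):
    """Euler-Maruyama for dX = -theta * X dt + sigma dW; returns each path's final value."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment over one step
            x += -theta * x * dt + sigma * dw    # drift term plus diffusion term
        finals.append(x)
    return finals

finals = simulate_ou(x0=3.0)
mean = sum(finals) / len(finals)
std = math.sqrt(sum((x - mean) ** 2 for x in finals) / len(finals))
stationary_std = 0.5 / math.sqrt(2 * 1.0)        # sigma / sqrt(2 * theta)
print(f"mean={mean:.3f} std={std:.3f} stationary_std={stationary_std:.3f}")
```

After enough simulated time the paths forget the starting point, so the empirical mean should sit near 0 and the spread near `sigma / sqrt(2 * theta)`, up to sampling and discretization error.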
| Material | Topic |
|---|---|
| Lectures 3–4 | Flow Matching, Score Matching, guidance, classifier-free guidance |
| Labs 2–3 | Building a toy diffusion model from scratch |
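For orientation, the block below writes out a conditional flow matching objective and a classifier-free guidance rule in one common convention. The straight-line path, the noise-to-data time direction, and the symbols $u_\theta$ and $w$ are assumptions chosen for illustration; the course notes may use a more general path and different notation.

```latex
% Straight-line probability path from noise x_0 to data x_1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

% Conditional flow matching loss: regress the network onto the path's velocity
\mathcal{L}_{\text{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1} \left\| u_\theta(x_t, t) - (x_1 - x_0) \right\|^2

% Classifier-free guidance with scale w (w = 1 recovers the conditional model)
\tilde{u}_\theta(x, t \mid y)
  = (1 - w)\,u_\theta(x, t \mid \varnothing) + w\,u_\theta(x, t \mid y)
```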
| Material | Topic |
|---|---|
| Lecture 5 | Guest lecture by Benjamin Burchfiel (Toyota Research): diffusion models for robotics |
| Lecture 6 | Generative protein design (optional) |
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 1 | RT-1: Robotics Transformer — Brohan et al. (2022) | 2212.06817 | First large-scale Robotics Transformer (no VLM) |
| 2 | RT-2: Vision-Language-Action Models — Brohan et al. (2023) | 2307.15818 | VLM backbone → VLA paradigm |
| 3 | Octo — Ghosh et al. (2024) | 2405.12213 | Open-source generalist policy, modular design, pretrained on OXE (no VLM) |
| 4 | OpenVLA — Kim et al. (2024) | 2406.09246 | First open-source VLM-based VLA |
📎 Supplementary video: Stanford CS25 V3 — Low-level Embodied Intelligence
Key points: RT-1 (35M, no VLM) → RT-2 (55B VLM, action as text tokens) establishes the VLA concept. Octo (27M–93M, diffusion head, no VLM) and OpenVLA (7B, VLM + 256-bin discretization) are the first open-source generalist robot policies enabling community iteration.
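The 256-bin discretization mentioned above can be sketched as uniform binning per action dimension. This is a minimal sketch: the function names and the per-dimension `low`/`high` ranges are assumptions for illustration, and how a given model chooses those ranges or maps bins to vocabulary tokens is not shown.

```python
def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                  # clip to the assumed range
        frac = (a - lo) / (hi - lo)              # position within the range, in [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def undiscretize(tokens, low, high, n_bins=256):
    """Map bin indices back to the center value of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins for t, lo, hi in zip(tokens, low, high)]

low, high = [-1.0] * 4, [1.0] * 4
action = [0.0, 0.5, -1.0, 1.0]
tokens = discretize(action, low, high)           # [128, 192, 0, 255]
recovered = undiscretize(tokens, low, high)
print(tokens, recovered)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating actions as discrete tokens.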
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 5 | Behavior Transformers (BeT) — Shafiullah et al. (2022) | 2206.11251 | Multimodal action discretization, k-means + offset |
| 6 | Diffusion Policy — Chi et al. (2023) | 2303.04137 | Diffusion for robot control, action sequence prediction |
| 7 | ACT/ALOHA — Zhao et al. (2023) | 2304.13705 | Action Chunking Transformer, CVAE, bimanual |
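Key points: Three approaches to the multimodal action problem, where the same observation admits several valid actions: k-means discretization with offsets (BeT), diffusion (Diffusion Policy), and a CVAE (ACT). Action chunking (predicting K future actions at once) is foundational for later VLA work.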
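Action chunking can be illustrated with a small control loop that queries the policy once per chunk and executes the predicted actions open-loop. This is a simplified sketch: `run_chunked` and `toy_policy` are hypothetical names, and refinements such as re-planning before a chunk ends or averaging overlapping chunks are left out.

```python
def run_chunked(policy, observe, horizon, chunk_size):
    """Query `policy` once per chunk and execute its actions without re-observing."""
    executed, calls, t = [], 0, 0
    while t < horizon:
        chunk = policy(observe(t), chunk_size)   # chunk_size future actions at once
        calls += 1
        for action in chunk[: horizon - t]:      # do not run past the horizon
            executed.append(action)
            t += 1
    return executed, calls

def toy_policy(obs, k):
    # Hypothetical stand-in for a learned policy: predicts obs, obs + 1, ..., obs + k - 1.
    return [obs + i for i in range(k)]

actions, calls = run_chunked(toy_policy, observe=lambda t: t, horizon=10, chunk_size=4)
print(actions, calls)
```

With a horizon of 10 and chunks of 4, the policy is called 3 times instead of 10, which is the latency benefit that makes chunking attractive for large models.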
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 8 | CogACT — Li et al. (2024) | 2411.19650 | VLM + DiT action head, action token learning |
| 9 | GR00T N1 — Bjorck et al. (2025) | 2503.14734 | 2B diffusion transformer, whole-body humanoid control |
| 10 | X-VLA — Zheng et al. (2025) | 2510.10274 | Soft prompts for cross-embodiment, Florence-Large + flow matching |
Key points: All three use only the VLM's last hidden state to drive a separate action head.
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 11 | π0 — Black et al. (2024) | 2410.24164 | Flow matching + action expert accessing VLM intermediate features |
| 12 | InternVLA-M1 — Chen et al. (2025) | 2510.13778 | Spatial grounding → action generation, AR-based |
📎 Background: Transfusion — Zhou et al. (2024) | 2408.11039 — AR + diffusion in one transformer; π0's architectural basis
Key points: Unlike Week 6's action heads that only see the VLM's last hidden state, these action experts access VLM internal hidden states.
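For a flow-matching action expert, inference amounts to integrating a learned velocity field from noise to an action. The sketch below shows only that integration loop with Euler steps. `toy_velocity` is a hypothetical stand-in for a trained network (it is the exact velocity of a straight-line path ending at one fixed target), and `n_steps=10` is an illustrative default.

```python
import random

def sample_actions(velocity_fn, dim, n_steps=10, seed=0):
    """Integrate da/dtau = velocity_fn(a, tau) from tau = 0 (noise) to tau = 1."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        v = velocity_fn(a, tau)
        a = [ai + dt * vi for ai, vi in zip(a, v)]   # one Euler step
    return a

TARGET = [0.2, -0.4, 0.9]                            # hypothetical target action

def toy_velocity(a, tau):
    # Velocity of the straight-line path from the current point to TARGET.
    return [(t - ai) / (1.0 - tau) for t, ai in zip(TARGET, a)]

actions = sample_actions(toy_velocity, dim=len(TARGET))
print(actions)
```

In a real model the velocity would also depend on the observation and language features supplied by the VLM; here that conditioning is collapsed into the fixed `TARGET`.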
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 13 | Open X-Embodiment (OXE) — Open X-Embodiment Collaboration (2023) | 2310.08864 | 1M+ trajectories, 22 embodiments, standardized data format |
| 14 | AgiBot World — Bu et al. (2025) | 2503.06669 | 1M+ trajectories, 217 tasks, 5 deployment scenarios |
📎 Data formats — Recording-oriented: rosbag (ROS 1), mcap (vendor-neutral, ROS 2 default). Training-oriented: RLDS (TensorFlow/OXE standard), LeRobotDataset (HuggingFace, Parquet + video).
📎 From the Evolution of Rosbag to the Future of AI Tooling — by the original rosbag author; covers rosbag V1→V2 → rosbag2 (sqlite3) → MCAP evolution
Key points: Large-scale multi-embodiment datasets that enable generalist robot policy pretraining. OXE standardized the data format across 22 robot embodiments via RLDS; AgiBot World provides high-quality data at scale.
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 15 | UMI — Chi et al. (2024) | 2402.10329 | Robot-free SE(3) data collection via handheld gripper |
| 16 | VITRA — Li et al. (2025) | 2510.21571 | Human video → VLA training data (1M episodes from egocentric human videos) |
| 17 | Human to Robot Transfer — Kareer et al. (2025) | 2512.22414 | Human video → robot transfer emerges with VLA scaling |
Key points: Three data sources beyond robot teleoperation — UMI (embodiment-agnostic physical demos, <$200 hardware), egocentric video, and exocentric video.
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 18 | SmolVLA — Shukor et al. (2025) | 2506.01844 | 450M params (~1/7 of π0), model compression + async inference |
| 19 | RTC — Black et al. (2025) | 2506.07339 | Async inference — freezing + inpainting, no retraining needed |
Key points: SmolVLA and RTC are complementary. SmolVLA compresses the model itself, RTC optimizes the inference pipeline, and the two can be combined.
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 20 | Helix — Figure AI (2025) | figure.ai/news/helix | S2: 7B VLM @7-9Hz, S1: 80M @200Hz, humanoid |
| 21 | Fast-in-Slow — Chen et al. (2025) | 2506.01953 | Integrated dual-system, end-to-end trainable |
Key points: Dual-System separates slow reasoning (VLM) from fast execution (lightweight policy) at different frequencies. Helix (separately trained) vs Fast-in-Slow (end-to-end trainable).
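The frequency split can be made concrete with a toy scheduler in which the fast loop reuses the most recent output of the slow system. This is a single-threaded simplification under assumed rates (200 Hz and 8 Hz); `slow_system` and `fast_system` are hypothetical placeholders, and real systems run the two parts asynchronously.

```python
def slow_system(tick):
    # Hypothetical stand-in for the VLM: returns a latent naming the current subgoal.
    return f"subgoal@{tick}"

def fast_system(latent, tick):
    # Hypothetical stand-in for the lightweight policy: one low-level action per tick.
    return (latent, tick)

def run_dual_system(duration_s, fast_hz=200, slow_hz=8):
    """Run the fast loop every tick and refresh the slow latent every few ticks."""
    ticks_per_slow = fast_hz // slow_hz              # fast ticks between slow updates
    latent, slow_calls, actions = None, 0, []
    for tick in range(int(duration_s * fast_hz)):
        if tick % ticks_per_slow == 0:
            latent = slow_system(tick)               # slow reasoning, rarely
            slow_calls += 1
        actions.append(fast_system(latent, tick))    # fast execution, every tick
    return slow_calls, actions

slow_calls, actions = run_dual_system(duration_s=1.0)
print(slow_calls, len(actions))
```

Over one simulated second the slow system runs 8 times while the fast system emits 200 actions, each conditioned on a latent that may be up to 24 ticks old.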
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 22 | HIL-SERL — Luo et al. (2024) | 2410.21845 | Human-in-the-loop RL, sample-efficient real-world training |
| 23 | SimpleVLA-RL — Li et al. (2025) | 2509.09674 | RL fine-tuning for autoregressive VLA, outcome-based rewards |
| 24 | π*0.6 / Recap — Physical Intelligence (2025) | 2511.14759 | RL for flow-based VLA, advantage-conditioned, learns from suboptimal data |
Key points: Three RL approaches — HIL-SERL (human-in-the-loop, sample-efficient), SimpleVLA-RL (outcome rewards), π*0.6 (advantage-conditioned, learns from suboptimal data).
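The general idea behind advantage conditioning can be sketched as a data-labeling step: every episode, good or bad, is kept for training but tagged by whether it beat a value baseline. This is a sketch of the general idea rather than any paper's exact recipe; the field names, the binary tag, and the threshold are assumptions.

```python
def label_by_advantage(episodes, values, threshold=0.0):
    """Tag each episode by the sign of (return - value baseline)."""
    labeled = []
    for ep, v in zip(episodes, values):
        advantage = ep["return"] - v
        tag = "positive" if advantage > threshold else "negative"
        labeled.append({**ep, "advantage": advantage, "condition": tag})
    return labeled

episodes = [{"id": 0, "return": 1.0}, {"id": 1, "return": 0.0}, {"id": 2, "return": 1.0}]
values = [0.5, 0.5, 1.0]                 # hypothetical value estimates per episode
labeled = label_by_advantage(episodes, values)
labels = [ep["condition"] for ep in labeled]
print(labels)
```

A policy trained with the tag as an extra input can then be asked for "positive" behavior at inference time, which is how suboptimal data stays useful instead of being discarded.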
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 25 | CoT-VLA — Zhao et al. (2025) | 2503.22020 | Visual chain-of-thought reasoning (future image prediction) before action |
| 26 | ThinkAct — Huang et al. (2025) | 2507.16815 | Decouple reasoning from execution; RL grounds plan quality in task success, not language supervision |
| 27 | Fast-ThinkAct — Huang et al. (2026) | 2601.09708 | Text-level CoT dispensable — latent distillation preserves planning capacity at ~10× speed |
📎 Fast-ThinkAct's reasoning compression is orthogonal to Week 10's efficiency methods (SmolVLA's model compression, RTC's asynchronous inference), so the techniques can stack.
Key points: Reasoning representation — image tokens (CoT-VLA) vs. visual latent (ThinkAct) vs. compressed latent tokens (Fast-ThinkAct). ThinkAct grounds reasoning in task-outcome RL instead of language supervision. Fast-ThinkAct shows planning structure, not verbosity, carries the signal (~10× faster, performance preserved).
| # | Paper | Link | Key Topic |
|---|---|---|---|
| 28 | UniVLA — Wang et al. (2025) | 2506.19850 | Unified AR VLA with world modeling as training objective |
| 29 | Cosmos Policy — Kim et al. (2026) | 2601.16163 | Pretrained video foundation model as robot policy backbone |
| 30 | DreamZero — Ye et al. (2026) | dreamzero0.github.io | World Action Model, joint world+action generation in latent space |
Key points: Three ways to leverage world knowledge — training regularizer (UniVLA, no world prediction at inference), pretrained video FM as policy backbone (Cosmos Policy), joint world+action generation in latent space (DreamZero).
Suggestions for papers, resources, or structural improvements are welcome — please open an issue or PR.
- 🔥 vla0-trl — A complete VLA in ~1,200 lines of Python. Fine-tunes Qwen2.5-VL with TRL's SFTTrainer to predict actions as text, scoring ~90% on LIBERO. Read the entire codebase in an afternoon.
- Awesome-RL-VLA — RL for VLA models
- Awesome-VLA-Robotics — Large-scale VLA paper collection
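Predicting actions as text, as in the vla0-trl entry above, needs a reversible mapping between continuous actions and a string the language model can emit. The sketch below shows one such mapping. The function names, the integer range `max_int=1000`, and the space-separated format are assumptions for illustration and are not taken from that codebase.

```python
def actions_to_text(action, low=-1.0, high=1.0, max_int=1000):
    """Scale each action dimension to an integer in [0, max_int] and join as text."""
    ints = []
    for a in action:
        a = min(max(a, low), high)                       # clip to the assumed range
        ints.append(round((a - low) / (high - low) * max_int))
    return " ".join(str(i) for i in ints)

def text_to_actions(text, low=-1.0, high=1.0, max_int=1000):
    """Parse space-separated integers back into continuous actions."""
    return [low + int(tok) / max_int * (high - low) for tok in text.split()]

text = actions_to_text([0.0, 0.5, -1.0, 1.0])
decoded = text_to_actions(text)
print(text, decoded)
```

With a target string like this, fine-tuning reduces to ordinary supervised text generation, which is why a standard SFT trainer is enough.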
Courses covering the prerequisites for this study guide — only those with recent (2023+) video lectures freely available on YouTube. Pick what you need.
| Area | Course | Instructor | Link | Notes |
|---|---|---|---|---|
| DL Fundamentals | MIT 6.S191: Intro to Deep Learning | Alexander Amini | introtodeeplearning.com · YouTube '25 | 1-week bootcamp (10 lectures) — CNN, Transformer, generative models, RL |
| | Andrej Karpathy: Neural Networks: Zero to Hero | Andrej Karpathy | karpathy.ai/zero-to-hero.html · YouTube | Backprop → GPT, build everything from scratch in code |
| Vision | Stanford CS231n: DL for Computer Vision | Fei-Fei Li et al. | cs231n.stanford.edu · YouTube '25 | The canonical CV course — backprop to detection/segmentation/video |
| NLP / Transformers | Stanford CS224n: NLP with Deep Learning | Christopher Manning | web.stanford.edu/class/cs224n · YouTube '24 | Word vectors → Transformers → LLMs |
| RL | UC Berkeley CS285: Deep RL | Sergey Levine | rail.eecs.berkeley.edu/deeprlcourse · YouTube '23 | Policy gradients, Q-learning, model-based & offline RL — by a leading robotics RL researcher |