MolmoMotion is a 4B vision-language model that forecasts 3D point
trajectories under natural-language action instructions. Given a short
RGB observation history, a set of user-specified 2D query points with their
initial 3D positions, and a language description of the intended action,
the model predicts each query point's 3D trajectory for up to ~2 seconds
in the camera-frame-at-t₀ coordinate frame. We show that the learned
motion prior transfers to robotics planning and to motion-guided video
generation.
This repository covers the autoregressive (AR) variant from the paper, together with the MolmoMotion-1M training corpus and the PointMotionBench evaluation suite. See the paper or the blog post for the full methodology and results.
- Setup
- Quick Start
- Data and benchmark construction
- Training
- Evaluation
- HuggingFace Conversion
- Robotics: MolmoBot finetuning
- Citation
- License
git clone https://github.com/allenai/molmo-motion.git
cd molmo-motion
conda create -n molmo-motion python=3.11 -y
conda activate molmo-motion
pip install -e .[viz]GPU / driver note.
pip installpulls the default PyTorch wheels, which target the newest CUDA runtime and may not match an older driver (torch.cuda.is_available()then returnsFalse). If so, install a torch build matching your driver from the PyTorch index, e.g. for a CUDA 12.8 driver:pip install "torch==2.9.1" torchvision "torchcodec==0.9.*" \ --index-url https://download.pytorch.org/whl/cu128
torchcodecmust match the torch minor version (0.9.x ↔ torch 2.9). The video decoder also needs FFmpeg shared libraries on the system (conda install -c conda-forge ffmpeg); the bundled training/eval recipes decode with OpenCV and do not require it, but thetorchcodec_exactpath does.
Installation registers three console scripts:
| Command | Purpose |
|---|---|
molmo-motion-train |
torchrun-compatible YAML-config training driver (the released recipes use torchrun launch_scripts/sft.py directly — see Training) |
molmo-motion-eval |
torchrun-compatible YAML-config evaluation driver |
molmo-motion-convert-hf |
OLMo-native → HuggingFace checkpoint converter |
Training and evaluation read two separate corpora from HuggingFace:
| Path | Used for | HF repo |
|---|---|---|
MOLMO_MOTION_1M_ROOT |
Training | allenai/molmo-motion-1m |
POINTMOTIONBENCH_ROOT |
Evaluation | allenai/PointMotionBench |
Export both before running anything in this README. The two roots must
point at new directories outside this repo — they will be populated
by hf download below. Do not point them at
dataset_recipes/ or
pointmotionbench/, which only hold recipes and
documentation.
export MOLMO_MOTION_1M_ROOT=/your/path/to/molmo-motion-1m
export POINTMOTIONBENCH_ROOT=/your/path/to/PointMotionBenchDownload:
# Training corpus.
hf download allenai/molmo-motion-1m \
--repo-type dataset --local-dir $MOLMO_MOTION_1M_ROOT
# Evaluation benchmark — only needed for `launch_scripts/eval_pointmotionbench.py`.
hf download allenai/PointMotionBench \
--repo-type dataset --local-dir $POINTMOTIONBENCH_ROOTLayout after download:
$MOLMO_MOTION_1M_ROOT/
├── egodex/ annotations/ tracks/ camera/
├── ytvis/ annotations/ tracks/ camera/
├── hdepic/ annotations/ tracks/ camera/
├── xperience/ annotations/ tracks/
├── stereo4d/ annotations/ track_index/
├── droid/ annotations/ tracks/ camera/ # robot teleop, NOT used by the default recipe
└── molmospaces/ annotations/ tracks/ camera/ videos/ # sim, NOT used by the default recipe
$POINTMOTIONBENCH_ROOT/
├── hot3d/
├── worldtrack/
└── davis/
The five datasets above the line are the ones the public training recipe uses (see Training). DROID and MolmoSpaces ship under the same root for users who want to extend the recipe, but the bundled recipe does not touch them.
Most datasets ship annotations + tracks + per-frame camera; the raw videos
(and, per dataset, some derived signals) are license-restricted and are
reconstructed locally from each subset's original source. Each dataset
directory on HuggingFace includes its own README.md and reconstruction
script — see Data and benchmark construction.
All released checkpoints are the autoregressive (AR) variant of MolmoMotion.
| Model | History H | Future F | HuggingFace |
|---|---|---|---|
| MolmoMotion-4B-H3-F30 | 3 | 30 | allenai/MolmoMotion-4B-H3-F30 |
| MolmoMotion-4B-H1-F32 | 1 | 32 | allenai/MolmoMotion-4B-H1-F32 |
hf download allenai/MolmoMotion-4B-H3-F30 \
--local-dir checkpoints/MolmoMotion-4B-H3-F30Pick H=3 / F=30 for typical video use (3 history frames, predict 2 seconds at 15 fps). Pick H=1 / F=32 when only a single query keyframe is available.
Stage-1 training (see Training) starts from the
Molmo2-4B-Pretrain checkpoint — the pretrain stage of
Molmo2, released by Ai2
alongside the Molmo2 codebase. Download URL is published in the Molmo2 README's
Checkpoints table:
wget https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-Pretrain.tar
tar -xvf Molmo2-4B-Pretrain.tar
# The extracted folder is what `/path/to/Molmo2-4B-Pretrain` refers to in Stage 1.oe-training-public is an unauthenticated GCS bucket, so the wget
above works without any gcloud credentials.
Below we run a single forward pass on a bundled clip and read the
(P, F, 3) future trajectory. Expected wall-clock on a single 80 GB A100:
~110 s for checkpoint load + ~40 s for predict_trajectory().
import torch
from PIL import Image
from molmo_motion import MolmoMotion, MolmoMotionProcessor
CKPT = "allenai/MolmoMotion-4B-H3-F30"
# 1. Load model + matching processor.
processor = MolmoMotionProcessor.from_pretrained(CKPT)
model = MolmoMotion.from_pretrained(CKPT)
model._internal = model._internal.to(torch.bfloat16).cuda() # 4B params
# 2. Build one inference example. With the H=3 model:
# history_frames — three PIL images, ordered earliest → t₀
# points_2d_at_t0 — (P, 2) tensor of pixel coords at t₀
# points_3d_history — (H, P, 3) tensor in camera-frame-at-t₀
# action — short action description
# future_horizon — number of future frames to predict
EXAMPLE_DIR = "examples/data/quickstart/davis_car_turn"
history_frames = [
Image.open(f"{EXAMPLE_DIR}/frame_t-2.jpg").convert("RGB"),
Image.open(f"{EXAMPLE_DIR}/frame_t-1.jpg").convert("RGB"),
Image.open(f"{EXAMPLE_DIR}/frame_t+0.jpg").convert("RGB"),
]
points_2d_at_t0 = torch.load(f"{EXAMPLE_DIR}/points_2d_at_t0.pt")
points_3d_history = torch.load(f"{EXAMPLE_DIR}/points_3d_history.pt")
action = open(f"{EXAMPLE_DIR}/caption.txt").read().strip()
inputs = processor(
history_frames=history_frames,
points_2d_at_t0=points_2d_at_t0,
points_3d_history=points_3d_history,
action=action,
future_horizon=30,
)
inputs = {k: v.cuda() if torch.is_tensor(v) else v for k, v in inputs.items()}
# 3. Forward.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
out = model.predict_trajectory(**inputs)
# 4. `out.future_3d` is the decoded prediction: a (P=8, F=30, 3) tensor of
# absolute camera-frame XYZ coordinates in meters, one row per future
# frame, for each of the 8 query points (no further parsing needed —
# `predict_trajectory` already turned the raw `<tracks>` block into
# floats and added the anchor back).
future_3d = out.future_3d.cpu().numpy() # (8, 30, 3), meters
print(f"future_3d.shape: {future_3d.shape}")
# Per-point predicted positions at the first future frame (= t₀ + 1):
for pi in range(future_3d.shape[0]):
x, y, z = future_3d[pi, 0]
print(f" point {pi}: (x={x:+.3f}, y={y:+.3f}, z={z:+.3f}) m")
# Point 0's full predicted trajectory across all 30 future frames:
print(f"point 0 trajectory (F=30): {future_3d[0].round(3).tolist()}")MOLMO_MOTION_1M_ROOT and POINTMOTIONBENCH_ROOT only land annotations,
tracks, and (for most datasets) camera when downloaded — the raw videos and
a few derived signals are license-restricted and rebuilt locally. The
recipes live in three places; each per-dataset README.md is
authoritative.
| Where | What it provides |
|---|---|
allenai/molmo-motion-1m (HF) |
Per-dataset reconstruction for MOLMO_MOTION_1M_ROOT — each <dataset>/ ships its README.md + reconstruct_*.py alongside the annotations (EgoDex, YT-VIS, HD-EPIC, Xperience, Stereo4D, DROID, MolmoSpaces). See dataset_recipes/ for the pointer. |
pointmotionbench/ |
Per-subset reconstruction for POINTMOTIONBENCH_ROOT (DAVIS / HOT3D / WorldTrack) |
data_generation/ |
The pipeline code that annotates new raw videos with the same 3D track annotation schema |
After following the per-dataset READMEs (reconstruction adds videos/
and, per dataset, the remaining derived signals), the two roots look like:
$MOLMO_MOTION_1M_ROOT/
├── egodex/ annotations/ tracks/ camera/ videos/
├── ytvis/ ...
├── hdepic/ ...
├── xperience/ ...
├── stereo4d/ ...
├── droid/ ...
└── molmospaces/ ...
$POINTMOTIONBENCH_ROOT/
├── davis/
├── hot3d/
└── worldtrack/
Training reads from $MOLMO_MOTION_1M_ROOT; eval reads from
$POINTMOTIONBENCH_ROOT. No glue beyond setting the env vars.
Stereo4D heads-up. The HuggingFace download ships only a
track_index/for Stereo4D;tracks/andcamera/are both rebuilt locally. Run$MOLMO_MOTION_1M_ROOT/stereo4d/reconstruct_tracks.py(perstereo4d/README.md) beforescripts/build_track_keys_cache.py, otherwise the cache builder reports all 23,011 Stereo4D entries as missing NPZs.
MolmoMotion is trained in two stages. Both stages share the public training mix — the five human-video datasets in MolmoMotion-1M (EgoDex, YT-VIS, HD-EPIC, Xperience, Stereo4D); DROID and MolmoSpaces are excluded by default. Both stages start from a Molmo2-4B-Pretrain checkpoint as the VLM backbone.
We assume the corpus is downloaded as described in Downloading the Dataset and Benchmark and the env vars are exported.
export MOLMO_MOTION_1M_ROOT=/your/path/to/molmo-motion-1m
export POINTMOTIONBENCH_ROOT=/your/path/to/PointMotionBenchBefore the first run, build the track-keys cache (a one-time scan that lets the loader drop split entries whose NPZ keys diverged upstream):
python scripts/build_track_keys_cache.pyThe training recipes log to Weights & Biases, so export your project and entity before launching (the run aborts with an error if either is unset):
export WANDB_PROJECT=molmo-motion
export WANDB_ENTITY=<your-wandb-entity>
# or disable logging entirely: export WANDB_MODE=disabledTrain on the five human-video datasets with sqrt-frequency mixing
(p_i ∝ √N_i) — this is the recipe the released H3-Pretrain model uses.
torchrun --nproc-per-node=8 launch_scripts/sft.py \
/path/to/Molmo2-4B-Pretrain \
trajectory_3d_human_p8_h3_f8 \
--save_folder=checkpoints/MolmoMotion-Stage1 \
--model.mm_preprocessor.video.max_frames=3 \
--model.mm_preprocessor.image.max_crops=1 \
--seq_len=2560 \
--model.llm.max_sequence_length=2560 \
--device_batch_size=2 \
--max_duration=40000 \
--save_interval=2000 \
--eval_interval=5000Recipe summary:
| Field | Value |
|---|---|
| Backbone init | Molmo2-4B-Pretrain |
| Dataset name | trajectory_3d_human_p8_h3_f8 |
| Datasets in mix | egodex, ytvis, hepic, xperience, stereo4d |
| Mixing | sqrt-frequency |
| Points P | 8 |
| History H | 3 |
| Future F | 8 |
| Steps | 40,000 |
| Compute (released) | 16 GPUs (2 nodes × 8 GPUs) |
| Seq-len | 2560 |
| Precision | bf16 + FSDP2 |
The _human token expands to the 5-dataset mix above. To train on a custom
subset, list datasets explicitly:
# egodex only
trajectory_3d_egodex_p8_h3_f8
# 3-dataset ablation
trajectory_3d_egodex_xperience_hepic_p8_h3_f8Continue from the Stage-1 checkpoint with a longer future horizon. Two flavors are released, differing only in the history length:
# H=3, F=30 (typical 3-frame video setting)
torchrun --nproc-per-node=8 launch_scripts/sft.py \
checkpoints/MolmoMotion-Stage1/step40000 \
trajectory_3d_human_p8_h3_f30 \
--save_folder=checkpoints/MolmoMotion-H3-F30 \
--model.mm_preprocessor.video.max_frames=3 \
--model.mm_preprocessor.image.max_crops=1 \
--seq_len=6144 \
--model.llm.max_sequence_length=6144 \
--device_batch_size=2 \
--max_duration=10000 \
--save_interval=1000 \
--eval_interval=2500
# H=1, F=32 (single-keyframe setting)
torchrun --nproc-per-node=8 launch_scripts/sft.py \
checkpoints/MolmoMotion-Stage1/step40000 \
trajectory_3d_human_p8_h1_f32 \
--save_folder=checkpoints/MolmoMotion-H1-F32 \
--model.mm_preprocessor.video.max_frames=1 \
--model.mm_preprocessor.image.max_crops=1 \
--seq_len=6144 \
--model.llm.max_sequence_length=6144 \
--device_batch_size=2 \
--max_duration=10000 \
--save_interval=1000 \
--eval_interval=2500Stage 2 uses the same five-dataset mix as Stage 1; only the future horizon F and (for the H=1 variant) the history length H change. Every annotated clip in the five datasets contributes to training — there is no held-out human-corpus split. Evaluation is run separately on PointMotionBench, which never overlaps with training data.
We evaluate on PointMotionBench (HOT3D + WorldTrack + DAVIS) following the same setup as the paper:
- Predict future motion for up to 2 seconds at 15 fps (F=30) — or F=32 for the H=1 release.
- If the clip is shorter, evaluate only on the valid future frames.
- Use all annotated query points per clip (not just 8): the model
predicts in
P=8chunks, with metrics averaged across all points so the comparison is fair across model sizes. - Best-of-1 deterministic decoding (greedy) by default; the paper reports
best-of-5 numbers, see
--n_samples=5to reproduce.
# H=3, F=30 model
torchrun --nproc-per-node=8 launch_scripts/eval_pointmotionbench.py \
checkpoints/MolmoMotion-H3-F30/step10000 \
--benchmarks hot3d,worldtrack,davis \
--all_points \
--fixed_t0 \
--device_batch_size=2 \
--output_dir eval_out/MolmoMotion-H3-F30
# H=1, F=32 model
torchrun --nproc-per-node=8 launch_scripts/eval_pointmotionbench.py \
checkpoints/MolmoMotion-H1-F32/step10000 \
--benchmarks hot3d,worldtrack,davis \
--all_points \
--fixed_t0 \
--device_batch_size=2 \
--output_dir eval_out/MolmoMotion-H1-F32Flag semantics:
| Flag | Meaning |
|---|---|
--all_points |
Don't sub-sample 8 query points per clip — chunk all annotated points into groups of P=8 and average the metric across chunks. |
--fixed_t0 |
Pin the query frame at t = H − 1 so eval is deterministic across runs. Without this, t₀ is randomized per clip. |
--n_samples N |
Take the best of N samples per example (paper uses 5). Default 1. |
Output:
eval_out/MolmoMotion-H3-F30/
├── hot3d/
│ ├── predictions.json # per-example {gt, pred, video_id, point_idx, ...}
│ └── metrics.json # ADE / FDE / PWT aggregates
├── worldtrack/...
├── davis/...
└── summary.json # one-page rollup
summary.json reproduces the table format used in the paper:
{
"hot3d": {"ADE": 0.109, "FDE": 0.217, "PWT": 0.444, "n_clips": 2475},
"worldtrack": {"ADE": 0.143, "FDE": 0.261, "PWT": 0.445, "n_clips": 155},
"davis": {"ADE": 1.227, "FDE": 2.108, "PWT": 0.153, "n_clips": 90}
}All three are computed in 3D camera-frame meters and restricted to visibility-masked frames (padded futures for short clips are excluded).
- ADE (Average Displacement Error, ↓) — mean L2 error across all
visible query points and all predicted timesteps:
ADE = mean_{n,t}( ||p̂_t^n − p_t^n||_2 ) - FDE (Final Displacement Error, ↓) — L2 error at the final
predicted timestep:
FDE = mean_n( ||p̂_T^n − p_T^n||_2 ) - PWT (Points Within Threshold, ↑) — average fraction of predicted
points within
{0.01, 0.02, 0.05, 0.10, 0.20}meters of ground truth, averaged across thresholds:PWT = mean_{n,t,τ}( ||p̂_t^n − p_t^n||_2 ≤ τ )
Predicted 3D point trajectories on PointMotionBench across diverse motion patterns and language instructions. See Section 4 of the paper for the full quantitative comparison.
Convert an OLMo-native unsharded checkpoint into a
AutoModelForImageTextToText-loadable HF directory:
molmo-motion-convert-hf \
checkpoints/MolmoMotion-H3-F30/step10000-unsharded \
hf_export/MolmoMotion-H3-F30 \
--use_bfloat16This produces:
hf_export/MolmoMotion-H3-F30/
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── modeling_molmo_motion.py # bundled for `trust_remote_code=True`
├── configuration_molmo_motion.py
├── processing_molmo_motion.py
├── image_processing_molmo_motion.py
├── video_processing_molmo_motion.py
├── preprocessor_config.json
├── tokenizer_config.json
└── ... # tokenizer files
Push:
hf upload allenai/MolmoMotion-4B-H3-F30 \
hf_export/MolmoMotion-H3-F30 .Use a MolmoMotion checkpoint as the initialization for a
MolmoBot manipulation policy and
evaluate it on the
MolmoSpaces Franka pick-and-
place benchmark. The robotics/ subdirectory contains the
recipe.
See robotics/README.md for the full walkthrough.
MolmoMotion is trained on MolmoMotion-1M, which includes data derived from the Xperience dataset. We thank Ropedia for Xperience.
Disclaimer: The Xperience-derived data is subject to Ropedia's terms and conditions. Users who access or reconstruct that portion of the data must review and comply with the terms on the Xperience dataset page.
@inproceedings{molmomotion2026,
title = {MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction},
author = {<author list TBD>},
booktitle = {<venue TBD>},
year = {2026}
}Code: Apache 2.0. Trained model weights: Apache 2.0. Datasets carry their respective upstream licenses — see the per-dataset README inside allenai/molmo-motion-1m and allenai/PointMotionBench.
The vendored third-party models under
data_generation/third_party/ carry their
own licenses, which are not all Apache 2.0: SAM 3 ships under Meta's
SAM License, and ViPE's dependency stack includes UniDepth under
CC BY-NC 4.0 (non-commercial). See the LICENSE / THIRD_PARTY_LICENSES
files in each vendored directory before commercial use of the
data-generation pipeline.
