
GR00T Deployment & Inference Guide

Run inference with PyTorch or TensorRT acceleration for the GR00T N1.7 policy.


Prerequisites

  • Model checkpoint: nvidia/GR00T-N1.7-3B
  • Dataset in LeRobot format (e.g., demo_data/libero_demo)
  • CUDA-enabled GPU
  • Set up the uv environment following README.md

| Platform | Installation |
| --- | --- |
| dGPU (H100, A100, RTX 4090/5090, L20, RTX Pro 5000/6000, etc.) | uv sync — GPU deps (flash-attn, onnx, tensorrt) included |
| Jetson Thor | Jetson Thor Setup (Docker or bare metal) |
| DGX Spark | DGX Spark Setup (Docker or bare metal) |
| Jetson Orin | Jetson Orin Setup (Docker or bare metal) |
  • dGPU local environment: run the installation command below, then use the PyTorch or TensorRT commands in this guide
  • Thor Docker or bare metal: skip to Jetson Thor Setup
  • Spark Docker or bare metal: skip to DGX Spark Setup
  • Orin Docker or bare metal: skip to Jetson Orin Setup

dGPU Installation

uv sync

GPU dependencies (flash-attn, onnx, tensorrt) are included in the default install.

Download Model and Dataset

Download the finetuned model to a local directory (HuggingFace does not support nested repo paths directly):

uv run hf download nvidia/GR00T-N1.7-LIBERO \
  --include "libero_10/config.json" "libero_10/embodiment_id.json" \
  "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" \
  "libero_10/processor_config.json" "libero_10/statistics.json" \
  --local-dir checkpoints/GR00T-N1.7-LIBERO

For demo dataset setup, see the Data Format section in the main README.


Quick Start: PyTorch Inference

Run inference on demo trajectories using PyTorch (no TRT setup needed):

uv run python scripts/deployment/standalone_inference_script.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 1 2 3 4 \
  --inference-mode pytorch \
  --action-horizon 8

TensorRT Acceleration

The trt_full_pipeline mode (passed via --inference-mode trt_full_pipeline in standalone_inference_script.py) accelerates all model components with TRT engines. Speedup varies by platform — see benchmark tables below for measured results on each device. The same pipeline is referred to as n17_full_pipeline inside the engine-loading and build scripts (trt_model_forward.py, build_trt_pipeline.py); the two names describe the same set of engines.

| Component | Engine | Notes |
| --- | --- | --- |
| ViT | TRT | Qwen3-VL Vision (24 blocks, FP32 for accuracy) |
| LLM | TRT | Qwen3-VL Text Model (16 layers, with deepstack injection) |
| VL Self-Attention | TRT | SelfAttentionTransformer (4 layers, if present) |
| State Encoder | TRT | CategorySpecificMLP |
| Action Encoder | TRT | MultiEmbodimentActionEncoder |
| DiT | TRT | AlternateVLDiT (32 layers) |
| Action Decoder | TRT | CategorySpecificMLP |

Lightweight ops remain in PyTorch: embed_tokens, masked_scatter, get_rope_index, VLLN.
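
For reference when wiring TRT into your own code, the load/bind/execute pattern that the repo's trt_torch.py wrapper implements looks roughly like the sketch below, using the TensorRT 10 Python API. The engine path and buffer handling here are illustrative assumptions, not the repo's exact file layout:

```python
import tensorrt as trt
import torch

# Hypothetical engine path; the build step writes engines under <output-dir>/engines/.
ENGINE_PATH = "gr00t_trt_deployment/engines/dit.engine"

TRT_TO_TORCH = {
    trt.DataType.FLOAT: torch.float32,
    trt.DataType.HALF: torch.float16,
    trt.DataType.BF16: torch.bfloat16,
    trt.DataType.INT32: torch.int32,
    trt.DataType.INT64: torch.int64,
}

logger = trt.Logger(trt.Logger.WARNING)
with open(ENGINE_PATH, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one GPU buffer per I/O tensor and register its address with TRT.
stream = torch.cuda.Stream()
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(engine.get_tensor_shape(name))  # static shapes in these engines
    dtype = TRT_TO_TORCH[engine.get_tensor_dtype(name)]
    buffers[name] = torch.empty(shape, dtype=dtype, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

# Copy real inputs into the input buffers (omitted), then launch asynchronously.
context.execute_async_v3(stream_handle=stream.cuda_stream)
stream.synchronize()
```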

DiT-only mode (legacy from N1.6)

The dit_only export mode (--export-mode dit_only) optimizes only the action head DiT, leaving the backbone in PyTorch. This was the default in N1.6. For N1.7, full_pipeline is recommended because it also accelerates the backbone (ViT + LLM), which dominates inference time.

Build TRT Engines

The unified build_trt_pipeline.py script runs all steps (export ONNX → build engines → verify accuracy → benchmark) in a single command:

uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA

Finetuned models: Replace --model-path with your checkpoint path. The pipeline is identical for base and finetuned models.

Note: Engine build takes ~2-5 minutes depending on GPU. Engines are GPU-architecture-specific and must be rebuilt for different GPUs.
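
To keep engines from different machines apart, a quick check of the device's compute capability helps (a minimal sketch, not a repo utility; for example, H100 reports sm_90 and DGX Spark sm_121):

```python
import torch

# Engines only run on the SM architecture they were built for.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
```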

Batch size: The --batch-size value is baked as a static dimension into the ONNX and TRT models. Engines built with one batch size cannot be used with a different batch size at runtime. If you need a different batch size, re-run the full pipeline (--steps export,build,verify) with the new --batch-size value.

You can also run a subset of steps:

# Export + build only (skip verify and benchmark)
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --steps export,build

What each step does

The pipeline runs 4 steps in sequence:

  1. Export to ONNX (export) — Exports all model components (LLM, VL Self-Attention, State Encoder, Action Encoder, DiT, Action Decoder) to ONNX format under <output-dir>/onnx/.
  2. Build TensorRT Engines (build) — Compiles each ONNX model into a GPU-specific TensorRT engine under <output-dir>/engines/.
  3. Verify Accuracy (verify) — Runs PyTorch vs TRT output comparison. Expected: Cosine Similarity: 0.999+ (PASS).
  4. Benchmark (benchmark) — Measures E2E latency for PyTorch Eager, torch.compile, and TRT modes.

Each step can be run individually via --steps <step>. Verbose logs are written to <output-dir>/pipeline.log.


Performance

Benchmark Results

GR00T N1.7 Inference Timing (4 denoising steps, 1 camera):

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency | E2E Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| dGPU |  |  |  |  |  |  |  |
| H100 80GB HBM3 | PyTorch Eager | 6.2 ms | 31.3 ms | 48.2 ms | 85.8 ms | 11.7 Hz | 1.00x |
|  | torch.compile | 6.2 ms | 30.4 ms | 12.0 ms | 48.6 ms | 20.6 Hz | 1.77x |
|  | TensorRT (Full Pipeline) | 6.2 ms | 8.8 ms | 12.3 ms | 27.9 ms | 35.9 Hz | 3.08x |
| H20 96GB HBM3 | PyTorch Eager | 5.33 ms | 30.8 ms | 47.3 ms | 83.4 ms | 12.0 Hz | 1.00x |
|  | torch.compile | 5.33 ms | 31.1 ms | 13.3 ms | 49.7 ms | 20.1 Hz | 1.68x |
|  | TensorRT (Full Pipeline) | 5.33 ms | 14.2 ms | 14.5 ms | 34.0 ms | 29.4 Hz | 2.45x |
| RTX Pro 6000 Blackwell | PyTorch Eager | 4.8 ms | 29.3 ms | 44.0 ms | 78.4 ms | 12.8 Hz | 1.00x |
|  | torch.compile | 4.8 ms | 29.4 ms | 16.5 ms | 50.7 ms | 19.7 Hz | 1.55x |
|  | TensorRT (Full Pipeline) | 4.8 ms | 9.9 ms | 13.2 ms | 27.9 ms | 35.9 Hz | 2.81x |
| RTX Pro 5000 72GB | PyTorch Eager | 8.85 ms | 54.01 ms | 63.19 ms | 126.4 ms | 7.9 Hz | 1.00x |
|  | torch.compile | 8.85 ms | 55.74 ms | 20.38 ms | 84.9 ms | 11.8 Hz | 1.49x |
|  | TensorRT (Full Pipeline) | 8.85 ms | 14.37 ms | 17.33 ms | 40.5 ms | 24.7 Hz | 3.13x |
| L40 | PyTorch Eager | 6.6 ms | 42.8 ms | 78.9 ms | 128.3 ms | 7.8 Hz | 1.00x |
|  | torch.compile | 6.6 ms | 42.7 ms | 19.8 ms | 69.0 ms | 14.5 Hz | 1.86x |
|  | TensorRT (Full Pipeline) | 6.6 ms | 13.1 ms | 18.8 ms | 38.4 ms | 26.0 Hz | 3.34x |
| L20 | PyTorch Eager | 5.7 ms | 47.58 ms | 86.92 ms | 140.3 ms | 7.1 Hz | 1.00x |
|  | torch.compile | 5.7 ms | 47.2 ms | 20.18 ms | 73.1 ms | 13.7 Hz | 1.92x |
|  | TensorRT (Full Pipeline) | 5.7 ms | 17.27 ms | 19.79 ms | 42.8 ms | 23.3 Hz | 3.28x |
| Jetson / Spark |  |  |  |  |  |  |  |
| DGX Spark | PyTorch Eager | 13.14 ms | 38.22 ms | 74.94 ms | 126.4 ms | 7.9 Hz | 1.00x |
|  | torch.compile | 13.14 ms | 39.23 ms | 56.49 ms | 108.8 ms | 9.2 Hz | 1.16x |
|  | TensorRT (Full Pipeline) | 13.14 ms | 33.43 ms | 52.37 ms | 98.6 ms | 10.1 Hz | 1.28x |
| AGX Thor | PyTorch Eager | 8.21 ms | 55.26 ms | 81.65 ms | 144.9 ms | 6.9 Hz | 1.00x |
|  | torch.compile | 8.21 ms | 55.59 ms | 64.66 ms | 128.4 ms | 7.8 Hz | 1.13x |
|  | TensorRT (Full Pipeline) | 8.21 ms | 28.89 ms | 56.64 ms | 93.8 ms | 10.7 Hz | 1.54x |
| Orin | PyTorch Eager | 9.45 ms | 127.6 ms | 205.39 ms | 342.8 ms | 2.9 Hz | 1.00x |
|  | torch.compile | 9.45 ms | 128.59 ms | 78.94 ms | 217.0 ms | 4.6 Hz | 1.58x |
|  | TensorRT (DiT-only) | 9.45 ms | 128.38 ms | 78.6 ms | 216.5 ms | 4.6 Hz | 1.58x |

Note: Orin uses DiT-only TensorRT (--inference-mode tensorrt) because TRT 10.3 does not support the backbone engine. All other platforms use the full pipeline (--inference-mode trt_full_pipeline).

Raw benchmark output (H100 80GB HBM3)
Hardware: NVIDIA H100 80GB HBM3
Model: checkpoints/GR00T-N1.7-LIBERO/libero_10
1 camera, Denoising Steps: 4

PyTorch Eager:
  E2E:             85.8 ms (11.7 Hz)
  Data Processing: 6.2 ms | Backbone: 31.3 ms | Action Head: 48.2 ms

torch.compile:
  E2E:             48.6 ms (20.6 Hz), 1.77x speedup
  Data Processing: 6.2 ms | Backbone: 30.4 ms | Action Head: 12.0 ms

TensorRT (Full Pipeline):
  E2E:             27.9 ms (35.9 Hz), 3.08x speedup
  Data Processing: 6.2 ms | Backbone: 8.8 ms  | Action Head: 12.3 ms

Standalone Inference with TRT

The standalone inference script serves both as an accuracy check and as a reference for deploying TRT inference in your own code. It runs per-step inference on real trajectories and compares action predictions:

uv run python scripts/deployment/standalone_inference_script.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 1 2 3 4 \
  --inference-mode trt_full_pipeline \
  --trt-engine-path ./gr00t_trt_deployment/engines \
  --save-plot-path ./output/trt_inference.png

Expected accuracy: MSE/MAE match PyTorch within noise. TRT produces identical action quality. Speedup varies by platform — run build_trt_pipeline.py --steps benchmark on your hardware for exact numbers.
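
If you dump the predicted actions from a PyTorch run and a TRT run yourself (the .npy file names below are hypothetical, saved from your own wrapper), the reported comparison reduces to a few lines:

```python
import numpy as np

# Hypothetical dumps of predicted actions over the same trajectory,
# e.g. arrays of shape (num_steps, action_horizon, action_dim).
pt_actions = np.load("actions_pytorch.npy")
trt_actions = np.load("actions_trt.npy")

print(f"MSE={np.mean((pt_actions - trt_actions) ** 2):.3e}")
print(f"MAE={np.mean(np.abs(pt_actions - trt_actions)):.3e}")
```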

Optional: LIBERO Closed-Loop Sim Evaluation

To validate TRT accuracy in end-to-end robotic tasks, run the LIBERO closed-loop evaluation. This requires a separate environment setup (~10-30 min, MuJoCo simulator + dependencies).

Setup, commands, and results (H100, 20 episodes)

Task: KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it, 20 episodes:

| Mode | Success Rate |
| --- | --- |
| PyTorch | 100% (20/20) |
| TRT (n17_full_pipeline) | 95% (19/20) |

Difference is within simulation noise (p >> 0.05).
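
As a quick check on that claim (a sketch, not part of the repo), a two-sided Fisher's exact test on 20/20 vs 19/20 gives a p-value of 1.0:

```python
from scipy.stats import fisher_exact

# Success/failure counts over 20 episodes each: PyTorch 20/0, TRT 19/1.
_, p_value = fisher_exact([[20, 0], [19, 1]])
print(p_value)  # 1.0 -> no detectable difference at this sample size
```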

Note: Use --n-envs 1 for TRT evaluation (ViT engine has static shapes for single-observation inference).

# One-time LIBERO setup (~10 min)
bash gr00t/eval/sim/LIBERO/setup_libero.sh

# Activate LIBERO venv and install additional deps
source gr00t/eval/sim/LIBERO/libero_uv/.venv/bin/activate
uv pip install diffusers transformers accelerate safetensors torchcodec

# TRT full pipeline evaluation
python gr00t/eval/rollout_policy.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --env-name "libero_sim/KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it" \
  --n-episodes 20 --n-envs 1 --max-episode-steps 504 \
  --trt-engine-path ./gr00t_trt_deployment/engines \
  --trt-mode n17_full_pipeline

Run python scripts/deployment/build_trt_pipeline.py --steps benchmark to generate benchmarks for your hardware.


Platform-Specific Setup

Jetson and Spark platforms use different dependency stacks than dGPU. Thor and Spark use CUDA 13 with PyTorch 2.10.0 from the Jetson AI Lab cu130 index. Orin uses CUDA 12.6 with PyTorch 2.10.0 from the Jetson AI Lab cu126 index.
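
A quick way to confirm which stack is active in a given environment (a minimal check, not a repo script):

```python
import torch

# PyTorch version and the CUDA version the wheel was built against.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0))
```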

Jetson Thor Setup

Thor uses CUDA 13 and Python 3.12, which require a different dependency stack than x86 or Orin. Tested with JetPack 7.1. There are two ways to run on Thor: Docker (recommended) or bare metal.

Docker (Recommended)

Build the Thor container from the repo root:

cd docker && bash build.sh --profile=thor && cd ..

Download the finetuned model (run once, on the host):

uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO

Start an interactive Docker session (recommended for multi-step TRT work):

docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-thor \
  bash

Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):

python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA

Bare Metal

# One-time install (temporarily copies the Thor pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, and builds torchcodec from source against the
# system FFmpeg runtime)
bash scripts/deployment/thor/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_thor.sh

Then run the TRT pipeline or PyTorch inference as shown in the TensorRT Acceleration and Quick Start sections above. The activation script exports the PyTorch and CUDA library/include paths that torchcodec and torch.compile need on Thor.


DGX Spark Setup

Spark uses CUDA 13 and Python 3.12 like Thor, but requires a dedicated dependency stack and source-built flash-attn for sm121. There are two ways to run on Spark: Docker (recommended) or bare metal.

Docker (Recommended)

Build the Spark container from the repo root:

cd docker && bash build.sh --profile=spark && cd ..

Download the finetuned model (run once, on the host):

uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO

Start an interactive Docker session (recommended for multi-step TRT work):

docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-spark \
  bash

Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):

python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA

Bare Metal

# One-time install (temporarily copies the Spark pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, source-builds flash-attn for sm121, and builds
# torchcodec from source against the system FFmpeg runtime)
bash scripts/deployment/spark/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_spark.sh

Then run the TRT pipeline or PyTorch inference as shown in the TensorRT Acceleration and Quick Start sections above. If you later rerun uv sync, rerun bash scripts/deployment/spark/install_deps.sh so the Spark-specific flash-attn build is restored and revalidated.


Jetson Orin Setup

Note: On Orin, only the DiT (action head) TRT export is currently supported. Use --export-mode dit_only instead of full_pipeline. Full pipeline support is in progress.

Orin uses CUDA 12.6 and Python 3.10, which require a different dependency stack than x86 or Thor. Tested with JetPack 6.2. There are two ways to run on Orin: Docker (recommended) or bare metal.

Docker (Recommended)

Build the Orin container from the repo root:

cd docker && bash build.sh --profile=orin && cd ..

Download the finetuned model (run once, on the host):

uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO

Start an interactive Docker session (recommended for multi-step TRT work):

docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-orin \
  bash

Then inside the container, run the TRT pipeline (DiT-only on Orin):

python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --export-mode dit_only

Bare Metal

# One-time install (temporarily copies the Orin pyproject.toml and uv.lock to repo root,
# installs uv, Python deps, and builds torchcodec from source against JetPack's FFmpeg
# runtime)
bash scripts/deployment/orin/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_orin.sh

Then run the TRT pipeline (with --export-mode dit_only) or PyTorch inference as shown in the TensorRT Acceleration and Quick Start sections above. The activation script exports the PyTorch and CUDA library/include paths that torchcodec and torch.compile need on Orin.

Orin storage tip: If your eMMC root is low on space, redirect the HuggingFace cache to an NVMe SSD with export HF_HOME=/path/to/ssd/.cache/huggingface before downloading models.

Orin TRT limitations: TRT 10.3 on Orin does not support the backbone (LLM) engine — the build step will report a failure for llm_bf16.engine and that is expected. The remaining 6 engines build successfully. Use --export-mode action_head for verification and --inference-mode tensorrt (DiT-only TRT, backbone runs in PyTorch) for inference:

python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --export-mode action_head \
  --steps verify

python scripts/deployment/standalone_inference_script.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 \
  --inference-mode tensorrt \
  --trt-engine-path ./gr00t_n1d7_engines

Command-Line Arguments

build_trt_pipeline.py

| Argument | Default | Description |
| --- | --- | --- |
| --model-path | (required) | Path to model checkpoint |
| --dataset-path | demo_data/libero_demo | Path to dataset (LeRobot format) |
| --embodiment-tag | Auto-detected | Embodiment tag (auto-detected from processor_config.json if single embodiment) |
| --output-dir | ./gr00t_trt_deployment | Root output directory. ONNX → <output-dir>/onnx/, engines → <output-dir>/engines/ |
| --precision | bf16 | Precision for ONNX export and TRT engine build (bf16, fp16, fp32) |
| --batch-size | 1 | Batch size baked into exported ONNX/TRT models (static; see the batch size note in Build TRT Engines) |
| --export-mode | full_pipeline | Export mode: dit_only, action_head, or full_pipeline |
| --video-backend | torchcodec | Video backend for dataset loading |
| --workspace | 8192 | TRT builder workspace size in MB |
| --num-iterations | 20 | Number of benchmark iterations |
| --warmup | 5 | Number of warmup iterations |
| --skip-compile | false | Skip the torch.compile benchmark |
| --steps | all | Steps to run: all or a comma-separated subset of export,build,verify,benchmark |
| --log-file | <output-dir>/pipeline.log | Log file path |

standalone_inference_script.py

| Argument | Default | Description |
| --- | --- | --- |
| --model-path | (required) | Path to model checkpoint |
| --dataset-path | demo_data/droid_sample | Path to dataset (LeRobot format) |
| --embodiment-tag | Auto-detected | Robot embodiment tag |
| --traj-ids | [0] | Episode indices to evaluate (space-separated) |
| --steps | 200 | Max steps per trajectory (capped by actual length) |
| --action-horizon | 16 | Action prediction horizon |
| --inference-mode | pytorch | pytorch, tensorrt (DiT-only TRT), or trt_full_pipeline (all engines) |
| --trt-engine-path | ./gr00t_n1d7_engines | Directory containing pre-built TRT engines |
| --denoising-steps | 4 | Diffusion denoising iterations |
| --save-plot-path | None | Save per-trajectory GT-vs-predicted comparison plots |
| --video-backend | torchcodec | Video decoder: torchcodec, decord, or torchvision_av |
| --skip-timing-steps | 1 | Initial steps excluded from timing stats (warmup) |
| --host / --port | 127.0.0.1 / 5555 | Server address (when using client mode without --model-path) |
| --seed | 42 | Random seed for reproducibility |

Files

| File | Description |
| --- | --- |
| build_trt_pipeline.py | Unified pipeline: export ONNX, build engines, verify, benchmark |
| standalone_inference_script.py | Main inference script (PyTorch, DiT-only TRT, and full-pipeline TRT) |
| trt_torch.py | TRT Engine wrapper class (load, bind, execute) |
| trt_model_forward.py | TRT forward functions and setup (backbone + action head) |

Troubleshooting

Engine Build Fails

  • Ensure you have enough GPU memory (16GB+ recommended for the full pipeline); a memory check sketch follows this list
  • Try reducing workspace size: --workspace 4096
  • Ensure TensorRT version matches your CUDA version
  • LLM engine requires batch_size dimension handling when using custom shape profiles
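
A minimal way to check free GPU memory before a build (a sketch, not a repo utility):

```python
import torch

# Free vs. total device memory in GiB; the full-pipeline build wants 16 GB+ free.
free, total = torch.cuda.mem_get_info()
print(f"free={free / 2**30:.1f} GiB / total={total / 2**30:.1f} GiB")
```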

ONNX Export Issues

  • If export fails with a COMPLEX128 error: ensure _simple_causal_mask is used (not HuggingFace's create_causal_mask); a minimal sketch of such a mask follows this list
  • If masked_scatter size assertion fails: ensure visual_pos_masks has the correct number of True values matching deepstack tensor size
  • Check that the dataset path is valid and contains at least one trajectory
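
For reference, an additive causal mask of the kind the first bullet refers to can be built from plain float ops like this (an illustrative sketch, not necessarily the repo's _simple_causal_mask):

```python
import torch

def simple_causal_mask(seq_len: int, dtype=torch.bfloat16, device="cuda"):
    # 0 on and below the diagonal, large negative above it; no complex dtypes,
    # so ONNX export stays within types TensorRT can consume.
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype, device=device)
    return torch.triu(mask, diagonal=1)
```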

Accuracy Issues

  • If cosine < 0.99: check that the LLM export does NOT include the final RMSNorm (the backbone returns pre-norm hidden_states[-1]); see the comparison sketch after this list
  • If output magnitude is ~12x too small: this is the norm bug — see above
  • Run build_trt_pipeline.py --steps verify --export-mode action_head first to isolate backbone vs action head drift
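
To localize drift, it can also help to dump the backbone features from a PyTorch run and a TRT run for the same input and compare them directly (file names below are hypothetical):

```python
import torch
import torch.nn.functional as F

# Hypothetical dumps of backbone output features for the same observation.
h_pt = torch.load("backbone_features_pytorch.pt").float().flatten()
h_trt = torch.load("backbone_features_trt.pt").float().flatten()

cos = F.cosine_similarity(h_pt, h_trt, dim=0).item()
ratio = (h_trt.norm() / h_pt.norm()).item()
print(f"cosine={cos:.4f}  norm ratio (TRT/PyTorch)={ratio:.2f}")
# cosine >= 0.999 is expected; a norm ratio far from 1.0 points at the
# extra final-RMSNorm issue described above.
```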