perf(processor): reduce GPU-CPU sync in action tokenization by jashshah999 · Pull Request #3123 · huggingface/lerobot

jashshah999 · 2026-03-10T07:27:22Z

What this does

Reduces GPU-CPU synchronization overhead in ActionTokenizerProcessorStep._tokenize_action(), which is called every training step for tokenizer-based policies (SmolVLA, xVLA).

The problem

The tokenization loop was calling action[i:i+1].cpu() and tokens.to(device) per sample, creating 2 * batch_size GPU-CPU synchronization points per training step. For a typical batch_size=64, that's 128 sync stalls per step.

Additionally, constant token sequences (bos_token_id, encode("Action: "), encode("|")) were recomputed every iteration.
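The recomputation described above can be avoided by encoding the constant pieces once at construction time, as the caching item under Changes proposes. Below is a minimal sketch of that idea, not the actual lerobot code: `ToyTokenizer` and `CachedPrefixStep` are hypothetical names, and the prefix layout (BOS followed by "Action: ", with "|" as suffix) is taken from the commit message.

```python
from dataclasses import dataclass, field


class ToyTokenizer:
    """Hypothetical tokenizer stand-in that counts encode() calls."""

    bos_token_id = 1

    def __init__(self) -> None:
        self.encode_calls = 0

    def encode(self, text: str) -> list[int]:
        self.encode_calls += 1
        # Toy encoding: one token per character (assumption, for illustration).
        return [ord(c) for c in text]


@dataclass
class CachedPrefixStep:
    """Hypothetical processor step that caches constant token sequences."""

    tokenizer: ToyTokenizer
    _prefix: list[int] = field(init=False, repr=False)
    _suffix: list[int] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Encode the constants once instead of once per sample.
        self._prefix = [self.tokenizer.bos_token_id] + self.tokenizer.encode("Action: ")
        self._suffix = self.tokenizer.encode("|")

    def wrap(self, action_tokens: list[int]) -> list[int]:
        # Reuse the cached prefix/suffix; no tokenizer call happens here.
        return self._prefix + action_tokens + self._suffix
```

With this layout, wrapping 64 samples triggers only the two `encode` calls made in `__post_init__`, rather than two per sample.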

This is likely a contributing factor to issue #1488 (SmolVLA training much slower than ACT).

Changes

- Move entire action batch to CPU in one transfer before the loop (1 sync instead of batch_size)
- Build all token tensors on CPU, stack, then transfer to GPU once at the end (1 sync instead of batch_size)
- Cache constant prefix/suffix token sequences in __post_init__

Net effect: GPU-CPU sync reduced from O(batch_size) to O(1) per call.
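The difference between the two transfer patterns can be sketched as follows. This is an illustration under stated assumptions, not the PR's diff: `toy_encode`, `tokenize_per_sample`, and `tokenize_batched` are hypothetical names, and the toy encoder emits a fixed number of tokens per sample so that `torch.stack` works without padding (a real tokenizer may need padding or truncation first).

```python
import torch


def toy_encode(values: list[float]) -> list[int]:
    """Hypothetical stand-in for the action tokenizer: one token per value."""
    return [int(round(v * 100)) % 1000 for v in values]


def tokenize_per_sample(action: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Old pattern: one device-to-host and one host-to-device copy per sample."""
    out = []
    for i in range(action.shape[0]):
        row = action[i].cpu().tolist()  # device-to-host copy, once per sample
        tokens = torch.tensor(toy_encode(row), dtype=torch.long)
        out.append(tokens.to(device))  # host-to-device copy, once per sample
    return torch.stack(out)


def tokenize_batched(action: torch.Tensor, device: torch.device) -> torch.Tensor:
    """New pattern: one copy in each direction for the whole batch."""
    action_cpu = action.cpu()  # single device-to-host copy
    rows = [
        torch.tensor(toy_encode(row), dtype=torch.long)
        for row in action_cpu.tolist()
    ]
    # Stack on CPU, then move the whole batch to the target device once.
    return torch.stack(rows).to(device)
```

Both functions return the same tokens; only the number of transfers differs. For a batch of 64, the first performs 128 copies and the second performs 2.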

Related: #1488

The _tokenize_action loop was calling action[i].cpu() and then tokens.to(device) per sample, creating 2*batch_size GPU-CPU synchronization points per training step. For batch_size=64, that is 128 sync stalls per step.

Changes:
- Move entire batch to CPU in one transfer before the loop
- Build all token tensors on CPU, stack, then transfer once to GPU
- Cache constant prefix/suffix token sequences (bos + "Action: " and "|") in __post_init__ instead of recomputing every iteration

This reduces GPU-CPU sync from O(batch_size) to O(1) per call.

github-actions bot added the processor Issue related to processor label Mar 10, 2026

imstevenpmwork self-assigned this Mar 10, 2026

imstevenpmwork mentioned this pull request Mar 11, 2026

Dataloader is blazingly fast for ACT training but VERY slow for SmolVLA and Diffusion Policy training #1488 (Open, 2 tasks)

jashshah999 wants to merge 1 commit into huggingface:main from jashshah999:perf/batch-tokenizer-gpu-sync
