
perf(processor): reduce GPU-CPU sync in action tokenization #3123

Open

jashshah999 wants to merge 1 commit into huggingface:main from jashshah999:perf/batch-tokenizer-gpu-sync

Conversation

@jashshah999
Contributor

What this does

Reduces GPU-CPU synchronization overhead in ActionTokenizerProcessorStep._tokenize_action(), which is called every training step for tokenizer-based policies (SmolVLA, xVLA).

The problem

The tokenization loop was calling action[i:i+1].cpu() and tokens.to(device) per sample, creating 2 * batch_size GPU-CPU synchronization points per training step. For a typical batch_size=64, that's 128 sync stalls per step.
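As a hypothetical sketch of the pattern described above (names and the tokenizer callable are illustrative stand-ins, not LeRobot's actual API), each loop iteration issues its own device transfer in both directions:

```python
import torch

def tokenize_actions_per_sample(action, tokenizer, device):
    # Sketch of the problematic pattern: every iteration forces a
    # GPU->CPU transfer (.cpu()) and a CPU->GPU transfer (.to(device)),
    # i.e. 2 * batch_size potential sync points per call.
    token_rows = []
    for i in range(action.shape[0]):
        sample = action[i : i + 1].cpu()      # sync point 1, per sample
        tokens = tokenizer(sample)            # CPU-side tokenization
        token_rows.append(tokens.to(device))  # sync point 2, per sample
    return torch.stack(token_rows)
```

Each `.cpu()` call can stall the host until the GPU has produced that slice, which is why the cost scales with batch size rather than with the amount of data moved.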

Additionally, constant token sequences (bos_token_id, encode("Action: "), encode("|")) were recomputed every iteration.

This is likely a contributing factor to issue #1488 (SmolVLA training much slower than ACT).

Changes

  • Move entire action batch to CPU in one transfer before the loop (1 sync instead of batch_size)
  • Build all token tensors on CPU, stack, then transfer to GPU once at the end (1 sync instead of batch_size)
  • Cache constant prefix/suffix token sequences in __post_init__
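The three changes above can be sketched together as follows. This is a minimal illustration, not LeRobot's actual class: `encode` stands in for the tokenizer (string in, token ids out), and the cached prefix in the real PR also includes `bos_token_id`.

```python
import torch

class BatchedActionTokenizer:
    """Illustrative sketch of the batched scheme; names do not match LeRobot's API."""

    def __init__(self, encode):
        # `encode` is a stand-in tokenizer: str -> list[int].
        self.encode = encode
        # Cache constant token sequences once (the PR does this in
        # __post_init__; the real prefix also starts with bos_token_id).
        self.prefix = encode("Action: ")
        self.suffix = encode("|")

    def tokenize(self, action: torch.Tensor) -> torch.Tensor:
        device = action.device
        action_cpu = action.cpu()  # one GPU->CPU transfer for the whole batch
        rows = []
        for sample in action_cpu:  # pure CPU work, no further device syncs
            body = self.encode(" ".join(f"{v:.3f}" for v in sample.tolist()))
            rows.append(torch.tensor(self.prefix + body + self.suffix))
        # One CPU->GPU transfer at the end.
        return torch.stack(rows).to(device)
```

The loop body still runs per sample, but it touches only CPU memory; the device boundary is crossed exactly twice per call regardless of batch size.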

Net effect: GPU-CPU sync reduced from O(batch_size) to O(1) per call.

Related: #1488


Labels

processor Issue related to processor
