

@remi-or remi-or commented Dec 12, 2025

Summary

This PR adds a few optimizations for continuous batching:

  • removes an unneeded torch.cuda.synchronize
  • sorts the inputs to maximize prefix-cache hits
  • moves sampling to the GPU
  • removes an extraneous axis from the output_ids
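
The input-sorting idea in the list above can be sketched in plain Python. This is an illustrative sketch only, not the PR's actual scheduler: the `schedule_order` helper and the request dicts are hypothetical names, and the real implementation operates on the library's internal request objects.

```python
# Hypothetical sketch: order pending requests by their prompt token ids so that
# sequences sharing a prefix are scheduled back to back, which maximizes
# prefix-cache hits (the KV cache computed for one request can be reused by
# the neighbor that shares its prefix).

def schedule_order(batch):
    """Return requests sorted lexicographically by their prompt token ids."""
    return sorted(batch, key=lambda req: req["prompt_ids"])

batch = [
    {"id": "a", "prompt_ids": [1, 2, 9]},
    {"id": "b", "prompt_ids": [1, 2, 3]},
    {"id": "c", "prompt_ids": [7, 7]},
    {"id": "d", "prompt_ids": [1, 2, 3, 4]},
]

# "b" and "d" share the prefix [1, 2, 3]; sorting places them next to each other.
ordered = [req["id"] for req in schedule_order(batch)]
# → ["b", "d", "a", "c"]
```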

Performance

| Attention | Version | Generated tokens | Duration (s) | Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| Flash attention 3 | This PR | 112599 | 16.73 | 6729.27 |
| Flash attention 3 | Main branch | 111823 | 25.67 | 4355.68 |
| Flash attention 2 | This PR | 112822 | 24.61 | 4584.74 |
| Flash attention 2 | Main branch | 112126 | 33.12 | 3385.46 |
| SDPA | This PR | 113254 | 82.49 | 1373.00 |
| SDPA | Main branch | 113725 | 170.84 | 665.67 |
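
The "sampling on the GPU" item above avoids copying logits back to the host on every decode step. A minimal CPU-side sketch of the categorical sampling step itself (softmax over logits, then one draw) is below; this is purely illustrative and not the PR's torch implementation, which would use on-device tensor ops instead.

```python
import math
import random

def sample_token(logits, rng):
    """Draw one token id from the categorical distribution over logits."""
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Categorical draw. On GPU this step would be an on-device sampling op,
    # so the logits never need to be transferred to the host.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
token = sample_token([0.1, 2.0, -1.0], rng)
# With seed 0, the draw lands on the high-probability token, id 1.
```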

Tests

No new failures in the CB tests. The only two failing tests are the same as in #42699, caused by compile not working with gemma2, which seems out of scope and acceptable.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@aymeric-roucher

Hi, just wanted to say this looks like a nice PR, thanks!

@remi-or remi-or requested a review from ArthurZucker December 15, 2025 09:40
@github-actions

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42839&sha=e45e17
