[Model Runner V2] Introduce num_tokens_for_attn #36815

WoosukKwon wants to merge 1 commit into `main`
Conversation
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Hi @WoosukKwon, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request introduces a new field, num_tokens_for_attn, to BatchExecutionDescriptor and InputBatch to allow for a more precise specification of the number of tokens used in attention mechanisms, particularly for CUDA graphs. The changes are consistently applied across cudagraph_utils, dp_utils, model_runner, and model_states to propagate and utilize this new field. The implementation appears correct and well-integrated.
Code Review
This pull request introduces a new field, num_tokens_for_attn, to BatchExecutionDescriptor and InputBatch. This field is used to specify the exact number of tokens that should be processed by the attention mechanism, which can differ from the total number of tokens in a batch, particularly when using CUDA graphs for decode operations. The changes are consistently propagated through cudagraph_utils.py, dp_utils.py, input_batch.py, model_runner.py, and model_states/default.py. This refactoring centralizes the logic for determining the attention token count and simplifies the prepare_attn function. The implementation appears correct and robust, with no high or critical issues found.
njhill left a comment:
we probably wanna rename these:
`num_actual_tokens <= num_attn_tokens <= num_input_tokens`
agree and let's document the above relationship explicitly in comment too :)
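The proposed relationship could be documented and enforced with a small assertion helper. This is only a sketch; the function and argument names are illustrative, not from the PR:

```python
def check_token_counts(num_actual_tokens: int,
                       num_attn_tokens: int,
                       num_input_tokens: int) -> None:
    """Sanity-check the invariant proposed in the review:

        num_actual_tokens <= num_attn_tokens <= num_input_tokens

    num_actual_tokens: real (unpadded) tokens in the batch.
    num_attn_tokens:   tokens the attention kernels must process.
    num_input_tokens:  padded count matching the captured CUDA graph size.
    """
    assert num_actual_tokens <= num_attn_tokens <= num_input_tokens, (
        f"token-count invariant violated: {num_actual_tokens=} "
        f"{num_attn_tokens=} {num_input_tokens=}"
    )
```

Making the invariant an assertion (rather than only a comment) would also give CI a cheap way to catch regressions.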
```python
# i.e. no request padding is needed
# so we leave it as None
```
save a line?
```suggestion
# i.e. no request padding is needed, so we leave it as None
```
```python
num_tokens: int
num_tokens_for_attn: int | None
num_reqs: int | None  # None means no request padding is needed (PIECEWISE graphs)
uniform_token_count: int | None = None
```
I think we should add more documentation to these fields, including the meaning of `None` for the other ones too.
```python
if batch_desc.num_tokens_for_attn is not None:
    num_tokens_for_attn = batch_desc.num_tokens_for_attn
else:
    num_tokens_for_attn = num_tokens
```
could simplify
```suggestion
num_tokens_for_attn = batch_desc.num_tokens_for_attn or num_tokens
```
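Worth noting that the `or` one-liner matches the explicit `is not None` check only as long as `num_tokens_for_attn` is never `0`, since `or` falls through on any falsy value. A quick standalone comparison (variable names taken from the diff, functions are illustrative):

```python
def explicit(num_tokens_for_attn, num_tokens):
    # Original form: falls back only when the value is None.
    if num_tokens_for_attn is not None:
        return num_tokens_for_attn
    return num_tokens


def with_or(num_tokens_for_attn, num_tokens):
    # Suggested one-liner: also falls back when the value is 0 (falsy).
    return num_tokens_for_attn or num_tokens


print(explicit(0, 128))    # -> 0
print(with_or(0, 128))     # -> 128  (differs when the value is 0)
print(with_or(None, 128))  # -> 128
```

If a zero attention-token count can never occur in practice, the shorter form is safe; otherwise the explicit check is the correct one.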
Also would be good to make sure the CI tests cover this.
No description provided.