
Conversation

@bbrowning (Contributor) commented Dec 14, 2025

Purpose

Full CUDA graph capture + replay (the default since vLLM v0.11.0) could lead to memory correctness issues in the custom_all_reduce implementation, because previous logic skipped the actual all_reduce operation during the compile and warmup phase. Compile and warmup therefore saw a memory allocation but never observed the custom op call or the actual reduce operation, which could lead to optimizations that assume purely functional code with no side effects, when in fact the reduce operation inherently has side effects.

To fix this, instead of allocating an empty tensor to mimic the reduce operation, we now call the actual all_reduce during compile and warmup as well.
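
For context, here is a minimal sketch of the shape of the change, using simplified stand-in names rather than the literal code in custom_all_reduce.py:

# Minimal illustrative sketch, not the actual vLLM code: it shows the
# control-flow change in the custom all-reduce entry point. The class,
# attribute, and helper names below are simplified stand-ins.
from typing import Optional

import torch


class CustomAllreduceSketch:
    def __init__(self) -> None:
        self.disabled = False
        # True while torch.compile warmup / CUDA graph capture is in progress.
        self._is_capturing = False

    def should_custom_ar(self, inp: torch.Tensor) -> bool:
        # Stand-in for the real eligibility check (size, dtype, world size).
        return not self.disabled

    def all_reduce(self, inp: torch.Tensor, registered: bool) -> torch.Tensor:
        # Stand-in for the real out-of-place custom all-reduce kernel.
        return inp.clone()

    def custom_all_reduce(self, inp: torch.Tensor) -> Optional[torch.Tensor]:
        if self.disabled or not self.should_custom_ar(inp):
            return None
        if self._is_capturing and not torch.cuda.is_current_stream_capturing():
            # Warmup path.
            # Before: return torch.empty_like(inp) -- an allocation only, so
            # compile/warmup never observed the reduce's side effects.
            # After: run the real reduce so warmup, capture, and replay all
            # see the same allocation, copy, and collective behavior.
            return self.all_reduce(inp, registered=False)
        # Graph-capture and eager paths run the real reduce, as before.
        return self.all_reduce(inp, registered=self._is_capturing)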

Multiple users reported an issue that tracks back to this root cause on Ampere hardware, which is one platform where we use custom_all_reduce by default if tensor parallelism is used and NVLink is available.

Examples:
- #26480
- #29998
- #30498

Note that until I rebased this on top of a very recent main, I also had to tag the all_reduce custom op registration in parallel_state.py with torch.Tag.maybe_aliasing_or_mutating to fix the reproducer described in #29998. The tag is no longer required on latest main, although I have not identified exactly what changed there that made it unnecessary.

Test Plan

On Ampere hardware with NVLink enabled (2x A5500 in my case), start gpt-oss-20b with tensor-parallel-size 2:

vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2

Then, execute the curl reproducer from #29998 in a loop:

while true; do curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-20b",
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "Be a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hi"
      },
      {
        "role": "assistant",
        "content": "How can I help you?"
      },
      {
        "role": "user",
        "content": "Do you like Monty Python?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "CHANGE-NAME-BEFORE-SENDING",
          "description": "Use this tool if you need to extract information from a website.",
          "parameters": {
            "type": "object",
            "properties": {
              "url": {
                "type": "string",
                "description": "The URL to search or extract information from."
              }
            },
            "required": ["url"]
          }
        }
      }
    ]
  }'; done

I also exercised the reproducer from #30498, slightly modified to point to gpt-oss-20b (instead of 120b) and to raise max_tokens to 8192, since it was generating valid but verbose output for the first few iterations of the loop.
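
For reference, a rough, hypothetical approximation of that loop (the actual script is attached to #30498; the model name and max_tokens value here reflect my modifications):

# Hypothetical approximation of the #30498 reproducer loop; the real
# script lives in that issue. Points at gpt-oss-20b and raises max_tokens
# to 8192, as described above.
import requests

URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "openai/gpt-oss-20b",
    "max_tokens": 8192,
    "messages": [{"role": "user", "content": "Do you like Monty Python?"}],
}

for i in range(20):
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    print(i, resp.json()["choices"][0]["finish_reason"])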

Test Result

Before this change, the second time the curl loop from #29998 executed, it always hung and vLLM continued to generate token id 0 indefinitely. After this change, the second and all subsequent curl requests succeed as expected.

Before this change, the Python requests loop from #30498 always hung by the second or third request. After this change, it continues to work for many subsequent requests.

I was able to reproduce the test cases from both of those issues on my hardware, and this change fixes both.

@gemini-code-assist bot left a comment

Code Review

This pull request addresses a memory correctness issue in custom_all_reduce when using full CUDA graph capture. The previous implementation skipped the actual all-reduce operation during the warmup phase, leading to inconsistencies between warmup, capture, and replay, which could cause hangs. The fix correctly replaces the dummy tensor allocation with a call to the actual all_reduce operation during warmup. This ensures that the memory allocation, copy, and collective behaviors are consistent across all phases. The change is logical, well-explained, and directly resolves the reported bug. The code is clean and the fix is appropriate.

@bbrowning (Author) commented:

While all user reports of this that I've come across have been on Ampere hardware (A5500, A6000, A100 80GB), I have also been able to reproduce this on H100 GPUs when Triton ATTN is in use. After reproducing the same issue on an H100 (by setting VLLM_ATTENTION_BACKEND=TRITON_ATTN), I can confirm this fixes the issue on Hopper GPUs as well when Triton ATTN is in use.
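
For example, the Hopper reproduction looked roughly like this (the command shape mirrors the Test Plan above; only the environment variable is new):

VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve openai/gpt-oss-20b \
  --tool-call-parser openai \
  --enable-auto-tool-choice \
  --tensor-parallel-size 2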

@robertgshaw2-redhat changed the title from "[Bugfix] Issue with custom_all_reduce and cudagraph_mode FULL" to "[Bugfix] CustomAR + TritonAttn[AMPERE] + gpt-oss" on Dec 14, 2025
@robertgshaw2-redhat changed the title from "[Bugfix] CustomAR + TritonAttn[AMPERE] + gpt-oss" to "[Bugfix] CustomAR + TritonAttn[AMPERE] + FULL_CG - gpt-oss" on Dec 14, 2025
@mergify bot added the gpt-oss (Related to GPT-OSS models) label on Dec 14, 2025
@xyang16 (Contributor) commented Dec 15, 2025

Thanks for the fix! Curious why skipping the all_reduce operation would only cause problems for Triton ATTN on H100?

@bbrowning (Author) replied:

I'm not 100% sure that it doesn't cause problems for other attention backends. However, the reproducers users have reported all had Triton ATTN in common and were all on Ampere hardware (where we use Triton ATTN by default). By manually forcing Triton ATTN on H100 hardware, I was also able to reproduce it there.

There could theoretically be correctness issues with other attention backends in certain scenarios that this fixes, but I have no direct evidence of that myself.

@ApostaC (Collaborator) commented Dec 16, 2025

Hey @mgoin , could you please help review this PR? Thanks!
