
Conversation

@xyang16 (Contributor) commented Dec 13, 2025

Purpose

This PR fixes a NaN issue with fused_moe_lora. Currently, running test_gptoss_tp.py fails. We found that NaN values in the tensor cause triton_kernels to compute a wrong gather_indx.

  • The attention output tensor is created as output = torch.empty(output_shape, dtype=output_dtype, device=query.device) here, which was changed from output = torch.zeros(output_shape, dtype=output_dtype, device=query.device) by remove attn output view kernel #26680. torch.empty() allocates uninitialized memory and may contain NaNs. Then torch.ops.vllm.unified_attention_with_output writes output[:num_actual_tokens] here, so only the first num_actual_tokens rows of the output tensor are filled. We observed a case where the output shape is [4, 64, 64], num_actual_tokens is 3, and the last row contains NaNs.
  • triton_kernels computes a wrong gather_indx when the last few rows contain NaNs. Below is an example:
  1. attn_output has NaN values in the last row:
attn_output: tensor([[ 4.8633e-01, -6.0938e-01, -1.0156e+00,  ..., -4.5166e-02,
          7.8516e-01,  5.1562e-01],
        [ 2.5391e-01,  5.9375e-01,  1.5820e-01,  ...,  1.0059e-01,
          5.8203e-01,  5.0391e-01],
        [ 3.1836e-01,  5.8203e-01,  1.2354e-01,  ...,  1.1816e-01,
          5.2734e-01,  4.1797e-01],
        [ 4.9181e+37, -4.4336e-01,  1.3559e-31,  ...,  6.2988e-02,
          nan,  2.7930e-01]], device='cuda:0', dtype=torch.bfloat16)
  2. After o_proj, the whole row becomes NaNs:
output: tensor([[-0.1729,  0.9141, -0.0781,  ...,  0.3047,  0.0059, -0.2129],
        [-0.5938,  0.6445, -0.1147,  ...,  0.8242, -0.0527, -0.1787],
        [-0.5664,  0.6953, -0.1157,  ...,  0.8086, -0.1084, -0.1602],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],
       device='cuda:0', dtype=torch.bfloat16)
  3. Since the last row of hidden_states is all NaNs, after select_experts the last row of topk_ids is all 0s:
topk_ids: tensor([[18,  4, 29,  0],
        [27, 29,  1,  4],
        [27, 29,  1,  4],
        [ 0,  0,  0,  0]], device='cuda:0', dtype=torch.int32)
topk_weights: tensor([[ 5.0797e-01,  2.3439e-01,  1.4161e-01,  1.1603e-01],
        [ 3.3487e-01,  2.9094e-01,  2.3378e-01,  1.4041e-01],
        [ 3.3448e-01,  2.9060e-01,  2.3078e-01,  1.4414e-01],
        [        nan, -1.0000e+04, -1.0000e+04, -1.0000e+04]], device='cuda:0')

Note: This step also exposes a problem in fused_topk: torch.topk is able to generate unique topk_ids [1, 0, 2, 3] for an all-NaN tensor, so it would be better for fused_topk to also handle NaN inputs and generate unique topk_ids instead of [0, 0, 0, 0]. That would avoid the problems in the later steps (see the sketch after this walkthrough).

>>> router_logits = torch.tensor([float("nan"), float("nan"), float("nan"), float("nan")], device="cuda", dtype=torch.bfloat16)
>>> topk_weights, topk_ids = torch.topk(router_logits, k=4, dim=-1)
>>> topk_ids
tensor([1, 0, 2, 3], device='cuda:0')
>>> topk_weights
tensor([nan, nan, nan, nan], device='cuda:0', dtype=torch.bfloat16)
  4. After pack_bitmatrix, the last row of the bitmatrix becomes 1. So it looks like it expects topk_ids to contain unique values in each row, rather than [0, 0, 0, 0].
bitmatrix: Bitmatrix(storage=Storage(data=tensor([[537133073],
        [671088658],
        [671088658],
        [        1]], device='cuda:0', dtype=torch.uint32)
  5. After routing_from_bitmatrix, it generates an incorrect gather_indx. gather_indx should contain a row index for all 16 rows, but row indices 0, 4, 8 are replaced with -1. This means rows 0, 4, 8 will be missing in the first matmul_ogs (this impacts the non-LoRA path as well).
gather_indx: GatherIndx(src_indx=tensor([12, 13, 14, 15,  2,  5,  9,  1,  7, 11,  3,  6, 10, -1, -1, -1],
       device='cuda:0', dtype=torch.int32), dst_indx=tensor([ 1,  7,  4, 10,  2,  5, 11,  8,  3,  6, 12,  9,  0,  1,  2,  3],
       device='cuda:0', dtype=torch.int32))
  6. After the first matmul_ogs, the rows are reordered by gather_indx, and the first row in intermediate_cache1 also has NaN values.
intermediate_cache1: tensor([[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [-0.7617, -0.7852,  0.4492,  ..., -1.8281, -0.8789, -1.2812],
        [-0.8398, -0.5508,  0.2412,  ..., -1.5703, -0.5234,  0.4922],
        ...,
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],
       device='cuda:0', dtype=torch.bfloat16)
  7. Then in _fused_moe_lora_expand, b_intermediate_cache1 is added to output (output is intermediate_cache1 from the previous step). b_intermediate_cache1 has NaNs in the last few rows because hidden_states is not reordered, while output has NaNs in the first and last rows because it is reordered. So the wrong rows are added together, and output now has even more NaN rows.
output: tensor([[[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.7578, -0.7891,  0.4277,  ..., -1.8516, -0.8867, -1.2891],
         [-0.8398, -0.5547,  0.2305,  ..., -1.5625, -0.5312,  0.4980],
         [-2.0938, -0.5078, -0.8984,  ..., -1.1484,  0.1650, -0.5898]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.8828, -1.5391, -0.4688,  ...,  1.2812,  0.2773, -3.1875],
         [-0.3867, -1.6953, -2.0781,  ..., -1.3750,  0.2236, -1.9688],
         [-0.4688, -0.9531, -0.5352,  ..., -1.4531, -1.0391,  0.0371]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.9688, -1.5000, -0.4297,  ...,  1.2344,  0.2715, -3.2031],
         [-0.3945, -1.7031, -2.0625,  ..., -1.3672,  0.2100, -1.9531],
         [-0.4570, -0.9648, -0.5430,  ..., -1.4609, -1.0547,  0.0251]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]]],
       device='cuda:0', dtype=torch.bfloat16)
  8. After the moe_sum reduction, the whole tensor becomes NaNs:
output: tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       dtype=torch.bfloat16)
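
To make the chain above easier to reproduce outside vLLM, here is a minimal, hedged sketch (illustrative shapes and random weights, not vLLM internals) of how one uninitialized attention row propagates through the projection and router into an all-NaN top-k row. torch.topk still returns unique (if meaningless) indices for that row; it is the custom fused_topk kernel that collapses to [0, 0, 0, 0], and one possible hardening would be to sanitize NaN logits (e.g. with torch.nan_to_num) before selection:

import torch

torch.manual_seed(0)
num_actual_tokens, hidden_size, num_experts, topk = 3, 64, 32, 4

# Simulate the attention output buffer: allocated uninitialized, and only the
# first num_actual_tokens rows are written by the kernel.
attn_output = torch.empty(4, hidden_size)
attn_output[:num_actual_tokens] = torch.randn(num_actual_tokens, hidden_size)
attn_output[num_actual_tokens:] = float("nan")  # force the failure mode observed above

o_proj = torch.randn(hidden_size, hidden_size)
hidden_states = attn_output @ o_proj            # step 2: the whole last row becomes NaN

router = torch.randn(hidden_size, num_experts)
router_logits = hidden_states @ router          # NaN keeps propagating
probs = torch.softmax(router_logits, dim=-1)    # step 3: last row is all NaN

topk_weights, topk_ids = torch.topk(probs, k=topk, dim=-1)
print(topk_weights[-1])  # all NaN
print(topk_ids[-1])      # unique but arbitrary with torch.topk; fused_topk returned [0, 0, 0, 0]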

So this PR fixes the NaN issues by:

  • Filling the rows of the attention output beyond num_actual_tokens (output[num_actual_tokens:]) with 0 to avoid NaNs in the tensor (a minimal sketch follows this list).
    • An alternative is to fix this in fused_topk, but I prefer filling output[num_actual_tokens:] with 0, because NaN values in the tensor might cause unexpected behavior elsewhere too. Please let me know what you think.
  • Reordering intermediate_cache1 back to make sure the LoRA weights are added to the correct rows here.
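
The zero-fill part of the fix boils down to the pattern below. This is a hedged, self-contained sketch with made-up shapes rather than the actual backend change, but it shows why zeroing the padded rows keeps NaNs from leaking downstream:

import torch

num_actual_tokens = 3
# Allocated like the attention output buffer: uninitialized, so the padded
# rows may contain garbage or NaN.
output = torch.empty(4, 64, 64, dtype=torch.bfloat16)
# ... the attention kernel writes only output[:num_actual_tokens] ...
output[:num_actual_tokens] = 0.1  # stand-in for real attention results
# The fix: zero the rows past num_actual_tokens so downstream ops never see
# uninitialized values.
output[num_actual_tokens:].fill_(0)
assert torch.isfinite(output).all()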

Note: This PR doesn't address the NaN caused by FULL_AND_PIECEWISE cudagraph mode, see #29539 (comment), so cudagraph_mode needs to be set to PIECEWISE in addition to this PR to make it work.

Test Plan

  • test_modular_oai_triton_moe.py
pytest -s -v tests/kernels/moe/test_modular_oai_triton_moe.py

Tests passed.

  • test_gptoss_tp.py

Run test_gptoss_tp.py with the following modification:

llm = vllm.LLM(
    MODEL_PATH,
    max_model_len=1024,
    enable_lora=True,
    max_loras=4,
    max_lora_rank=8,
    compilation_config=vllm.config.CompilationConfig(  # Avoid OOM
        cudagraph_mode=vllm.config.compilation.CUDAGraphMode.PIECEWISE,
        cudagraph_specialize_lora=False,
    ),
)
pytest -s -v tests/lora/test_gptoss_tp.py

Main:

Generated text: 'SELECT AVG!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
Generated text: 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
Generated text: 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

PR:

Generated text: 'SELECT AVG(Working_Horses) FROM farm WHERE Total_Horses > 5000;'
Generated text: 'SELECT MAX(Cows) AS Max_Cows, MIN(Cows) AS Min_Cows FROM farm;'
Generated text: 'SELECT MAX(Cows) AS Max_Cows, MIN(Cows) AS Min_Cows FROM farm;'

Garbage output fixed. Tests passed.

Accuracy Testing

Marlin:

VLLM_MXFP4_USE_MARLIN=1 vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gpqa/checkpoint-13 \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low
Writing report to /tmp/gpqa_lora1-low_temp1.0_20251214_195839.html
{'chars': np.float64(21.839015151515152), 'chars:std': np.float64(141.08266424169597), 'score': np.float64(0.586489898989899), 'score:std': np.float64(0.4924626862745207)}
Writing results to /tmp/gpqa_lora1-low_temp1.0_20251214_195839.json
Writing all results to /tmp/gpqa_lora1-low_temp1.0_20251214_195839_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lora1-low_temp1.0_20251214_195839', 'metric': 0.586489898989899}]

Triton:

vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gpqa/checkpoint-13 \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low
Writing report to /tmp/gpqa_lora1-low_temp1.0_20251214_201352.html
{'chars': np.float64(25.067550505050505), 'chars:std': np.float64(149.54616903920677), 'score': np.float64(0.5883838383838383), 'score:std': np.float64(0.4921263019922218)}
Writing results to /tmp/gpqa_lora1-low_temp1.0_20251214_201352.json
Writing all results to /tmp/gpqa_lora1-low_temp1.0_20251214_201352_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lora1-low_temp1.0_20251214_201352', 'metric': 0.5883838383838383}]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

cc @robertgshaw2-redhat @jeejeelee

mergify bot added the v1 label on Dec 13, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a critical bug where NaN values could appear in the attention output. The root cause is that the output tensor, allocated with torch.empty(), was not fully initialized, and subsequent attention operations only filled a portion of it up to num_actual_tokens. The added line output[num_actual_tokens:].fill_(0) effectively zeros out the remaining uninitialized part of the tensor, preventing any garbage values or NaNs from propagating. This is a robust and necessary fix. This same pattern of not zeroing out the padded portion of the output tensor may exist in other attention backends, and it would be beneficial to audit them for similar issues to ensure consistent behavior across the system.
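
Purely to illustrate the audit suggestion above, a hypothetical debug-time check (not an existing vLLM utility; the name and signature are invented for this example) could flag backends that leave padded output rows uninitialized:

import torch

def assert_padded_rows_finite(output: torch.Tensor, num_actual_tokens: int) -> None:
    """Hypothetical helper: fail fast if padded output rows contain NaN/Inf."""
    padded = output[num_actual_tokens:]
    if padded.numel() > 0 and not torch.isfinite(padded).all():
        raise AssertionError("attention output padding contains NaN/Inf")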

@dcmaddix (Contributor)

Great find, thanks a lot @xyang16! cc: @jeejeelee, @robertgshaw2-redhat, @varun-sundar-rabindranath

xyang16 changed the title from [Bugfix] Fix NaN issue in attention output to [Bugfix] Fix NaN issue for Triton FusedMoE LoRA on Dec 13, 2025
mergify bot added the gpt-oss (Related to GPT-OSS models) label on Dec 14, 2025
@mergify bot commented Dec 14, 2025

Hi @xyang16, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@robertgshaw2-redhat (Collaborator)

@jeejeelee - lmk if this looks okay to you

@bbrowning (Contributor)

I was able to reproduce this error on my A5500 hardware by running pytest -sv tests/lora/test_gptoss_tp.py; 2 of the tests failed because they generated 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

However, instead of applying the fix here, I applied the fix from #30650 and after that all of these tests passed. So, these two PRs and code paths are at least related if not directly intertwined.

@bbrowning (Contributor)

I believe there are two separate things in play here. The change to vllm/v1/attention/backends/flash_attn.py here looks directly related to #30650, and we probably either need to zero these in all backends or use the fix from 30650 to ensure we're exercising the custom_all_reduce path during compile/warmup.

With that said, I tried testing just the changes in this PR on an H100 that's using FLASH_ATTN and Triton MXFP4 kernels and am still seeing the infinite generation:

E           AssertionError: assert False                                                                                                                                                                            
E            +  where False = <built-in method startswith of str object at 0x7f70c96bf630>('SELECT AVG(Working_Horses) FROM farm WHERE Total_Horses > 5000;')
E            +    where <built-in method startswith of str object at 0x7f70c96bf630> = 'SELECT AVG(!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'.startswith

Logs from the failed test showing FLASH_ATTN and Triton MXFP4 backend in use:

(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [gpu_model_runner.py:3562] Starting to load model openai/gpt-oss-20b...
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'TRITON_ATTN')
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [layer.py:372] Enabled separate cuda stream for MoE shared_experts  
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [mxfp4.py:102] [get_mxfp4_backend_with_lora] Using Triton backend
(EngineCore_DP0 pid=220174) (Worker_TP1 pid=220182) INFO 12-15 23:59:56 [mxfp4.py:102] [get_mxfp4_backend_with_lora] Using Triton backend

@xyang16 (Contributor, Author) commented Dec 16, 2025

@bbrowning Thanks for helping investigate this!

I believe there are two separate things in play here. The change to vllm/v1/attention/backends/flash_attn.py here looks directly related to #30650, and we probably either need to zero these in all backends or use the fix from 30650 to ensure we're exercising the custom_all_reduce path during compile/warmup.

Is there any reason, other than skipping custom_all_reduce, that would make output[num_actual_tokens:] contain NaNs?

With that said, I tried testing just the changes in this PR on an H100 that's using FLASH_ATTN and Triton MXFP4 kernels and am still seeing the infinite generation:

Yes, I have a note that says: This PR doesn't address the NaN caused by FULL_AND_PIECEWISE cudagraph mode, see #29539 (comment), so need to set cudagraph_mode to PIECEWISE + this PR to make it work.

llm = vllm.LLM(
    MODEL_PATH,
    max_model_len=1024,
    enable_lora=True,
    max_loras=4,
    max_lora_rank=8,
    compilation_config=vllm.config.CompilationConfig(  # Avoid OOM
        cudagraph_mode=vllm.config.compilation.CUDAGraphMode.PIECEWISE,
        cudagraph_specialize_lora=False,
    ),
)

xyang16 force-pushed the nan branch 2 times, most recently from 6234130 to 6287d37 on December 17, 2025 at 18:23