
Conversation

@xyang16 (Contributor) commented Dec 13, 2025

Purpose

This PR fixes a NaN issue with fused_moe_lora. Currently, running test_gptoss_tp.py fails. We found that NaN values in the tensor cause triton_kernels to compute a wrong gather_indx.

  • The attention output tensor is created as output = torch.empty(output_shape, dtype=output_dtype, device=query.device) here, which was changed from output = torch.zeros(output_shape, dtype=output_dtype, device=query.device) by remove attn output view kernel #26680. torch.empty() allocates uninitialized memory and may contain NaNs. Then torch.ops.vllm.unified_attention_with_output writes output[:num_actual_tokens] here, so only the first num_actual_tokens rows of the output tensor are filled. We observed a case where the output shape is [4, 64, 64], num_actual_tokens is 3, and the last row contains NaNs.
  • triton_kernels computes a wrong gather_indx when the last few rows contain NaNs. Below is an example:
  1. attn_output has NaN values in the last row:
attn_output: tensor([[ 4.8633e-01, -6.0938e-01, -1.0156e+00,  ..., -4.5166e-02,
          7.8516e-01,  5.1562e-01],
        [ 2.5391e-01,  5.9375e-01,  1.5820e-01,  ...,  1.0059e-01,
          5.8203e-01,  5.0391e-01],
        [ 3.1836e-01,  5.8203e-01,  1.2354e-01,  ...,  1.1816e-01,
          5.2734e-01,  4.1797e-01],
        [ 4.9181e+37, -4.4336e-01,  1.3559e-31,  ...,  6.2988e-02,
          nan,  2.7930e-01]], device='cuda:0', dtype=torch.bfloat16)
  2. After o_proj, the whole row becomes NaNs:
output: tensor([[-0.1729,  0.9141, -0.0781,  ...,  0.3047,  0.0059, -0.2129],
        [-0.5938,  0.6445, -0.1147,  ...,  0.8242, -0.0527, -0.1787],
        [-0.5664,  0.6953, -0.1157,  ...,  0.8086, -0.1084, -0.1602],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],
       device='cuda:0', dtype=torch.bfloat16)
  3. Since the last row of hidden_states is all NaNs, after select_experts the last row of topk_ids is all 0s:
topk_ids: tensor([[18,  4, 29,  0],
        [27, 29,  1,  4],
        [27, 29,  1,  4],
        [ 0,  0,  0,  0]], device='cuda:0', dtype=torch.int32)
topk_weights: tensor([[ 5.0797e-01,  2.3439e-01,  1.4161e-01,  1.1603e-01],
        [ 3.3487e-01,  2.9094e-01,  2.3378e-01,  1.4041e-01],
        [ 3.3448e-01,  2.9060e-01,  2.3078e-01,  1.4414e-01],
        [        nan, -1.0000e+04, -1.0000e+04, -1.0000e+04]], device='cuda:0')

Note: This step also exposes a problem in fused_topk: torch.topk is able to generate unique topk_ids [1, 0, 2, 3] for an all-NaN tensor, so it would be better for fused_topk to also handle NaN inputs and generate unique topk_ids instead of [0, 0, 0, 0]. That would avoid the problems in the later steps (see the sketch after this walkthrough).

>>> router_logits = torch.tensor([float("nan"), float("nan"), float("nan"), float("nan")], device="cuda", dtype=torch.bfloat16)
>>> topk_weights, topk_ids = torch.topk(router_logits, k=4, dim=-1)
>>> topk_ids
tensor([1, 0, 2, 3], device='cuda:0')
>>> topk_weights
tensor([nan, nan, nan, nan], device='cuda:0', dtype=torch.bfloat16)
  4. After pack_bitmatrix, the last row of the bitmatrix becomes 1. So it looks like it expects topk_ids to contain unique values in each row, rather than [0, 0, 0, 0].
bitmatrix: Bitmatrix(storage=Storage(data=tensor([[537133073],
        [671088658],
        [671088658],
        [        1]], device='cuda:0', dtype=torch.uint32)
  5. After routing_from_bitmatrix, it generates an incorrect gather_indx. gather_indx should contain a row index for all 16 rows, but row indices 0, 4, 8 are replaced with -1. This means rows 0, 4, 8 will be missing in the first matmul_ogs (this impacts the non-LoRA path as well).
gather_indx: GatherIndx(src_indx=tensor([12, 13, 14, 15,  2,  5,  9,  1,  7, 11,  3,  6, 10, -1, -1, -1],
       device='cuda:0', dtype=torch.int32), dst_indx=tensor([ 1,  7,  4, 10,  2,  5, 11,  8,  3,  6, 12,  9,  0,  1,  2,  3],
       device='cuda:0', dtype=torch.int32))
  6. After the first matmul_ogs, the rows are reordered by gather_indx, and the first row in intermediate_cache1 also has NaN values.
intermediate_cache1: tensor([[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [-0.7617, -0.7852,  0.4492,  ..., -1.8281, -0.8789, -1.2812],
        [-0.8398, -0.5508,  0.2412,  ..., -1.5703, -0.5234,  0.4922],
        ...,
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],
       device='cuda:0', dtype=torch.bfloat16)
  7. Then in _fused_moe_lora_expand, b_intermediate_cache1 is added to output (output is intermediate_cache1 from the previous step). b_intermediate_cache1 has NaNs in the last few rows because hidden_states is not reordered, while output has NaNs in the first and last rows because it is reordered. So the wrong rows are added together, and output now has even more NaN rows.
output: tensor([[[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.7578, -0.7891,  0.4277,  ..., -1.8516, -0.8867, -1.2891],
         [-0.8398, -0.5547,  0.2305,  ..., -1.5625, -0.5312,  0.4980],
         [-2.0938, -0.5078, -0.8984,  ..., -1.1484,  0.1650, -0.5898]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.8828, -1.5391, -0.4688,  ...,  1.2812,  0.2773, -3.1875],
         [-0.3867, -1.6953, -2.0781,  ..., -1.3750,  0.2236, -1.9688],
         [-0.4688, -0.9531, -0.5352,  ..., -1.4531, -1.0391,  0.0371]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [-0.9688, -1.5000, -0.4297,  ...,  1.2344,  0.2715, -3.2031],
         [-0.3945, -1.7031, -2.0625,  ..., -1.3672,  0.2100, -1.9531],
         [-0.4570, -0.9648, -0.5430,  ..., -1.4609, -1.0547,  0.0251]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]]],
       device='cuda:0', dtype=torch.bfloat16)
  8. After the moe_sum reduction, the whole tensor becomes NaNs:
output: tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       dtype=torch.bfloat16)
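
To make the chain above easier to reproduce outside vLLM, here is a minimal, hedged sketch (illustrative shapes and random weights, not vLLM internals) of how one uninitialized attention row propagates through the projection and router into an all-NaN top-k row. torch.topk still returns unique (if meaningless) indices for that row; it is the custom fused_topk kernel that collapses to [0, 0, 0, 0], and one possible hardening would be to sanitize NaN logits (e.g. with torch.nan_to_num) before selection:

import torch

torch.manual_seed(0)
num_actual_tokens, hidden_size, num_experts, topk = 3, 64, 32, 4

# Simulate the attention output buffer: allocated uninitialized, and only the
# first num_actual_tokens rows are written by the kernel.
attn_output = torch.empty(4, hidden_size)
attn_output[:num_actual_tokens] = torch.randn(num_actual_tokens, hidden_size)
attn_output[num_actual_tokens:] = float("nan")  # force the failure mode observed above

o_proj = torch.randn(hidden_size, hidden_size)
hidden_states = attn_output @ o_proj            # step 2: the whole last row becomes NaN

router = torch.randn(hidden_size, num_experts)
router_logits = hidden_states @ router          # NaN keeps propagating
probs = torch.softmax(router_logits, dim=-1)    # step 3: last row is all NaN

topk_weights, topk_ids = torch.topk(probs, k=topk, dim=-1)
print(topk_weights[-1])  # all NaN
print(topk_ids[-1])      # unique but arbitrary with torch.topk; fused_topk returned [0, 0, 0, 0]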

So this PR fixes the NaN issues by:

  • Filling the rows of the attention output beyond num_actual_tokens (output[num_actual_tokens:]) with 0 to avoid NaNs in the tensor (a minimal sketch follows this list).
    • An alternative is to fix this in fused_topk, but I prefer filling output[num_actual_tokens:] with 0, because NaN values in the tensor might cause unexpected behavior elsewhere too. Please let me know what you think.
  • Reordering intermediate_cache1 back to make sure the LoRA weights are added to the correct rows here.
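
The zero-fill part of the fix boils down to the pattern below. This is a hedged, self-contained sketch with made-up shapes rather than the actual backend change, but it shows why zeroing the padded rows keeps NaNs from leaking downstream:

import torch

num_actual_tokens = 3
# Allocated like the attention output buffer: uninitialized, so the padded
# rows may contain garbage or NaN.
output = torch.empty(4, 64, 64, dtype=torch.bfloat16)
# ... the attention kernel writes only output[:num_actual_tokens] ...
output[:num_actual_tokens] = 0.1  # stand-in for real attention results
# The fix: zero the rows past num_actual_tokens so downstream ops never see
# uninitialized values.
output[num_actual_tokens:].fill_(0)
assert torch.isfinite(output).all()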

Note: This PR doesn't address the NaN caused by FULL_AND_PIECEWISE cudagraph mode, see #29539 (comment), so cudagraph_mode needs to be set to PIECEWISE in addition to this PR to make it work.

Test Plan

  • test_modular_oai_triton_moe.py
pytest -s -v tests/kernels/moe/test_modular_oai_triton_moe.py

Tests passed.

  • test_gptoss_tp.py

Run test_gptoss_tp.py with the following modification:

llm = vllm.LLM(
    MODEL_PATH,
    max_model_len=1024,
    enable_lora=True,
    max_loras=4,
    max_lora_rank=8,
    compilation_config=vllm.config.CompilationConfig(  # Avoid OOM
        cudagraph_mode=vllm.config.compilation.CUDAGraphMode.PIECEWISE,
        cudagraph_specialize_lora=False,
    ),
)
pytest -s -v tests/lora/test_gptoss_tp.py

Main:

Generated text: 'SELECT AVG!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
Generated text: 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
Generated text: 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

PR:

Generated text: 'SELECT AVG(Working_Horses) FROM farm WHERE Total_Horses > 5000;'
Generated text: 'SELECT MAX(Cows) AS Max_Cows, MIN(Cows) AS Min_Cows FROM farm;'
Generated text: 'SELECT MAX(Cows) AS Max_Cows, MIN(Cows) AS Min_Cows FROM farm;'

Garbage output fixed. Tests passed.

Accuracy Testing

Marlin:

VLLM_MXFP4_USE_MARLIN=1 vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gpqa/checkpoint-13 \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low
Writing report to /tmp/gpqa_lora1-low_temp1.0_20251214_195839.html
{'chars': np.float64(21.839015151515152), 'chars:std': np.float64(141.08266424169597), 'score': np.float64(0.586489898989899), 'score:std': np.float64(0.4924626862745207)}
Writing results to /tmp/gpqa_lora1-low_temp1.0_20251214_195839.json
Writing all results to /tmp/gpqa_lora1-low_temp1.0_20251214_195839_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lora1-low_temp1.0_20251214_195839', 'metric': 0.586489898989899}]

Triton:

vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gpqa/checkpoint-13 \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
OPENAI_API_KEY=EMPTY python3 -m gpt_oss.evals --model lora1 --eval gpqa --n-threads 200 --reasoning-effort low
Writing report to /tmp/gpqa_lora1-low_temp1.0_20251214_201352.html
{'chars': np.float64(25.067550505050505), 'chars:std': np.float64(149.54616903920677), 'score': np.float64(0.5883838383838383), 'score:std': np.float64(0.4921263019922218)}
Writing results to /tmp/gpqa_lora1-low_temp1.0_20251214_201352.json
Writing all results to /tmp/gpqa_lora1-low_temp1.0_20251214_201352_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'lora1-low_temp1.0_20251214_201352', 'metric': 0.5883838383838383}]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

cc @robertgshaw2-redhat @jeejeelee

mergify bot added the v1 label on Dec 13, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a critical bug where NaN values could appear in the attention output. The root cause is that the output tensor, allocated with torch.empty(), was not fully initialized, and subsequent attention operations only filled a portion of it up to num_actual_tokens. The added line output[num_actual_tokens:].fill_(0) effectively zeros out the remaining uninitialized part of the tensor, preventing any garbage values or NaNs from propagating. This is a robust and necessary fix. This same pattern of not zeroing out the padded portion of the output tensor may exist in other attention backends, and it would be beneficial to audit them for similar issues to ensure consistent behavior across the system.
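
Purely to illustrate the audit suggestion above, a hypothetical debug-time check (not an existing vLLM utility; the name and signature are invented for this example) could flag backends that leave padded output rows uninitialized:

import torch

def assert_padded_rows_finite(output: torch.Tensor, num_actual_tokens: int) -> None:
    """Hypothetical helper: fail fast if padded output rows contain NaN/Inf."""
    padded = output[num_actual_tokens:]
    if padded.numel() > 0 and not torch.isfinite(padded).all():
        raise AssertionError("attention output padding contains NaN/Inf")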

@dcmaddix (Contributor)

Great find, thanks a lot @xyang16! cc: @jeejeelee, @robertgshaw2-redhat, @varun-sundar-rabindranath

xyang16 changed the title from [Bugfix] Fix NaN issue in attention output to [Bugfix] Fix NaN issue for Triton FusedMoE LoRA on Dec 13, 2025
mergify bot added the gpt-oss (Related to GPT-OSS models) label on Dec 14, 2025
@mergify bot commented Dec 14, 2025

Hi @xyang16, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@robertgshaw2-redhat (Collaborator)

@jeejeelee - lmk if this looks okay to you

@bbrowning (Contributor)

I was able to reproduce this error on my A5500 hardware by running pytest -sv tests/lora/test_gptoss_tp.py; 2 of the tests failed because they generated 'SELECT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

However, instead of applying the fix here, I applied the fix from #30650 and after that all of these tests passed. So, these two PRs and code paths are at least related if not directly intertwined.

@bbrowning (Contributor)

I believe there are two separate things in play here. The change to vllm/v1/attention/backends/flash_attn.py here looks directly related to #30650, and we probably either need to zero these in all backends or use the fix from 30650 to ensure we're exercising the custom_all_reduce path during compile/warmup.

With that said, I tried testing just the changes in this PR on an H100 that's using FLASH_ATTN and Triton MXFP4 kernels and am still seeing the infinite generation:

E           AssertionError: assert False                                                                                                                                                                            
E            +  where False = <built-in method startswith of str object at 0x7f70c96bf630>('SELECT AVG(Working_Horses) FROM farm WHERE Total_Horses > 5000;')
E            +    where <built-in method startswith of str object at 0x7f70c96bf630> = 'SELECT AVG(!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'.startswith

Logs from the failed test showing FLASH_ATTN and Triton MXFP4 backend in use:

(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [gpu_model_runner.py:3562] Starting to load model openai/gpt-oss-20b...
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'TRITON_ATTN')
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [layer.py:372] Enabled separate cuda stream for MoE shared_experts  
(EngineCore_DP0 pid=220174) (Worker_TP0 pid=220180) INFO 12-15 23:59:56 [mxfp4.py:102] [get_mxfp4_backend_with_lora] Using Triton backend
(EngineCore_DP0 pid=220174) (Worker_TP1 pid=220182) INFO 12-15 23:59:56 [mxfp4.py:102] [get_mxfp4_backend_with_lora] Using Triton backend

@xyang16 (Contributor, Author) commented Dec 16, 2025

@bbrowning Thanks for helping investigate this!

I believe there are two separate things in play here. The change to vllm/v1/attention/backends/flash_attn.py here looks directly related to #30650, and we probably either need to zero these in all backends or use the fix from 30650 to ensure we're exercising the custom_all_reduce path during compile/warmup.

Is there any reason, other than skipping custom_all_reduce, that would make output[num_actual_tokens:] contain NaNs?

With that said, I tried testing just the changes in this PR on an H100 that's using FLASH_ATTN and Triton MXFP4 kernels and am still seeing the infinite generation:

Yes, I have a note that says: This PR doesn't address the NaN caused by FULL_AND_PIECEWISE cudagraph mode, see #29539 (comment), so need to set cudagraph_mode to PIECEWISE + this PR to make it work.

llm = vllm.LLM(
    MODEL_PATH,
    max_model_len=1024,
    enable_lora=True,
    max_loras=4,
    max_lora_rank=8,
    compilation_config=vllm.config.CompilationConfig(  # Avoid OOM
        cudagraph_mode=vllm.config.compilation.CUDAGraphMode.PIECEWISE,
        cudagraph_specialize_lora=False,
    ),
)

xyang16 force-pushed the nan branch 2 times, most recently from 6234130 to 6287d37 on December 17, 2025 at 18:23