nvfp4_gemm produces incorrect numerical results on SM120 (Blackwell consumer) #14154

@QilinWan


GPU: RTX 5060 Ti (SM 120, Blackwell, 16 GB)
TRT-LLM: main@7021547 (2025-05-15)
CUDA: 13.2 / Driver 595.58.03
Model: Qwen3-8B-NVFP4 (ModelOpt-quantized)
Reproducer: https://github.com/QilinWan/TensorRT-LLM/tree/feat/blackwell-sm120-nvfp4-fallback


Summary

torch.ops.trtllm.nvfp4_gemm produces numerically incorrect results on SM120 (Blackwell consumer GPUs). All backends (CUTLASS, cuBLASLt, CuteDSL) are affected identically.

Evidence

Comparing nvfp4_gemm output to a BF16 dequant reference on the same layer:

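| Metric         | BF16 reference | nvfp4_gemm  | Verdict               |
|----------------|----------------|-------------|-----------------------|
| Correlation    | 1.000          | 0.113       | Near-zero correlation |
| Relative error | 0%             | 99%         | Complete mismatch     |
| Output range   | -6.4 ~ +6.3    | -3.2 ~ +4.7 | Wrong magnitude       |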

Correlation and relative error were identical across all backends (cutlass, cublaslt, cutedsl), confirming it is not a single-kernel regression.
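
For readers who want to reproduce this kind of comparison, here is a minimal, framework-free sketch of the two metrics. It is not the script that produced the table. The helper names `pearson_correlation` and `relative_error` are hypothetical, and since the issue does not say how relative error was computed, the sketch assumes the ratio of L2 norms `||out - ref|| / ||ref||`.

```python
import math


def pearson_correlation(ref, out):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(ref)
    if n != len(out) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_ref = sum(ref) / n
    mean_out = sum(out) / n
    cov = sum((r - mean_ref) * (o - mean_out) for r, o in zip(ref, out))
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    var_out = sum((o - mean_out) ** 2 for o in out)
    denom = math.sqrt(var_ref * var_out)
    if denom == 0.0:
        return float("nan")  # undefined when either input is constant
    return cov / denom


def relative_error(ref, out):
    """Assumed definition: L2 norm of the difference over L2 norm of the reference."""
    if len(ref) != len(out):
        raise ValueError("inputs must have the same length")
    diff_norm = math.sqrt(sum((o - r) ** 2 for r, o in zip(ref, out)))
    ref_norm = math.sqrt(sum(r * r for r in ref))
    if ref_norm == 0.0:
        return float("nan")  # undefined for an all-zero reference
    return diff_norm / ref_norm
```

With real tensors, both outputs would first be flattened to 1-D and converted to float32, for example via `tensor.float().flatten().tolist()`.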

Verified not the issue

  1. Weight dequant is correct: E2M1 lookup + scale expansion verified numerically (range -0.54~+0.54, mean 0.000002, no NaN/Inf).
  2. Input fp4_quantize is correct: torch.ops.trtllm.fp4_quantize produces valid FP4 tensors.
  3. SM120 architecture is compiled: cuobjdump confirms sm_120 cubins in all relevant .so files.
  4. Fixed dtype mismatch: weight_scale dtype is float8_e4m3fn in safetensors; nvfp4_gemm expects uint8. A conversion path exists in model loading, but the kernel itself still produces wrong results.
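
The "E2M1 lookup + scale expansion" in item 1 can be illustrated with a small pure-Python sketch. This is not the code from the branch. It assumes two FP4 values packed per byte with the first element in the low nibble, one scale per block of 16 elements, block scales already decoded to Python floats, and an optional per-tensor scale that is multiplied in. The names `decode_e2m1`, `unpack_fp4`, `dequantize_row`, and `global_scale` are hypothetical.

```python
# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size: one scale per 16 elements


def decode_e2m1(nibble):
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4(packed):
    """Unpack bytes into FP4 values, low nibble first (assumed packing order)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_row(packed, block_scales, global_scale=1.0):
    """Expand each block scale over its 16 elements and apply it."""
    values = unpack_fp4(packed)
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one scale per block of 16 elements")
    return [
        v * block_scales[i // BLOCK_SIZE] * global_scale
        for i, v in enumerate(values)
    ]
```

For example, `dequantize_row(bytes([0x21] * 8), [2.0])` decodes one block of alternating 0.5 and 1.0 codes and scales it by 2.0.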

Impact

Model initializes and generates without crashes, but output is degenerate (single-token repetition: [d。。。。22222...] or [甲方<<<...]). The model cannot produce coherent text when using nvfp4_gemm on SM120.

Workaround

An architecture-aware fallback is implemented at the Python level:

  • Detect torch.cuda.get_device_capability() == (12, 0)
  • Dequantize NVFP4 weights to BF16 on-the-fly using E2M1 lookup + scale expansion
  • Use torch.matmul instead of nvfp4_gemm
  • Disable Fp4QuantizedTensor activation wrapping to avoid NaN in activation dequant

Branch: feat/blackwell-sm120-nvfp4-fallback
Performance: ~0.6 tok/s (Python dequant bottleneck, acceptable for validation)
Output: diverse tokens (real Chinese + English words), semi-gibberish due to numerical accumulation
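
The detection step of the workaround can be sketched as a small dispatch helper. This is an illustration under stated assumptions, not the code on the branch: `select_gemm_path` and the two path labels are hypothetical names, and the compute capability is passed in as a tuple so the helper can be exercised without a GPU.

```python
SM120 = (12, 0)  # compute capability reported by Blackwell consumer GPUs


def select_gemm_path(capability):
    """Return which GEMM path to use for a (major, minor) compute capability.

    Hypothetical helper: on SM120 the caller would dequantize the weights to
    BF16 and call torch.matmul; elsewhere it would keep using nvfp4_gemm.
    """
    if tuple(capability) == SM120:
        return "bf16_fallback"
    return "nvfp4_gemm"


# In a real integration the capability would come from PyTorch, e.g.:
#   import torch
#   path = select_gemm_path(torch.cuda.get_device_capability())
```

Matching the exact tuple rather than testing `major >= 12` keeps the fallback scoped to the one architecture where the wrong results were observed.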

Discussion

This appears to be a kernel-level SM120 compatibility issue. The expected fix in PR #4821 (TRT-LLM v1.4) should resolve it. This issue provides real-hardware validation data for that fix.

Full Diagnostic Script

See /tmp/test_precise_gemm.py in the feature branch for the step-by-step layer-level diagnostic that produced the correlation data above.

Labels: Customized kernels, bug, waiting for feedback
