Title: nvfp4_gemm produces incorrect numerical results on SM120 (Blackwell consumer)
GPU: RTX 5060 Ti (SM 120, Blackwell, 16 GB)
TRT-LLM: main@7021547 (2025-05-15)
CUDA: 13.2 / Driver 595.58.03
Model: Qwen3-8B-NVFP4 (ModelOpt-quantized)
Reproducer: https://github.com/QilinWan/TensorRT-LLM/tree/feat/blackwell-sm120-nvfp4-fallback
Summary
torch.ops.trtllm.nvfp4_gemm produces numerically incorrect results on SM120 (Blackwell consumer GPUs). All backends (CUTLASS, cuBLASLt, CuteDSL) are affected identically.
Evidence
Comparing nvfp4_gemm output to a BF16 dequant reference on the same layer:
| Metric | BF16 reference | nvfp4_gemm | Verdict |
| --- | --- | --- | --- |
| Correlation | 1.000 | 0.113 | Near-zero correlation |
| Relative error | 0% | 99% | Complete mismatch |
| Output range | -6.4 ~ +6.3 | -3.2 ~ +4.7 | Wrong magnitude |
Correlation and relative error were identical across all backends (cutlass, cublaslt, cutedsl), confirming it is not a single-kernel regression.
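For readers who want to reproduce the comparison, the two scalar metrics could be computed roughly as follows. This is a minimal sketch in plain Python on flattened outputs; the issue does not state the exact formulas used, so Pearson correlation and an L2-norm relative error are assumptions, and the function names are hypothetical.

```python
import math


def pearson_correlation(ref, out):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_out = sum(out) / n
    cov = sum((r - mean_ref) * (o - mean_out) for r, o in zip(ref, out))
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    var_out = sum((o - mean_out) ** 2 for o in out)
    return cov / math.sqrt(var_ref * var_out)


def relative_error(ref, out):
    """Assumed definition: ||out - ref||_2 / ||ref||_2."""
    diff = math.sqrt(sum((o - r) ** 2 for r, o in zip(ref, out)))
    norm = math.sqrt(sum(r ** 2 for r in ref))
    return diff / norm
```

With torch tensors, the same values could be fed in via `tensor.flatten().float().tolist()`. A correct kernel should give a correlation close to 1 and a small relative error against the BF16 reference.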
Verified not the issue
- Weight dequant is correct: E2M1 lookup + scale expansion verified numerically (range -0.54 ~ +0.54, mean 0.000002, no NaN/Inf).
- Input fp4_quantize is correct: torch.ops.trtllm.fp4_quantize produces valid FP4 tensors.
- SM120 architecture is compiled: cuobjdump confirms sm_120 cubins in all relevant .so files.
- Fixed dtype mismatch: weight_scale dtype is float8_e4m3fn in safetensors; nvfp4_gemm expects uint8. A conversion path exists in model loading, but the kernel itself still produces wrong results.
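The reference dequantization in the first and last checks above can be illustrated with a small bit-level sketch. It is not the code from the feature branch. It assumes two FP4 values per byte with the low nibble first, one E4M3 scale per block of 16 elements stored in plain row-major order, and a hypothetical `global_scale` multiplier standing in for any per-tensor scale. The uint8 scale bytes are treated as the raw bit pattern of the float8_e4m3fn values, which is also an assumption about what the kernel expects.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive elements


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code to a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e4m3fn(byte):
    """Decode the raw bit pattern of a float8_e4m3fn value (bias 7, no Inf)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN encoding in e4m3fn
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_row(packed, scale_bytes, global_scale=1.0):
    """Expand packed FP4 bytes and apply the per-block scales."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))  # assumed: low nibble first
        values.append(decode_e2m1(byte >> 4))
    return [
        v * decode_e4m3fn(scale_bytes[i // BLOCK_SIZE]) * global_scale
        for i, v in enumerate(values)
    ]
```

Range, mean, and NaN/Inf checks like the ones quoted in the list can then be run directly on the returned values.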
Impact
Model initializes and generates without crashes, but output is degenerate (single-token repetition: [d。。。。22222...] or [甲方<<<...]). The model cannot produce coherent text when using nvfp4_gemm on SM120.
Workaround
An architecture-aware fallback is implemented at the Python level:
- Detect torch.cuda.get_device_capability() == (12, 0)
- Dequantize NVFP4 weights to BF16 on-the-fly using E2M1 lookup + scale expansion
- Use torch.matmul instead of nvfp4_gemm
- Disable Fp4QuantizedTensor activation wrapping to avoid NaN in activation dequant
Branch: feat/blackwell-sm120-nvfp4-fallback
Performance: ~0.6 tok/s (Python dequant bottleneck, acceptable for validation)
Output: diverse tokens (real Chinese and English words), but still semi-gibberish, likely due to accumulated numerical error
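The dispatch logic of the fallback can be sketched as below. This is an illustration of the steps listed above, not the code in the branch: `needs_bf16_fallback`, `fallback_linear`, `nvfp4_linear`, and the `dequantize_weight` / `quantized_gemm` callables are hypothetical names, and the weight is assumed to have shape [out_features, in_features]. torch is imported lazily so the selection logic can be exercised without a GPU.

```python
SM120_CAPABILITY = (12, 0)  # value reported for the RTX 5060 Ti in this issue


def needs_bf16_fallback(capability):
    """True when the device capability matches the affected SM120 target."""
    return tuple(capability) == SM120_CAPABILITY


def fallback_linear(x, dequantize_weight):
    """BF16 path: dequantize the weight on the fly and use torch.matmul."""
    import torch

    # dequantize_weight is a hypothetical callable returning a dense tensor
    # of assumed shape [out_features, in_features].
    weight_bf16 = dequantize_weight().to(torch.bfloat16)
    return torch.matmul(x.to(torch.bfloat16), weight_bf16.t())


def nvfp4_linear(x, dequantize_weight, quantized_gemm, capability=None):
    """Route to the BF16 fallback on SM120, else to the quantized GEMM."""
    if capability is None:
        import torch

        capability = torch.cuda.get_device_capability()
    if needs_bf16_fallback(capability):
        return fallback_linear(x, dequantize_weight)
    # quantized_gemm is a hypothetical callable wrapping the nvfp4_gemm call.
    return quantized_gemm(x)
```

Keeping the capability check in one predicate makes it straightforward to remove the fallback once a fixed kernel is available.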
Discussion
This appears to be a kernel-level SM120 compatibility issue. The expected fix in PR #4821 (TRT-LLM v1.4) should resolve it. This issue provides real-hardware validation data for that fix.
Full Diagnostic Script
See /tmp/test_precise_gemm.py in the feature branch for the step-by-step layer-level diagnostic that produced the correlation data above.