nvfp4_gemm produces incorrect numerical results on SM120 (Blackwell consumer)

### Title: nvfp4_gemm produces incorrect numerical results on SM120 (Blackwell consumer)

**GPU**: RTX 5060 Ti (SM 120, Blackwell, 16 GB)  
**TRT-LLM**: main@7021547 (2025-05-15)  
**CUDA**: 13.2 / Driver 595.58.03  
**Model**: Qwen3-8B-NVFP4 (ModelOpt-quantized)  
**Reproducer**: https://github.com/QilinWan/TensorRT-LLM/tree/feat/blackwell-sm120-nvfp4-fallback

---

### Summary

`torch.ops.trtllm.nvfp4_gemm` produces numerically incorrect results on SM120 (Blackwell consumer GPUs). All backends (CUTLASS, cuBLASLt, CuteDSL) are affected identically.

### Evidence

Comparing nvfp4_gemm output to a BF16 dequant reference on the same layer:

| Metric | BF16 reference | nvfp4_gemm | Verdict |
|--------|---------------|------------|---------|
| Correlation | 1.000 | **0.113** | Near-zero correlation |
| Relative error | 0% | **99%** | Complete mismatch |
| Output range | -6.4 ~ +6.3 | -3.2 ~ +4.7 | Wrong magnitude |

Correlation and relative error were **identical** across all three backends (`cutlass`, `cublaslt`, `cutedsl`), which indicates this is not a regression in any single kernel.
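
For readers who want to reproduce the two metrics in the table, the following is a minimal, dependency-free sketch. It assumes "correlation" means the Pearson correlation over the flattened outputs and "relative error" means the L2 norm of the difference divided by the L2 norm of the reference; the issue does not state the exact definitions, and the function names here are illustrative rather than taken from the diagnostic script.

```python
import math


def pearson_correlation(ref, out):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(ref)
    if n == 0 or n != len(out):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_ref = sum(ref) / n
    mean_out = sum(out) / n
    cov = sum((r - mean_ref) * (o - mean_out) for r, o in zip(ref, out))
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    var_out = sum((o - mean_out) ** 2 for o in out)
    denom = math.sqrt(var_ref * var_out)
    # A constant input has zero variance, so the correlation is undefined.
    return cov / denom if denom > 0 else float("nan")


def relative_error(ref, out):
    """L2 norm of (out - ref) divided by the L2 norm of ref (assumed definition)."""
    if len(ref) != len(out):
        raise ValueError("inputs must be of equal length")
    diff_norm = math.sqrt(sum((o - r) ** 2 for r, o in zip(ref, out)))
    ref_norm = math.sqrt(sum(r * r for r in ref))
    return diff_norm / ref_norm if ref_norm > 0 else float("inf")


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0, 4.0]
    candidate = [-1.0, -2.0, -3.0, -4.0]
    print(pearson_correlation(reference, candidate))
    print(relative_error(reference, candidate))
```

With PyTorch tensors, the inputs could be obtained by flattening each output, for example `y.float().flatten().tolist()`.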

### Verified not the issue

1. **Weight dequant is correct** — E2M1 lookup + scale expansion verified numerically: range -0.54~+0.54, mean 0.000002, no NaN/Inf.
2. **Input fp4_quantize is correct** — `torch.ops.trtllm.fp4_quantize` produces valid FP4 tensors.
3. **SM120 architecture is compiled** — `cuobjdump` confirms sm_120 cubins in all relevant `.so` files.
4. **Fixed dtype mismatch** — `weight_scale` dtype is `float8_e4m3fn` in safetensors; nvfp4_gemm expects `uint8`. Conversion path exists in model loading but the kernel itself still produces wrong results.
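
The "E2M1 lookup + scale expansion" check in item 1 and the `uint8` view of the `float8_e4m3fn` scales in item 4 can be illustrated with a small pure-Python sketch. It is not the code from the branch. It assumes the usual NVFP4 conventions: 16 elements per scale block, two FP4 codes per byte with the even-indexed element in the low nibble, and an optional per-tensor scale (the `global_scale` parameter is a hypothetical name). All of these should be checked against the actual checkpoint layout.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of FP4 elements sharing one FP8 scale


def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code (0..15) to a float via table lookup."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e4m3fn(byte):
    """Decode one float8_e4m3fn value from its raw uint8 bit pattern."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # e4m3fn has no infinities, only this NaN pattern
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_row(packed, scale_bytes, global_scale=1.0):
    """Expand packed FP4 codes and apply one block scale per 16 elements."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))  # assumed: low nibble first
        values.append(decode_e2m1(byte >> 4))
    if len(values) != len(scale_bytes) * BLOCK_SIZE:
        raise ValueError("expected one scale byte per 16 FP4 elements")
    return [
        v * decode_e4m3fn(scale_bytes[i // BLOCK_SIZE]) * global_scale
        for i, v in enumerate(values)
    ]


if __name__ == "__main__":
    # 8 bytes -> 16 FP4 values; scale byte 0x40 decodes to 2.0.
    print(dequantize_row(bytes([0x21] * 8), bytes([0x40])))
```

Decoding the scale from its raw byte makes the dtype question explicit: the same 8 bits are either a `uint8` integer or an E4M3 float, so a reinterpreting view is needed rather than a value cast.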

### Impact

Model initializes and generates without crashes, but output is degenerate (single-token repetition: `[d。。。。22222...]` or `[甲方<<<...]`). The model cannot produce coherent text when using nvfp4_gemm on SM120.

### Workaround

An architecture-aware fallback is implemented at the Python level:
- Detect `torch.cuda.get_device_capability() == (12, 0)`
- Dequantize NVFP4 weights to BF16 on-the-fly using E2M1 lookup + scale expansion
- Use `torch.matmul` instead of nvfp4_gemm
- Disable Fp4QuantizedTensor activation wrapping to avoid NaN in activation dequant

Branch: `feat/blackwell-sm120-nvfp4-fallback`  
Performance: ~0.6 tok/s (Python dequant bottleneck, acceptable for validation)  
Output: diverse tokens (real Chinese + English words), but still semi-gibberish due to accumulated numerical error
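
The detection step of the workaround can be sketched as follows. This is an illustrative outline rather than the code in the branch: the function names and the returned path labels are hypothetical, and only `torch.cuda.get_device_capability()` is a real PyTorch call. The capability can be passed in explicitly so the dispatch logic can be exercised without a GPU.

```python
SM120 = (12, 0)  # compute capability reported by Blackwell consumer GPUs


def use_bf16_fallback(capability):
    """Return True when the device should bypass nvfp4_gemm."""
    return tuple(capability) == SM120


def current_capability():
    """Query the active CUDA device; torch is imported lazily on purpose."""
    import torch

    return torch.cuda.get_device_capability()


def select_linear_path(capability=None):
    """Pick a GEMM path label (hypothetical names) for the current device."""
    if capability is None:
        capability = current_capability()
    if use_bf16_fallback(capability):
        # Dequantize NVFP4 weights to BF16 and call torch.matmul instead.
        return "bf16_dequant_matmul"
    return "nvfp4_gemm"


if __name__ == "__main__":
    print(select_linear_path((12, 0)))
    print(select_linear_path((10, 0)))
```

Matching the exact `(12, 0)` tuple, rather than testing `major >= 12`, keeps the fallback scoped to the one architecture where the mismatch was observed.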

### Discussion

This appears to be a kernel-level SM120 compatibility issue. The expected fix in PR #4821 (TRT-LLM v1.4) should resolve it. This issue provides real-hardware validation data for that fix.

### Full Diagnostic Script

See `/tmp/test_precise_gemm.py` in the feature branch for the step-by-step layer-level diagnostic that produced the correlation data above.


nvfp4_gemm produces incorrect numerical results on SM120 (Blackwell consumer) #14154
