[Perf] mxfp4 quantize kernel is slow #2496

@bkryu

Description

FlashInfer supports nvfp4 and mxfp4 quantization kernels.

flashinfer$ python benchmarks/flashinfer_benchmark.py -R nvfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda           :: median time 0.036 ms; std 0.001 ms; achieved tflops 5.597 TFLOPs/sec; achieved tb_per_sec 4.781 TB/sec
flashinfer$ python benchmarks/flashinfer_benchmark.py -R mxfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda           :: median time 0.456 ms; std 0.001 ms; achieved tflops 0.442 TFLOPs/sec; achieved tb_per_sec 0.373 TB/sec
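For context, the gap between the two benchmark results above can be quantified directly from the reported median times (both runs quantize the same 8192 x 8192 tensor, so the time ratio is the slowdown factor):

```python
# Median kernel times reported by the benchmark runs above (milliseconds).
nvfp4_ms = 0.036
mxfp4_ms = 0.456

# Same problem size for both runs, so the time ratio is a direct slowdown.
slowdown = mxfp4_ms / nvfp4_ms
print(f"mxfp4 quantize is {slowdown:.1f}x slower than nvfp4")  # ~12.7x
```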

Even accounting for the difference in quantization scheme, the mxfp4 quantize kernel is an order of magnitude slower than the nvfp4 one.
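The schemes do differ: mxfp4 (per the OCP Microscaling spec) groups 32 elements per block under a shared power-of-two E8M0 scale, while nvfp4 uses 16-element blocks with FP8 (E4M3) scales. A minimal NumPy sketch of the mxfp4 scheme is below; it is a simplified reference (round-to-nearest onto the E2M1 grid, no FP4 bit packing), not FlashInfer's actual kernel:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes and max value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0

def mxfp4_quantize_ref(x: np.ndarray, block: int = 32):
    """Reference (non-performant) sketch of mxfp4 quantization.

    Each block of 32 values shares one power-of-two (E8M0) scale;
    elements are rounded to the nearest FP4 (E2M1) value.
    Assumes x.size is a multiple of `block`.
    """
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # E8M0 scale exponent: floor(log2(amax)) minus the FP4 max exponent (2),
    # with a tiny floor so all-zero blocks do not produce log2(0).
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2
    scale = 2.0**exp
    # Scale the block, saturate to the FP4 range, round to the nearest
    # representable FP4 value.
    scaled = np.clip(x / scale, -FP4_MAX, FP4_MAX)
    grid = np.concatenate([-FP4_GRID[::-1], FP4_GRID])
    idx = np.abs(scaled[..., None] - grid).argmin(axis=-1)
    return grid[idx], scale
```

A real kernel would additionally fuse the per-block amax reduction, scale computation, and two-values-per-byte FP4 packing into a single pass over the input.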

An update to the mxfp4 quantization kernel, ideally via the CuTe DSL as in #2443, is desired.

Metadata

Status: In Progress