Open
Description
FlashInfer supports nvfp4 and mxfp4 quantization kernels.
flashinfer$ python benchmarks/flashinfer_benchmark.py -R nvfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda :: median time 0.036 ms; std 0.001 ms; achieved tflops 5.597 TFLOPs/sec; achieved tb_per_sec 4.781 TB/sec
flashinfer$ python benchmarks/flashinfer_benchmark.py -R mxfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda :: median time 0.456 ms; std 0.001 ms; achieved tflops 0.442 TFLOPs/sec; achieved tb_per_sec 0.373 TB/sec
Even allowing for the difference in quantization scheme, the mxfp4 quantization kernel is about 12.7x slower than the nvfp4 kernel on the same 8192 x 8192 input (0.456 ms vs. 0.036 ms median time), i.e. an order of magnitude.
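The reported figures can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not the benchmark's own accounting: it assumes a 2-byte (bf16/fp16) input, a packed 4-bit output, and one 1-byte scale factor per block of 16 elements (nvfp4) or 32 elements (mxfp4). The helper names are made up for illustration.

```python
# Rough memory-traffic estimate for the two benchmark runs above.
# Assumptions (not taken from the benchmark source): 2-byte input elements,
# packed 4-bit output, one 1-byte scale per block.

def bytes_moved(m, k, block_size, in_bytes=2):
    """Bytes read plus bytes written for quantizing an m x k matrix."""
    n = m * k
    input_bytes = n * in_bytes      # high-precision input
    output_bytes = n // 2           # two 4-bit values per byte
    scale_bytes = n // block_size   # one 1-byte scale per block
    return input_bytes + output_bytes + scale_bytes

def tb_per_sec(m, k, block_size, median_ms):
    """Achieved bandwidth in TB/s for a given median time in milliseconds."""
    return bytes_moved(m, k, block_size) / (median_ms * 1e-3) / 1e12

M = K = 8192
NVFP4_MS, MXFP4_MS = 0.036, 0.456   # median times reported above

nvfp4_bw = tb_per_sec(M, K, 16, NVFP4_MS)
mxfp4_bw = tb_per_sec(M, K, 32, MXFP4_MS)
slowdown = MXFP4_MS / NVFP4_MS

print(f"nvfp4: {nvfp4_bw:.2f} TB/s, mxfp4: {mxfp4_bw:.2f} TB/s, slowdown: {slowdown:.1f}x")
```

Under these assumptions the estimate lands close to the reported 4.781 and 0.373 TB/sec, which suggests both kernels move nearly the same number of bytes and the gap comes from the kernel itself rather than from extra traffic.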
An updated mxfp4 quantization kernel, ideally written in CuTe DSL as in #2443, is desired.
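A rewritten kernel needs something to be validated against. The following is a minimal pure-Python sketch of per-block mxfp4 quantization as I understand the OCP Microscaling format: blocks of 32 elements, a shared power-of-two scale, and FP4 (E2M1) element values. It is not FlashInfer's implementation; the function name is hypothetical, and the actual kernel may round the scale differently or store scales in a different layout.

```python
import math

# Magnitudes representable in FP4 (E2M1), ordered by their 3-bit code.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2          # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32        # elements sharing one scale in the MX format

def mxfp4_quantize_block(block):
    """Hypothetical reference: returns (shared_exponent, dequantized_values)."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # Any scale represents an all-zero block; 0 is an arbitrary choice here.
        return 0, [0.0] * BLOCK_SIZE

    # floor(log2(amax)) via frexp: amax == m * 2**e with 0.5 <= m < 1.
    shared_exp = math.frexp(amax)[1] - 1 - E2M1_EMAX
    scale = 2.0 ** shared_exp

    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])   # saturate at 6.0
        # Nearest grid point; ties go to the even code (round-to-nearest-even).
        idx = min(range(len(E2M1_GRID)),
                  key=lambda i: (abs(E2M1_GRID[i] - mag), i % 2))
        out.append(math.copysign(E2M1_GRID[idx] * scale, v))
    return shared_exp, out

# Small usage example: a block whose largest magnitude is 6.0.
example = [0.0] * BLOCK_SIZE
example[0], example[1], example[2] = 6.0, -3.0, 0.75
exp, deq = mxfp4_quantize_block(example)
```

A reference like this is far too slow for benchmarking, but comparing its dequantized output with a kernel's output on random blocks is a cheap way to catch scale or rounding mismatches.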
Status: In Progress