[Perf] mxfp4 quantize kernel is slow #2496

@bkryu

Description

FlashInfer supports nvfp4 and mxfp4 quantization kernels.

flashinfer$ python benchmarks/flashinfer_benchmark.py -R nvfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda           :: median time 0.036 ms; std 0.001 ms; achieved tflops 5.597 TFLOPs/sec; achieved tb_per_sec 4.781 TB/sec
flashinfer$ python benchmarks/flashinfer_benchmark.py -R mxfp4_quantize --m 8192 --k 8192 --backends cuda
[PERF] cuda           :: median time 0.456 ms; std 0.001 ms; achieved tflops 0.442 TFLOPs/sec; achieved tb_per_sec 0.373 TB/sec
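For context, the gap between the two benchmark results above can be quantified directly from the reported median times (both runs quantize the same 8192 x 8192 tensor, so the time ratio is the slowdown factor):

```python
# Median kernel times reported by the benchmark runs above (milliseconds).
nvfp4_ms = 0.036
mxfp4_ms = 0.456

# Same problem size for both runs, so the time ratio is a direct slowdown.
slowdown = mxfp4_ms / nvfp4_ms
print(f"mxfp4 quantize is {slowdown:.1f}x slower than nvfp4")  # ~12.7x
```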

Even accounting for the difference in quantization scheme, the mxfp4 quantize kernel is an order of magnitude slower than the nvfp4 one.
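The schemes do differ: mxfp4 (per the OCP Microscaling spec) groups 32 elements per block under a shared power-of-two E8M0 scale, while nvfp4 uses 16-element blocks with FP8 (E4M3) scales. A minimal NumPy sketch of the mxfp4 scheme is below; it is a simplified reference (round-to-nearest onto the E2M1 grid, no FP4 bit packing), not FlashInfer's actual kernel:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes and max value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0

def mxfp4_quantize_ref(x: np.ndarray, block: int = 32):
    """Reference (non-performant) sketch of mxfp4 quantization.

    Each block of 32 values shares one power-of-two (E8M0) scale;
    elements are rounded to the nearest FP4 (E2M1) value.
    Assumes x.size is a multiple of `block`.
    """
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # E8M0 scale exponent: floor(log2(amax)) minus the FP4 max exponent (2),
    # with a tiny floor so all-zero blocks do not produce log2(0).
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2
    scale = 2.0**exp
    # Scale the block, saturate to the FP4 range, round to the nearest
    # representable FP4 value.
    scaled = np.clip(x / scale, -FP4_MAX, FP4_MAX)
    grid = np.concatenate([-FP4_GRID[::-1], FP4_GRID])
    idx = np.abs(scaled[..., None] - grid).argmin(axis=-1)
    return grid[idx], scale
```

A real kernel would additionally fuse the per-block amax reduction, scale computation, and two-values-per-byte FP4 packing into a single pass over the input.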

An update to the mxfp4 quantization kernel, ideally via the CuTe DSL as in #2443, is desired.

Metadata

Status: In Progress