
FP_QMM proposal #3128

@Blaizzy

Description


After benchmarking the current activation quantization approach suggested by @awni in #3122, I spent the last couple of days exploring quantized matrix multiplication (QMM). I started with a cuBLAS-based implementation as a baseline, then built a custom Tensor Core kernel that achieves significantly better results.

Key findings

  • AQ (activation quantization) + Tensor Core QMM gets within ~6% of the AQ-only (main branch) prompt throughput.
  • The custom Tensor Core kernel reaches ~50 TFLOPS for quantized matmul, a 2.5x improvement over the ~20 TFLOPS naive approach.
  • The cuBLAS QMM was notably slower, cutting prompt throughput by more than half relative to AQ-only and adding ~2 GB of peak memory.
  • mxfp4 now works: it cuts peak memory to 9.7 GB (vs ~16 GB for mxfp8) and delivers ~21% faster generation than mxfp8.

How it works

Instead of dequantizing the weights to full precision and then running the matmul (two passes over the weights), the kernel dequantizes and accumulates in a single fused pass using Tensor Cores, so the full-precision weights are never materialized in global memory.
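
To make the fusion concrete, here is a minimal CUDA sketch of the idea (not the proposed kernel itself): each warp dequantizes a 16x16 tile of the quantized weight matrix into shared memory and immediately feeds it to a WMMA Tensor Core MMA. The int8-plus-per-column-scale format, the `fused_dequant_wmma` name, and the one-warp-per-block layout are assumptions for illustration; the actual mxfp8/mxfp4 block-scaled decode and tiling would differ.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One WMMA fragment: C[16x16] += A[16x16] * B[16x16] in a single Tensor Core op.
constexpr int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

// Hypothetical fused dequant + matmul: C (MxN, fp32) = A (MxK, fp16, row-major)
//   x dequant(Bq) (KxN, int8 with a per-column fp32 scale, row-major).
// Assumes M, N, K are multiples of 16 and the kernel is launched with
// grid = (N/16, M/16), block = 32 threads (one warp per output tile).
__global__ void fused_dequant_wmma(const half* __restrict__ A,
                                   const int8_t* __restrict__ Bq,
                                   const float* __restrict__ scales,
                                   float* __restrict__ C,
                                   int M, int N, int K) {
    const int tile_m = blockIdx.y;   // 16-row band of C
    const int tile_n = blockIdx.x;   // 16-column band of C
    const int lane   = threadIdx.x;  // 0..31

    // Staging buffer for one dequantized 16x16 tile of B (never written to global).
    __shared__ __align__(32) half b_tile[WMMA_K * WMMA_N];

    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k0 = 0; k0 < K; k0 += WMMA_K) {
        // Dequantize one 16x16 tile of B into shared memory: 256 values over 32 lanes.
        for (int i = lane; i < WMMA_K * WMMA_N; i += 32) {
            const int r = i / WMMA_N, c = i % WMMA_N;
            const int col = tile_n * WMMA_N + c;
            const int8_t q = Bq[(k0 + r) * N + col];
            b_tile[i] = __float2half(static_cast<float>(q) * scales[col]);
        }
        __syncwarp();

        // Feed the A tile (from global) and the just-dequantized B tile (from
        // shared) straight into the Tensor Core MMA.
        wmma::load_matrix_sync(a_frag, A + tile_m * WMMA_M * K + k0, K);
        wmma::load_matrix_sync(b_frag, b_tile, WMMA_N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        __syncwarp();
    }

    wmma::store_matrix_sync(C + tile_m * WMMA_M * N + tile_n * WMMA_N,
                            acc, N, wmma::mem_row_major);
}
```

A production kernel would use larger warp-level tiles, double-buffered shared memory, and the real block-scaled decode, but the structure is the same: the dequantized tile lives only in shared memory and registers between the decode and the MMA.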

Benchmarks

GPU: RTX 6000 Pro Workstation (96GB VRAM)
Model: Ministral-3-14B

mxfp8

| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| AQ only (main branch) | 2,458 | 101 | 16.3 |
| QMM (cuBLAS) | 928 | 87 | 18.3 |
| AQ + QMM (cuBLAS) | 1,002 | 91 | 16.3 |
| QMM (Tensor Core) | 1,518 | 94 | 16.6 |
| AQ + Tensor Core QMM | 2,320 | 101 | 16.3 |

mxfp4

| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| QMM (Tensor Core) | 1,703 | 123 | 9.7 |

If these results look promising, happy to clean this up and send a PR.

I'd love for this to be my first PR to mlx-core.

cc: @awni @zcbenz
