After benchmarking the current activation quantization approach suggested by @awni in #3122, I spent last couple days exploring quantized matrix multiplication (QMM). I started with a cuBLAS-based implementation as a baseline, then built a custom Tensor Core kernel that achieves significantly better results.
Key findings
- AQ (activation quantization) + Tensor Core QMM achieves prompt throughput within 6% of the AQ-only main branch.
- The custom Tensor Core kernel reaches ~50 TFLOPS for quantized matmul, a 2.5x improvement over the ~20 TFLOPS naive approach.
- The cuBLAS QMM was notably slower, halving prompt throughput and adding ~2 GB of peak memory.
- mxfp4 now works: it cuts peak memory nearly in half (9.7 GB) and delivers 21% faster generation than mxfp8.
How it works
Instead of dequantizing the weights to full precision and then running a separate matmul (two passes over the weights), the kernel dequantizes and accumulates in a single fused pass on the Tensor Cores.
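A minimal sketch of the fused idea is below, assuming a simplified int8-weight format with one float scale per 16x16 weight tile (not MLX's actual mxfp4/mxfp8 block layout); the kernel name, tile sizes, and scale layout are illustrative only, not the implementation in this branch.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int TILE = 16;  // WMMA operates on 16x16x16 half-precision tiles

// C[M,N] = A[M,K] (half activations) x dequant(Wq[K,N]) (int8 weights with one
// float scale per 16x16 tile). One warp (32 threads) computes one 16x16 output
// tile; M, N, K are assumed to be multiples of 16.
__global__ void fused_dequant_wmma(const __half* A, const int8_t* Wq,
                                   const float* scales, float* C,
                                   int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row strip of C
    int tileN = blockIdx.x;   // which 16-col strip of C
    int lane  = threadIdx.x;  // lane id within the single warp of this block

    __shared__ __align__(32) __half w_tile[TILE * TILE];  // dequantized weight tile

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Dequantize one 16x16 weight tile into shared memory. This is the
        // "fused" step: weights never round-trip through a full-precision
        // buffer in global memory.
        float s = scales[(k0 / TILE) * (N / TILE) + tileN];
        for (int i = lane; i < TILE * TILE; i += 32) {
            int r = i / TILE, c = i % TILE;
            int8_t q = Wq[(k0 + r) * N + tileN * TILE + c];
            w_tile[i] = __float2half(static_cast<float>(q) * s);
        }
        __syncwarp();

        // Feed the activations and the freshly dequantized tile straight to
        // the Tensor Cores.
        wmma::load_matrix_sync(a_frag, A + (tileM * TILE) * K + k0, K);
        wmma::load_matrix_sync(b_frag, w_tile, TILE);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        __syncwarp();
    }

    wmma::store_matrix_sync(C + (tileM * TILE) * N + tileN * TILE, acc, N,
                            wmma::mem_row_major);
}
```

A launch would look like `fused_dequant_wmma<<<dim3(N / 16, M / 16), 32>>>(A, Wq, scales, C, M, N, K);` on sm_70 or newer. The real kernel tiles more aggressively (multiple warps per block, double-buffered shared memory) to reach the ~50 TFLOPS figure; this sketch only shows the fusion itself.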
Benchmarks
GPU: RTX 6000 Pro Workstation (96GB VRAM)
Model: Ministral-3-14B
mxfp8
| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| AQ only (main branch) | 2,458 | 101 | 16.3 |
| QMM (cuBLAS) | 928 | 87 | 18.3 |
| AQ + QMM (cuBLAS) | 1,002 | 91 | 16.3 |
| QMM (Tensor Core) | 1,518 | 94 | 16.6 |
| AQ + Tensor Core QMM | 2,320 | 101 | 16.3 |
mxfp4
| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| QMM (Tensor Core) | 1,703 | 123 | 9.7 |
If these results look promising, I'm happy to clean this up and send a PR. I'd love for this to be my first PR to mlx-core.