After benchmarking the current activation quantization approach suggested by @awni in #3122, I spent last couple days exploring quantized matrix multiplication (QMM). I started with a cuBLAS-based implementation as a baseline, then built a custom Tensor Core kernel that achieves significantly better results.
Key findings
- AQ (activation quantization) + Tensor Core QMM achieves prompt throughput within 6% of the AQ-only main branch.
- The custom Tensor Core kernel reaches ~50 TFLOPS for quantized matmul, a 2.5x improvement over the ~20 TFLOPS naive approach.
- The cuBLAS QMM was notably slower, halving prompt throughput and adding ~2 GB of peak memory.
- mxfp4 now works: it cuts peak memory nearly in half (9.7 GB) and delivers 21% faster generation than mxfp8.
How it works
Instead of dequantizing the weights to full precision and then running a separate matmul (two passes over the weights), the kernel dequantizes and accumulates in a single fused pass on the Tensor Cores.
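A minimal sketch of the fused idea is below, assuming a simplified int8-weight format with one float scale per 16x16 weight tile (not MLX's actual mxfp4/mxfp8 block layout); the kernel name, tile sizes, and scale layout are illustrative only, not the implementation in this branch.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int TILE = 16;  // WMMA operates on 16x16x16 half-precision tiles

// C[M,N] = A[M,K] (half activations) x dequant(Wq[K,N]) (int8 weights with one
// float scale per 16x16 tile). One warp (32 threads) computes one 16x16 output
// tile; M, N, K are assumed to be multiples of 16.
__global__ void fused_dequant_wmma(const __half* A, const int8_t* Wq,
                                   const float* scales, float* C,
                                   int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row strip of C
    int tileN = blockIdx.x;   // which 16-col strip of C
    int lane  = threadIdx.x;  // lane id within the single warp of this block

    __shared__ __align__(32) __half w_tile[TILE * TILE];  // dequantized weight tile

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Dequantize one 16x16 weight tile into shared memory. This is the
        // "fused" step: weights never round-trip through a full-precision
        // buffer in global memory.
        float s = scales[(k0 / TILE) * (N / TILE) + tileN];
        for (int i = lane; i < TILE * TILE; i += 32) {
            int r = i / TILE, c = i % TILE;
            int8_t q = Wq[(k0 + r) * N + tileN * TILE + c];
            w_tile[i] = __float2half(static_cast<float>(q) * s);
        }
        __syncwarp();

        // Feed the activations and the freshly dequantized tile straight to
        // the Tensor Cores.
        wmma::load_matrix_sync(a_frag, A + (tileM * TILE) * K + k0, K);
        wmma::load_matrix_sync(b_frag, w_tile, TILE);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        __syncwarp();
    }

    wmma::store_matrix_sync(C + (tileM * TILE) * N + tileN * TILE, acc, N,
                            wmma::mem_row_major);
}
```

A launch would look like `fused_dequant_wmma<<<dim3(N / 16, M / 16), 32>>>(A, Wq, scales, C, M, N, K);` on sm_70 or newer. The real kernel tiles more aggressively (multiple warps per block, double-buffered shared memory) to reach the ~50 TFLOPS figure; this sketch only shows the fusion itself.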
Benchmarks
GPU: RTX 6000 Pro Workstation (96GB VRAM)
Model: Ministral-3-14B
mxfp8
| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| AQ only (main branch) | 2,458 | 101 | 16.3 |
| QMM (cuBLAS) | 928 | 87 | 18.3 |
| AQ + QMM (cuBLAS) | 1,002 | 91 | 16.3 |
| QMM (Tensor Core) | 1,518 | 94 | 16.6 |
| AQ + Tensor Core QMM | 2,320 | 101 | 16.3 |
mxfp4
| Approach | Prompt (tok/s) | Generation (tok/s) | Peak Memory (GB) |
|---|---|---|---|
| QMM (Tensor Core) | 1,703 | 123 | 9.7 |
If these results look promising, I'm happy to clean this up and send a PR. I'd love for this to be my first PR to mlx-core.