Support float16 quantized matmul

Connor1996 · Connor1996 · commit cc850bc337ed · 2026-05-28T02:20:29.000-07:00
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
diff --git a/book/src/week2-02-quantized-matmul.md b/book/src/week2-02-quantized-matmul.md
@@ -43,7 +43,8 @@ We're using only ~4% of available compute!
 
 ### The Solution: Quantization
 
-By compressing weights from 16 bits (bfloat16) to 4 bits (int4), we:
+By compressing weights from 16-bit floating point (`float16` or `bfloat16`) to
+4-bit integers (int4), we:
 
 - **Reduce memory bandwidth by 4×**: 880 MB → ~220 MB per token
 - **Improve arithmetic intensity by 4×**: 1.0 → ~4.0 FLOPs/Byte
@@ -58,7 +59,7 @@ Instead of quantizing all weights uniformly, we divide them into **groups** and
 For a weight matrix $W$ of shape $(K, N)$, we divide each row into groups of size $G$. In this course we use Qwen3 MLX 4-bit weights, whose group size is fixed at 128:
 
 ```plain
-Original weight matrix W: K × N (bfloat16)
+Original weight matrix W: K × N (float16 or bfloat16)
 
 Group size: G = 128
 Number of groups per row = N / G
@@ -69,7 +70,10 @@ For each group of G consecutive values in a row:
   3. Quantize each value using: quantized = round((value - bias) / scale)
 ```
 
-All quantized matmul tests use `group_size = 128`, matching the Qwen3 MLX 4-bit weights used by the rest of the course.
+All quantized matmul tests use `group_size = 128`, matching the Qwen3 MLX
+4-bit weights used by the rest of the course. The tests cover both `float16`
+and `bfloat16` because different MLX checkpoints store their scales, biases,
+and activations in different 16-bit dtypes.
 
 ### Affine Quantization
 
@@ -119,15 +123,15 @@ Quantized: [0, 2, 7, 10, 15] (4 bits each)
 For efficient storage and computation, quantized weights are packed:
 
 ```plain
-Original: K × N bfloat16 (2 bytes each) = 2KN bytes
+Original: K × N float16/bfloat16 (2 bytes each) = 2KN bytes
 Quantized: K × N int4 (0.5 bytes each) = 0.5KN bytes
 
 Packing: 8 × 4-bit values fit in one uint32 (32 bits)
 
 Weight matrix shape: K × N
 Quantized storage shape: K × (N / 8) uint32
-Scales shape: K × (N / G) bfloat16
-Biases shape: K × (N / G) bfloat16
+Scales shape: K × (N / G) float16/bfloat16
+Biases shape: K × (N / G) float16/bfloat16
 ```
 
 Example packing for 8 consecutive 4-bit values `[a, b, c, d, e, f, g, h]`:
@@ -150,9 +154,9 @@ Unpacking:
 
 For standard matrix multiplication $C = AB^T$ where:
 
-- $A$: shape $(M, N)$, bfloat16 (activations)
+- $A$: shape $(M, N)$, float16 or bfloat16 (activations)
 - $B$: shape $(K, N)$, **quantized** to int4 (weights)
-- $C$: shape $(M, K)$, bfloat16 (output)
+- $C$: shape $(M, K)$, same 16-bit dtype as $A$ (output)
 
 Each element $C[i, k]$ is computed as:
 
@@ -186,13 +190,13 @@ This shows we can factor out the scale and bias per group, reducing the number o
 
 ```plain
 Input:
-  A: M × N (bfloat16, activations)
+  A: M × N (float16 or bfloat16, activations)
   B_quantized: K × (N/8) (uint32, packed weights)
-  scales: K × (N/G) (bfloat16)
-  biases: K × (N/G) (bfloat16)
+  scales: K × (N/G) (same dtype as A)
+  biases: K × (N/G) (same dtype as A)
 
 Output:
-  C: M × K (bfloat16)
+  C: M × K (same dtype as A)
 
 For each output element C[i, k]:
   sum = 0  # float accumulator
@@ -211,7 +215,7 @@ For each output element C[i, k]:
         a_value = A[i, g*G + p*8 + bit_offset/4]
         sum = sum + a_value * b_value
   
-  C[i, k] = bfloat16(sum)
+  C[i, k] = same_dtype_as_A(sum)
 ```
 
 ## Task 1: Implement QuantizedWeights
@@ -225,8 +229,8 @@ First, familiarize yourself with the `QuantizedWeights` class, which stores quan
 | Field | Shape | Description |
 |-------|-------|-------------|
 | `weight` | $(K, N/8)$ uint32 | Packed quantized weights. Each uint32 stores 8 consecutive 4-bit values. The original weight matrix has shape $(K, N)$, and after packing, it becomes $(K, N/8)$. |
-| `scales` | $(K, N/G)$ bfloat16 | Per-group scale factors for dequantization. Each group of $G$ consecutive values shares one scale. Recall: $\text{scale} = (v_{max} - v_{min}) / 15$ |
-| `biases` | $(K, N/G)$ bfloat16 | Per-group bias (offset) for dequantization. Recall: $\text{bias} = v_{min}$ |
+| `scales` | $(K, N/G)$ float16/bfloat16 | Per-group scale factors for dequantization. Each group of $G$ consecutive values shares one scale. Recall: $\text{scale} = (v_{max} - v_{min}) / 15$ |
+| `biases` | $(K, N/G)$ float16/bfloat16 | Per-group bias (offset) for dequantization. Recall: $\text{bias} = v_{min}$ |
 | `group_size` | int | Number of consecutive values that share the same scale/bias. For the Qwen3 MLX 4-bit weights used here, this is `128`. |
 | `bits` | int | Quantization bit width (typically 4, meaning values are in range $[0, 15]$) |
 
@@ -251,7 +255,11 @@ You need to touch three files, all within the `tiny_llm_ext` namespace:
 - **`bindings.cpp`** — Add an `m.def(...)` call to expose the function to Python.
 - **`quantized_matmul.cpp`** — Implement the `quantized_matmul(...)` function (validate inputs, compute output shape, return a lazy `mx::array`) and the `eval_cpu` method (allocate output, register arrays with the CPU encoder, dispatch the compute kernel).
 
-The `eval_cpu` implementation follows the same CPU encoder pattern as `axpby`: allocate output memory with `out.set_data(mx::allocator::malloc(out.nbytes()))`, register input/output arrays with the encoder, then dispatch a lambda that performs the actual computation. Inside the lambda, implement the nested loop from the Computation Flow section above — iterate over each output element `(i, k)`, dequantize each packed value, accumulate the products in `float`, and write the `bfloat16` result to the output.
+The `eval_cpu` implementation follows the same CPU encoder pattern as `axpby`: allocate output memory with `out.set_data(mx::allocator::malloc(out.nbytes()))`, register input/output arrays with the encoder, then dispatch a lambda that performs the actual computation. Inside the lambda, implement the nested loop from the Computation Flow section above — iterate over each output element `(i, k)`, dequantize each packed value, accumulate the products in `float`, and write the result back as either `float16` or `bfloat16`, matching the input dtype.
+
+Follow the `axpby` dtype-dispatch pattern here: write the CPU implementation as
+a template, then dispatch with `mx::float16_t` or `mx::bfloat16_t` based on the
+output dtype.
 
 Don't forget to add `src/quantized_matmul.cpp` to `target_sources` in `CMakeLists.txt`.
 
@@ -276,22 +284,24 @@ In this task, you will write the Metal kernel for quantized matmul **and** wire
 You need to implement one kernel entry in `quantized_matmul.metal`:
 
 - Use a **one-thread-per-output-element** mapping: each thread computes `out[i, k]`.
-- The kernel should use `bfloat16_t` inputs and outputs.
+- The kernel should support both `half` and `bfloat16_t` inputs and outputs.
 - Apply the same group-wise dequantization loop as the CPU version:
   - Iterate over groups of 128 values
   - Unpack int4 values from packed `uint32`
   - Dequantize with `q * scale + bias`
-  - Accumulate the products in `float` and cast the final output back to `bfloat16_t`
+  - Accumulate the products in `float` and cast the final output back to the kernel dtype
 - Add boundary checks (`i < M`, `k < K`) before writing output.
 
 The custom kernel only needs to handle `bits = 4` and `group_size = 128`. Use that group size to compute `groups_per_row` and the packed weight offsets.
+Instantiate the same templated Metal kernel twice, once for `half` and once for
+`bfloat16_t`, and select the matching kernel name in `eval_gpu`.
 
 ### GPU Dispatch
 
 Complete the `eval_gpu` method in `quantized_matmul.cpp` to dispatch your Metal kernel. Follow the same pattern as `axpby`'s GPU dispatch:
 
 1. Get the Metal device and command encoder from the stream.
-2. Load the bfloat16 quantized matmul kernel from the Metal library.
+2. Load the quantized matmul kernel matching the output dtype from the Metal library.
 3. Set input/output buffers and dimension constants (`M`, `N`, `K`) on the encoder — make sure the buffer order matches your kernel signature.
 4. Calculate a 2D thread group configuration: use `kernel->maxTotalThreadsPerThreadgroup()` to determine the total threads, then split between the M and K dimensions (e.g., 32 threads for M, the rest for K).
 5. Dispatch with `dispatchThreadgroups`.
@@ -311,9 +321,9 @@ src/tiny_llm/qwen3_week2.py
 
 Integrate your quantized matmul into the Week 2 Qwen3 model so that inference runs on quantized weights end-to-end.
 
-Change the weight type from `mx.array` to `QuantizedWeights` for all linear layers in attention (`wq/wk/wv/wo`) and MLP (`w_gate/w_up/w_down`). Replace every `linear(x, w)` call with `quantized_linear(x, w)`. In the model loading code, use `QuantizedWeights.from_mlx_layer(...)` to extract quantized weight information from each MLX linear layer, instead of calling `mx.dequantize` to get a full bfloat16 matrix. Make sure the Week 1 loader still dequantizes (since Week 1 layers expect plain `mx.array`), while the Week 2 loader does **not** dequantize.
+Change the weight type from `mx.array` to `QuantizedWeights` for all linear layers in attention (`wq/wk/wv/wo`) and MLP (`w_gate/w_up/w_down`). Replace every `linear(x, w)` call with `quantized_linear(x, w)`. In the model loading code, use `QuantizedWeights.from_mlx_layer(...)` to extract quantized weight information from each MLX linear layer, instead of calling `mx.dequantize` to get a full 16-bit matrix. Make sure the Week 1 loader still dequantizes (since Week 1 layers expect plain `mx.array`), while the Week 2 loader does **not** dequantize.
 
-Qwen3 MLX quantized layers use **bfloat16** for the tensors involved in dequantization. Your kernel should take `scales`, `biases`, and activations as bfloat16. If you see `nan` or garbage output, a dtype mismatch is the most likely cause.
+Qwen3 MLX quantized layers may use **float16** or **bfloat16** for the tensors involved in dequantization. Your kernel should accept `scales`, `biases`, and activations in either dtype, require them to match, and return the same dtype. If you see `nan` or garbage output, a dtype mismatch is the most likely cause.
 
 Also keep the quantized layer's parameters. The model code should pass through `w.group_size` and `w.bits`; the extension should validate that they match the Qwen3 course assumptions: `group_size = 128` and `bits = 4`.
 
diff --git a/src/extensions_ref/src/quantized_matmul.cpp b/src/extensions_ref/src/quantized_matmul.cpp
@@ -23,8 +23,8 @@ mx::array quantized_matmul(const mx::array &scales,         // Input array scale
                            const bool transpose_b,          // Whether to transpose b
                            mx::StreamOrDevice s /* = {} */  // Stream on which to schedule the operation
 ) {
-    if (scales.dtype() != mx::bfloat16) {
-        throw std::runtime_error("quantized_matmul: scales must be bfloat16");
+    if (scales.dtype() != mx::float16 && scales.dtype() != mx::bfloat16) {
+        throw std::runtime_error("quantized_matmul: scales must be float16 or bfloat16");
     }
     if (scales.dtype() != biases.dtype()) {
         throw std::runtime_error("quantized_matmul: scales and biases must be the same dtype");
@@ -81,6 +81,7 @@ mx::array quantized_matmul(const mx::array &scales,         // Input array scale
         /* const std::vector<mx::array>& inputs = */ {scales, biases, a, b});
 }
 
+template <typename T>
 void quantized_matmul_impl(const mx::array &scales, const mx::array &biases, const mx::array &a, const mx::array &b,
                            mx::array &out, mx::Stream stream) {
     out.set_data(mx::allocator::malloc(out.nbytes()));
@@ -99,7 +100,7 @@ void quantized_matmul_impl(const mx::array &scales, const mx::array &biases, con
         throw std::runtime_error("quantized_matmul: b must be contiguous");
     }
 
-    encoder.dispatch([out_ptr = out.data<mx::bfloat16_t>(), out_shape = out.shape(), out_strides = out.strides(),
+    encoder.dispatch([out_ptr = out.data<T>(), out_shape = out.shape(), out_strides = out.strides(),
                       a = mx::array::unsafe_weak_copy(a), b = mx::array::unsafe_weak_copy(b),
                       scales = mx::array::unsafe_weak_copy(scales), biases = mx::array::unsafe_weak_copy(biases)]() {
         int M = a.shape()[0];
@@ -108,10 +109,10 @@ void quantized_matmul_impl(const mx::array &scales, const mx::array &biases, con
         const int group_size = 128;
         const int bits = 4;
         const int group_per_row = N / group_size;
-        const mx::bfloat16_t *a_ptr = a.data<mx::bfloat16_t>();
+        const T *a_ptr = a.data<T>();
         const uint32_t *b_ptr = b.data<uint32_t>();
-        const mx::bfloat16_t *scales_ptr = scales.data<mx::bfloat16_t>();
-        const mx::bfloat16_t *biases_ptr = biases.data<mx::bfloat16_t>();
+        const T *scales_ptr = scales.data<T>();
+        const T *biases_ptr = biases.data<T>();
         uint32_t item_mask = (1 << bits) - 1;
         for (int i = 0; i < M; i++) {
             for (int k = 0; k < K; k++) {
@@ -121,8 +122,8 @@ void quantized_matmul_impl(const mx::array &scales, const mx::array &biases, con
                         mx::elem_to_loc(k * group_per_row + group_idx, scales.shape(), scales.strides());
                     int64_t biases_loc =
                         mx::elem_to_loc(k * group_per_row + group_idx, biases.shape(), biases.strides());
-                    mx::bfloat16_t scale = scales_ptr[scales_loc];
-                    mx::bfloat16_t bias = biases_ptr[biases_loc];
+                    T scale = scales_ptr[scales_loc];
+                    T bias = biases_ptr[biases_loc];
                     int64_t b_loc = mx::elem_to_loc((k * N + group_idx * group_size) / 8, b.shape(), b.strides());
                     int64_t a_loc = mx::elem_to_loc(i * N + group_idx * group_size, a.shape(), a.strides());
                     const int packs_per_item = 32 / bits;
@@ -140,7 +141,7 @@ void quantized_matmul_impl(const mx::array &scales, const mx::array &biases, con
                     }
                 }
                 int64_t out_loc = mx::elem_to_loc(i * K + k, out_shape, out_strides);
-                out_ptr[out_loc] = static_cast<mx::bfloat16_t>(sum);
+                out_ptr[out_loc] = static_cast<T>(sum);
             }
         }
     });
@@ -153,7 +154,13 @@ void QuantizedMatmul::eval_cpu(const std::vector<mx::array> &inputs, std::vector
     auto &b = inputs[3];
     auto &out = outputs[0];
 
-    quantized_matmul_impl(scales, biases, a, b, out, stream());
+    if (out.dtype() == mx::float16) {
+        return quantized_matmul_impl<mx::float16_t>(scales, biases, a, b, out, stream());
+    } else if (out.dtype() == mx::bfloat16) {
+        return quantized_matmul_impl<mx::bfloat16_t>(scales, biases, a, b, out, stream());
+    } else {
+        throw std::runtime_error("quantized_matmul: output must be float16 or bfloat16");
+    }
 }
 
 void QuantizedMatmul::eval_gpu(const std::vector<mx::array> &inputs, std::vector<mx::array> &outputs) {
@@ -169,7 +176,8 @@ void QuantizedMatmul::eval_gpu(const std::vector<mx::array> &inputs, std::vector
 
     // Make a kernel from this metal library
     auto library = d.get_library("tiny_llm_ext_ref");
-    const char* kernel_name = "quantized_matmul_w4a16_g128_bf16";
+    const char* kernel_name = out.dtype() == mx::float16 ? "quantized_matmul_w4a16_g128_f16"
+                                                         : "quantized_matmul_w4a16_g128_bf16";
     auto kernel = d.get_kernel(kernel_name, library);
 
     // Prepare to encode kernel
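
Note on the CPU loop: the inner unpacking step is elided from the hunks above, so here is a minimal pure-Python sketch of how one output element could be computed from packed weights. The names `pack_row` and `quantized_dot` are illustrative and not part of the extension. The low-bits-first packing order is an assumption inferred from the `bit_offset/4` indexing in the book's pseudocode.

```python
GROUP_SIZE = 128        # fixed by the course's Qwen3 MLX 4-bit weights
BITS = 4
PER_WORD = 32 // BITS   # 8 four-bit values per uint32


def pack_row(q_values):
    """Pack 4-bit integers into uint32 words (assumed: lowest index in lowest bits)."""
    assert len(q_values) % PER_WORD == 0
    words = []
    for start in range(0, len(q_values), PER_WORD):
        word = 0
        for p, q in enumerate(q_values[start:start + PER_WORD]):
            assert 0 <= q < (1 << BITS)
            word |= q << (p * BITS)
        words.append(word)
    return words


def quantized_dot(a_row, packed_row, scales_row, biases_row):
    """Compute sum_j a[j] * (q[j] * scale + bias) for one output element."""
    mask = (1 << BITS) - 1
    words_per_group = GROUP_SIZE // PER_WORD
    total = 0.0  # wide accumulator; the real kernel casts to the 16-bit dtype at the end
    for g, (scale, bias) in enumerate(zip(scales_row, biases_row)):
        for w in range(words_per_group):
            word = packed_row[g * words_per_group + w]
            for p in range(PER_WORD):
                q = (word >> (p * BITS)) & mask        # unpack one 4-bit value
                j = g * GROUP_SIZE + w * PER_WORD + p  # matching activation index
                total += a_row[j] * (q * scale + bias)
    return total


# One row, one group: q cycles through 0..15, scale 0.5, bias -4.0, all-ones activations.
q_row = [j % 16 for j in range(GROUP_SIZE)]
packed = pack_row(q_row)
a_row = [1.0] * GROUP_SIZE
result = quantized_dot(a_row, packed, [0.5], [-4.0])
expected = sum(q * 0.5 - 4.0 for q in q_row)
print(hex(packed[0]), result, expected)
```

The sketch uses Python floats throughout, so it says nothing about 16-bit rounding. It is only meant to make the indexing between packed words, groups, and activations concrete.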
diff --git a/src/extensions_ref/src/quantized_matmul.metal b/src/extensions_ref/src/quantized_matmul.metal
@@ -51,4 +51,5 @@ template <typename T>
     }
 }
 
+instantiate_kernel("quantized_matmul_w4a16_g128_f16", quantized_matmul_w4a16_g128, half);
 instantiate_kernel("quantized_matmul_w4a16_g128_bf16", quantized_matmul_w4a16_g128, bfloat16_t);
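
Note on the kernel body: the book text referenced in the hunk headers above mentions that scale and bias can be factored out per group. The kernel body is not shown in this diff, so the following is only a sketch of that identity, not a description of what the reference kernel does. Both function names are hypothetical.

```python
def dot_dequantize_each(a_group, q_group, scale, bias):
    """Dequantize every weight, then multiply and accumulate."""
    total = 0.0
    for a, q in zip(a_group, q_group):
        total += a * (q * scale + bias)
    return total


def dot_factored(a_group, q_group, scale, bias):
    """Same sum, using sum(a * (s*q + b)) == s * sum(a*q) + b * sum(a)."""
    aq_sum = 0.0
    a_sum = 0.0
    for a, q in zip(a_group, q_group):
        aq_sum += a * q
        a_sum += a
    return scale * aq_sum + bias * a_sum


# Tiny example with exactly representable values (a real group has 128 entries).
a_group = [0.5, -1.0, 2.0, 0.25]
q_group = [3, 15, 0, 8]
per_value = dot_dequantize_each(a_group, q_group, scale=0.25, bias=-2.0)
factored = dot_factored(a_group, q_group, scale=0.25, bias=-2.0)
print(per_value, factored)
```

The factored form applies the scale and bias once per group instead of once per value. With arbitrary inputs the two results can differ in the last bits because the floating-point operations are reordered, so compare them with a tolerance rather than exact equality.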
diff --git a/src/extensions_ref/test.py b/src/extensions_ref/test.py
@@ -2,16 +2,17 @@
 import mlx.core as mx
 import numpy as np
 
-input = mx.array(np.random.randn(3, 128)).astype(mx.bfloat16)
-weight = mx.array(np.random.randn(5, 128)).astype(mx.bfloat16)
-w_q, scales, biases = mx.quantize(weight, group_size=128, bits=4)
-user_out = quantized_matmul(
-    scales=scales,
-    biases=biases,
-    group_size=128,
-    bits=4,
-    a=input,
-    b=w_q,
-    transpose_b=True,
-)
-print(user_out)
+for dtype in (mx.float16, mx.bfloat16):
+    input = mx.array(np.random.randn(3, 128)).astype(dtype)
+    weight = mx.array(np.random.randn(5, 128)).astype(dtype)
+    w_q, scales, biases = mx.quantize(weight, group_size=128, bits=4)
+    user_out = quantized_matmul(
+        scales=scales,
+        biases=biases,
+        group_size=128,
+        bits=4,
+        a=input,
+        b=w_q,
+        transpose_b=True,
+    )
+    print(dtype, user_out)
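
Note on the two dtypes: `float16` and `bfloat16` both occupy 16 bits but round the same value differently, which is one reason the tests now exercise each. The sketch below emulates both roundings with only the Python standard library. The helper names are hypothetical, and the `bfloat16` helper assumes finite inputs well inside the float32 range (no NaN, infinity, or overflow handling).

```python
import struct


def round_to_float16(x):
    """Round to IEEE half precision (5 exponent bits, 10 fraction bits) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bfloat16(x):
    """Round to bfloat16 (8 exponent bits, 7 fraction bits) and back.

    bfloat16 is the top 16 bits of a float32; this applies round-to-nearest-even
    on the discarded lower 16 bits. Finite, moderate inputs only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding increment, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 0.1
f16 = round_to_float16(value)
bf16 = round_to_bfloat16(value)
print("float16 :", f16, "error", abs(f16 - value))
print("bfloat16:", bf16, "error", abs(bf16 - value))

# bfloat16 trades precision for range: 70000.0 is representable (coarsely) in
# bfloat16, while struct raises OverflowError when packing it as float16.
print("bfloat16(70000.0):", round_to_bfloat16(70000.0))
```

Because `bfloat16` keeps fewer fraction bits, its per-element rounding error is larger for values of similar magnitude, while `float16` has the narrower exponent range.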
diff --git a/tests_refsol/test_week_2_day_2.py b/tests_refsol/test_week_2_day_2.py
@@ -1,16 +1,20 @@
 import mlx.core as mx
-from .tiny_llm_base import *
-from .utils import *
+from .tiny_llm_base import quantized_matmul
+from .utils import assert_allclose
 
 
-def quantized_matmul_helper(stream: mx.Stream, identity_matrix: bool):
+def quantized_matmul_helper(
+    stream: mx.Stream,
+    precision: mx.Dtype,
+    identity_matrix: bool,
+):
     with mx.stream(stream):
         group_size = 128
         if identity_matrix:
-            input = mx.eye(group_size, dtype=mx.bfloat16)
+            input = mx.eye(group_size, dtype=precision)
         else:
-            input = mx.random.normal(shape=(3, group_size), dtype=mx.bfloat16)
-        weight = mx.random.normal(shape=(5, group_size), dtype=mx.bfloat16)
+            input = mx.random.normal(shape=(3, group_size), dtype=precision)
+        weight = mx.random.normal(shape=(5, group_size), dtype=precision)
         w_q, scales, biases = mx.quantize(weight, group_size=group_size, bits=4)
         user_out = quantized_matmul(
             scales=scales,
@@ -31,28 +35,44 @@ def quantized_matmul_helper(stream: mx.Stream, identity_matrix: bool):
             transpose=True,
         )
         if identity_matrix:
-            assert_allclose(user_out, ref_out, mx.bfloat16)
+            assert_allclose(user_out, ref_out, precision)
         else:
             assert_allclose(
                 user_out,
                 ref_out,
-                mx.bfloat16,
+                precision,
                 atol=5.0e-1,
-                message="quantized matmul bf16 comparison",
+                message=f"quantized matmul {precision} comparison",
             )
 
 
 def test_task_2_quantized_matmul_simple_bf16_cpu():
-    quantized_matmul_helper(mx.cpu, True)
+    quantized_matmul_helper(mx.cpu, mx.bfloat16, True)
 
 
 def test_task_2_quantized_matmul_complex_bf16_cpu():
-    quantized_matmul_helper(mx.cpu, False)
+    quantized_matmul_helper(mx.cpu, mx.bfloat16, False)
+
+
+def test_task_2_quantized_matmul_simple_f16_cpu():
+    quantized_matmul_helper(mx.cpu, mx.float16, True)
+
+
+def test_task_2_quantized_matmul_complex_f16_cpu():
+    quantized_matmul_helper(mx.cpu, mx.float16, False)
 
 
 def test_task_3_quantized_matmul_simple_bf16_gpu():
-    quantized_matmul_helper(mx.gpu, True)
+    quantized_matmul_helper(mx.gpu, mx.bfloat16, True)
 
 
 def test_task_3_quantized_matmul_complex_bf16_gpu():
-    quantized_matmul_helper(mx.gpu, False)
+    quantized_matmul_helper(mx.gpu, mx.bfloat16, False)
+
+
+def test_task_3_quantized_matmul_simple_f16_gpu():
+    quantized_matmul_helper(mx.gpu, mx.float16, True)
+
+
+def test_task_3_quantized_matmul_complex_f16_gpu():
+    quantized_matmul_helper(mx.gpu, mx.float16, False)
