Hardware-accelerated QR decomposition for Apple Silicon, built on MLX and Metal and using the GPU to maximise performance.

The library exposes a single function, `custom_math::qr_accelerated`, which accepts a batched MLX array and returns Q and R, automatically routing to the most efficient Metal kernel for the given matrix dimensions and batch size. See the benchmark results for GPU vs CPU timings across a range of matrix shapes and batch sizes.
This library depends on the MLX C++ library. On macOS:

```shell
brew install mlx
```

```cpp
#include "qr.h"

// a: MLX array of shape [M, N] or [B, M, N] (float32 or auto-cast)
// Returns: {Q, R} where Q is [M, K] and R is [K, N], K = min(M, N)
auto [Q, R] = custom_math::qr_accelerated(a);
```

The function operates on the default MLX device. Set it to GPU before calling:
```cpp
#include "qr.h"
#include <mlx/mlx.h>

using namespace mlx::core;

// Build a 4 x 4 matrix
std::vector<float> data = {
     1,  2,  3,  4,
     5,  6,  7,  8,
     9, 10, 11, 12,
    13, 14, 15, 16
};
array A(data.begin(), {4, 4}, float32);

set_default_device(Device::gpu);
auto [Q, R] = custom_math::qr_accelerated(A);
eval({Q, R});

// Q: [4, 4] orthogonal matrix
// R: [4, 4] upper-triangular matrix
// A ≈ Q * R
```

Given a matrix $A \in \mathbb{R}^{M \times N}$ with $M \ge N$, the QR decomposition factors it into a matrix $Q \in \mathbb{R}^{M \times N}$ with orthonormal columns, where $Q^T Q = I$, and an upper-triangular matrix $R \in \mathbb{R}^{N \times N}$, which can be stated compactly as

$$A = QR.$$

This is the thin (or reduced) QR decomposition. The full decomposition extends $Q$ to a square $M \times M$ orthogonal matrix and pads $R$ with $M - N$ rows of zeros.
For example, given:

```
A = [[ 1,  2,  3 ],
     [ 4,  5,  6 ],
     [ 7,  8,  9 ]]
```

the thin QR decomposition yields:

```
Q = [[ -0.123,  0.904,  0.408 ],      R = [[ -8.124,  -9.601, -11.078 ],
     [ -0.492,  0.301, -0.816 ],           [  0.0,     0.905,   1.809 ],
     [ -0.862, -0.301,  0.408 ]]           [  0.0,     0.0,     0.0   ]]
```
One can verify that $Q^T Q = I$ and that $QR$ reproduces $A$; the zero in the last diagonal entry of $R$ reflects the fact that this $A$ has rank 2.
The decomposition is fundamental to solving linear least-squares problems, performing Gram-Schmidt orthogonalisation, and as the core step in the QR algorithm for computing eigenvalues.
- G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Johns Hopkins University Press, 2013. §5.1 — Householder reflections and QR factorisation.
- R. Schreiber and C. Van Loan, "A storage-efficient WY representation for products of Householder transformations", SIAM Journal on Scientific and Statistical Computing, vol. 10, no. 1, pp. 53–57, 1989 — the Compact WY representation used for block updates.
Both shaders build Householder reflectors

$$H_k = I - \tau_k v_k v_k^T,$$

chosen so that $H_k$ zeroes the sub-diagonal entries of column $k$:

$$H_k x = \mu\, e_1, \qquad \mu = -\operatorname{sign}(\alpha)\,\|x\|,$$

where $x$ is the diagonal-and-below part of the column, $\alpha$ is its first entry, and $e_1$ is the first canonical basis vector. Applying a reflector to a column $a$ costs one dot product and one scaled subtraction:

$$H_k a = a - \tau_k (v_k^T a)\, v_k.$$

Since each reflector is a rank-one update, applying them one at a time is memory-bound. For a block of $b$ reflectors, the compact WY representation collapses the product into

$$H_1 H_2 \cdots H_b = I - V T V^T,$$

where $V = [\,v_1 \mid \cdots \mid v_b\,]$ is unit lower-trapezoidal and $T$ is $b \times b$ upper triangular. This lets a full block update be expressed as a pair of matrix multiplications:

$$A \leftarrow A - V \bigl(T^T (V^T A)\bigr),$$

which maps directly onto the AMX matrix coprocessor's 8×8 simdgroup_matrix tiles.
This kernel processes the entire matrix in one GPU dispatch. It operates on matrices stored in column-major format to align with AMX load/store strides, and pads dimensions to multiples of 32 (rows) and 16 (columns) to eliminate in-kernel boundary branching.
For each block of $b$ columns, the kernel performs the following steps.

Step 1 — Panel factorisation. For each column $k$ of the panel:

- All 1024 threads cooperatively compute $|x_\text{tail}|^2$ via a two-phase threadgroup reduction (intra-SIMD via `simd_sum`, then inter-SIMD via shared memory).
- Thread 0 computes $\mu$, $\tau$, and the scale $1/(\alpha - \mu)$, and broadcasts them through threadgroup memory.
- Each thread normalises its portion of the tail: $A[r, k] \leftarrow A[r, k] / (\alpha - \mu)$ for $r > k$.
- The reflector is applied to the remaining $b - k - 1$ columns of the panel: for each $j > k$, $a_j \leftarrow a_j - \tau (v_k^T a_j)\, v_k$, with the dot product accumulated via threadgroup reduction.
Step 2 — Form T. The $b \times b$ upper-triangular matrix $T$ of the compact WY representation is built one column at a time via the Schreiber–Van Loan recurrence: $T_{1:k-1,\,k} = -\tau_k\, T_{1:k-1,\,1:k-1} (V^T v_k)_{1:k-1}$ and $T_{k,k} = \tau_k$.
Step 3 — Trailing matrix update. Each SIMD group owns a set of 8-column tiles of the trailing submatrix and applies the blocked update $C \leftarrow C - V\,T^T (V^T C)$ to its tiles with simdgroup matrix multiplies.
Step 4 — Q accumulation. The same WY update is applied from the right to the accumulating orthogonal factor (initialised to the identity): $Q \leftarrow Q - (Q V)\, T\, V^T$.
Step 5 — Restore diagonal. The stored diagonal entries of $R$ (kept aside while the reflector vectors occupy the lower triangle) are written back, and the strictly lower triangle is zeroed so that $R$ is returned upper-triangular.
For large matrices (those the dispatcher routes to `qr_streaming_amx`; see the table below), a single kernel dispatch cannot keep the GPU busy, so the factorisation is split across multiple kernels. The block size is widened to 32-column panels to amortise the extra launch overhead.
Kernel 0 — Preprocess. The input is transposed from row-major to column-major and padded with identity blocks.
Kernel 1 — Panel factorisation. Dispatched with 1 threadgroup per matrix. For each column of the current 32-wide panel, the reflector is computed and applied to the rest of the panel, as in the unblocked kernel's Step 1.
Kernel 2 — T-matrix construction. Dispatched with 1 threadgroup per matrix. Reads the reflector columns from the factorised panel and builds $T$ with its sign flipped, so the trailing update can use simdgroup_multiply_accumulate (which adds) rather than needing a subtract path.
Kernel 3 — Grid-parallel trailing update. Dispatched with one threadgroup per 32-column tile of the trailing submatrix. Each threadgroup independently computes $C_j \leftarrow C_j - V\,T^T (V^T C_j)$ for its tile $C_j$, so tiles proceed fully in parallel.
Kernel 4 — Haar fix. Ensures the output follows a fixed sign convention: the diagonal of $R$ is made non-negative by flipping the signs of the affected rows of $R$ and columns of $Q$, which leaves the product $QR$ unchanged (and, for Gaussian random inputs, yields a Haar-distributed $Q$).
Two Metal kernels handle different regimes, with a dispatcher that selects between them at runtime:
- `qr_unblocked` — Standard Householder QR in a single kernel dispatch. Used for smaller matrices, where the overhead of multi-pass streaming is not worth it.
- `qr_streaming_amx` — Multi-pass panel factorisation with grid-parallel trailing matrix updates, designed to saturate the GPU for large matrices. It factorises column panels of width 32, computes the T-matrix for each WY representation, then launches a grid of threadgroups for the trailing update.
| Condition | Kernel |
|---|---|
| `max(M, N) >= 512` | `qr_streaming_amx` |
| `max(M, N) < 512` and `batch >= 16` | `qr_unblocked` |
| `max(M, N) >= 128` and `batch < 16` | `qr_streaming_amx` |
| `max(M, N) < 128` and `batch < 16` | `qr_unblocked` |
The crossover points reflect two competing costs: the unblocked kernel's single-threadgroup Q-accumulation becomes a bottleneck above ~512, while multi-kernel launch overhead makes streaming slower than unblocked for small matrices with few batch elements.
Both backends cache compiled `MTLComputePipelineState` objects and recycle GPU memory workspaces on repeated calls with the same shape, so warm invocations avoid both JIT compilation and OS-level buffer allocation.
- Apple Silicon Mac (M1 or later)
- macOS with Xcode command line tools (`xcrun`, `metal`, `metallib`)
- MLX installed and findable by CMake
- CMake 3.25+
- C++20
TODO: exportable library packaging is planned for a future release.
Clone the repository and build in release mode:
```shell
git clone https://github.com/cormaccinnsealach/qr-apple-silicon.git
cd qr-apple-silicon
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target benchmark_qr
```

Run the benchmark:

```shell
./build/benchmark_qr
```

The benchmark sweeps every configuration automatically:
| Class | Configs (M × N) | Batch sizes |
|---|---|---|
| Small (min(M, N) < 512) | 64×64, 128×64, 256×128, 512×256 | 100, 500, 1000, 5000, 10000, 15000 |
| Large (min(M, N) ≥ 512) | 512×512, 1024×512, 5000×5000 | 1, 8, 16, 32 |
Per-config batch limits are set via the `max_batch` field in `SMALL_CONFIGS` at the top of `tests/benchmark_qr.cpp`. Any batch size above this value is skipped for that config and shown as `--` in the table. Set `max_batch = 0` to run a config at all batch sizes.
```cpp
// tests/benchmark_qr.cpp
static const std::vector<BenchConfig> SMALL_CONFIGS = {
    { 64,  64},
    {128,  64},
    {256, 128, 5000},  // skipped for batch > 5000
    {512, 256, 1000},  // skipped for batch > 1000; raise or set to 0 if memory allows
};
```

Each configuration runs for 5 timed repetitions (after 2 discarded warmup runs) and reports GPU time, CPU time (MLX CPU), speedup, and the mean Frobenius reconstruction error $\|QR - A\|_F$ for both backends.