Commit b976f48

Integrated matrix extension (#26)
* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0
* Separate order-preserving vs transposing tile loads/stores
* Fixes based on feedback from initial IME TG internal review
1 parent 49d231e commit b976f48

File tree

1 file changed (+3, -3 lines changed)


src/integrated-matrix.adoc

Lines changed: 3 additions & 3 deletions
@@ -12,7 +12,7 @@ Dedicated matrix-multiply accelerators often require new register state—separa
 
 The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls, Zvvmttls) takes a different approach: it accelerates matrix multiplication using _only_ the 32 × VLEN architected vector registers already defined by the RISC-V "V" vector extension.
 By interpreting groups of existing vector registers as two-dimensional matrix tiles, the Zvvm family of Integrated Matrix extensions delivers high arithmetic density without introducing any new architected state.
-We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ), B (ν × λ) and C (μ × ν) are row-major matrix panels.
+We focus, in particular, on the computation of C ← A × B^T^ + C, where A (σ × λ), B (σ × λ) and C (σ × σ) are row-major matrix panels.
 
 The extensions are designed to support implementations spanning a wide range of microarchitectures and performance points: from small, embedded in-order cores targeting low-power and area-constrained applications, to large, high-performance out-of-order implementations targeting HPC and AI workloads.
 A key design goal is that the same binary executes correctly—and achieves near-peak arithmetic throughput—across this entire range without recompilation.
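The changed line above pins down the panel shapes: A and B are both σ × λ and C is σ × σ, all row-major. A minimal scalar sketch of the C ← A × Bᵀ + C update under those shapes (function and parameter names are illustrative, not taken from the specification):

```c
#include <stddef.h>

/* Scalar reference for C <- A * B^T + C with the shapes from the diff:
 * A is sigma x lambda, B is sigma x lambda, C is sigma x sigma,
 * all row-major. Because B is accessed by rows, B^T needs no explicit
 * transpose: element (j, k) of B supplies column j of B^T. */
static void panel_ref(size_t sigma, size_t lambda,
                      const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < sigma; i++)
        for (size_t j = 0; j < sigma; j++) {
            float acc = c[i * sigma + j];
            for (size_t k = 0; k < lambda; k++)
                acc += a[i * lambda + k] * b[j * lambda + k]; /* B^T */
            c[i * sigma + j] = acc;
        }
}
```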
@@ -1008,7 +1008,7 @@ Loads a 2D matrix tile from memory into the vector register group starting at `v
 Let _linesize_ = λ × LMUL.
 For each element index `i` in the body `[vstart, VL)` where the mask is enabled:
 
-VD[i] = M[rs1 + (SEW ÷ 8) × ((i / linesize) × LD + (i % linesizea))]
+VD[i] = M[rs1 + (SEW ÷ 8) × ((i / linesize) × LD + (i % linesize))]
 
 This instruction is the correct choice when A is stored in row-major order or when B is
 stored in column-major order: in both cases the memory layout consists of _linesize_-element
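The corrected address calculation can be cross-checked against a scalar reference model. The sketch below walks byte-addressable memory exactly as the formula prescribes; the names (`tile_load_ref`, `mem`, `sew_bytes`) are illustrative and masking is omitted:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference model of the corrected tile-load semantics:
 *   VD[i] = M[rs1 + (SEW/8) * ((i / linesize) * LD + (i % linesize))]
 * i.e. element i lands in row (i / linesize), column (i % linesize)
 * of a tile whose rows are LD elements apart in memory. */
static void tile_load_ref(const uint8_t *mem, size_t rs1,
                          uint8_t *vd, size_t sew_bytes,
                          size_t ld, size_t linesize,
                          size_t vstart, size_t vl)
{
    for (size_t i = vstart; i < vl; i++) {
        size_t row  = i / linesize;               /* tile row            */
        size_t col  = i % linesize;               /* offset within row   */
        size_t addr = rs1 + sew_bytes * (row * ld + col);
        for (size_t b = 0; b < sew_bytes; b++)    /* copy one element    */
            vd[i * sew_bytes + b] = mem[addr + b];
    }
}
```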
@@ -1636,7 +1636,7 @@ The type suffix in a long-form multiply-accumulate intrinsic name directly encod
 Either way, source code with IME intrinsics is tied to a specific combination of input/output types and value of MUL_C.
 
 Although it is possible to write more general assembly code, it is common industry practice to favor coding with compiler intrinsics.
-The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of C_MUL.
+The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of MUL_C.
 Runtime selection of the appropriate code path is then performed based on the result of `vsetvl` and computations of MUL_C = VLEN / (SEW × λ²).
 
 [#integrated-matrix-insns,reftext="Instructions (in alphabetical order)"]
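The multi-versioning pattern described in this hunk can be sketched as follows. All names here are hypothetical (the spec only prescribes the formula MUL_C = VLEN / (SEW × λ²)); in real code `vlen_bits` would be probed from the hardware, e.g. via a `vsetvl` result, rather than passed in:

```c
/* Sketch of runtime selection among code paths packaged in one
 * executable, each tuned for a specific MUL_C. Computes
 *   MUL_C = VLEN / (SEW * lambda^2)
 * once, then dispatches to the matching kernel. */
static unsigned compute_mul_c(unsigned vlen_bits, unsigned sew_bits,
                              unsigned lambda)
{
    return vlen_bits / (sew_bits * lambda * lambda);
}

/* Hypothetical kernels, one per supported MUL_C value. */
static void matmul_mul_c_1(void) { /* path tuned for MUL_C == 1 */ }
static void matmul_mul_c_4(void) { /* path tuned for MUL_C == 4 */ }

static void run_matmul(unsigned vlen_bits, unsigned sew_bits,
                       unsigned lambda)
{
    switch (compute_mul_c(vlen_bits, sew_bits, lambda)) {
    case 4:  matmul_mul_c_4(); break;
    default: matmul_mul_c_1(); break;
    }
}
```

The same binary thus runs unmodified on narrow and wide VLEN implementations, which is the "no recompilation" design goal stated earlier in the document.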
