Commit b976f48

Integrated matrix extension (#26)
* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0
* Separate order-preserving vs transposing tile loads/stores
* Fixes based on feedback from initial IME TG internal review
1 parent 49d231e commit b976f48

File tree

1 file changed (+3, -3 lines changed)


src/integrated-matrix.adoc

Lines changed: 3 additions & 3 deletions
@@ -12,7 +12,7 @@ Dedicated matrix-multiply accelerators often require new register state—separa
 
 The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls, Zvvmttls) takes a different approach: it accelerates matrix multiplication using _only_ the 32 × VLEN architected vector registers already defined by the RISC-V "V" vector extension.
 By interpreting groups of existing vector registers as two-dimensional matrix tiles, the Zvvm family of Integrated Matrix extensions delivers high arithmetic density without introducing any new architected state.
-We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ), B (ν × λ) and C (μ × ν) are row-major matrix panels.
+We focus, in particular, on the computation of C ← A × B^T^ + C, where A (σ × λ), B (σ × λ) and C (σ × σ) are row-major matrix panels.
 
 The extensions are designed to support implementations spanning a wide range of microarchitectures and performance points: from small, embedded in-order cores targeting low-power and area-constrained applications, to large, high-performance out-of-order implementations targeting HPC and AI workloads.
 A key design goal is that the same binary executes correctly—and achieves near-peak arithmetic throughput—across this entire range without recompilation.
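The changed line above pins down the panel shapes: A and B are both σ × λ and C is σ × σ, all row-major. A minimal scalar sketch of the C ← A × Bᵀ + C update under those shapes (function and parameter names are illustrative, not taken from the specification):

```c
#include <stddef.h>

/* Scalar reference for C <- A * B^T + C with the shapes from the diff:
 * A is sigma x lambda, B is sigma x lambda, C is sigma x sigma,
 * all row-major. Because B is accessed by rows, B^T needs no explicit
 * transpose: element (j, k) of B supplies column j of B^T. */
static void panel_ref(size_t sigma, size_t lambda,
                      const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < sigma; i++)
        for (size_t j = 0; j < sigma; j++) {
            float acc = c[i * sigma + j];
            for (size_t k = 0; k < lambda; k++)
                acc += a[i * lambda + k] * b[j * lambda + k]; /* B^T */
            c[i * sigma + j] = acc;
        }
}
```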
@@ -1008,7 +1008,7 @@ Loads a 2D matrix tile from memory into the vector register group starting at `v
 Let _linesize_ = λ × LMUL.
 For each element index `i` in the body `[vstart, VL)` where the mask is enabled:
 
-VD[i] = M[rs1 + (SEW ÷ 8) × ((i / linesize) × LD + (i % linesizea))]
+VD[i] = M[rs1 + (SEW ÷ 8) × ((i / linesize) × LD + (i % linesize))]
 
 This instruction is the correct choice when A is stored in row-major order or when B is
 stored in column-major order: in both cases the memory layout consists of _linesize_-element
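The corrected address calculation can be cross-checked against a scalar reference model. The sketch below walks byte-addressable memory exactly as the formula prescribes; the names (`tile_load_ref`, `mem`, `sew_bytes`) are illustrative and masking is omitted:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference model of the corrected tile-load semantics:
 *   VD[i] = M[rs1 + (SEW/8) * ((i / linesize) * LD + (i % linesize))]
 * i.e. element i lands in row (i / linesize), column (i % linesize)
 * of a tile whose rows are LD elements apart in memory. */
static void tile_load_ref(const uint8_t *mem, size_t rs1,
                          uint8_t *vd, size_t sew_bytes,
                          size_t ld, size_t linesize,
                          size_t vstart, size_t vl)
{
    for (size_t i = vstart; i < vl; i++) {
        size_t row  = i / linesize;               /* tile row            */
        size_t col  = i % linesize;               /* offset within row   */
        size_t addr = rs1 + sew_bytes * (row * ld + col);
        for (size_t b = 0; b < sew_bytes; b++)    /* copy one element    */
            vd[i * sew_bytes + b] = mem[addr + b];
    }
}
```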
@@ -1636,7 +1636,7 @@ The type suffix in a long-form multiply-accumulate intrinsic name directly encod
 Either way, source code with IME intrinsics is tied to a specific combination of input/output types and value of MUL_C.
 
 Although it is possible to write more general assembly code, it is common industry practice to favor coding with compiler intrinsics.
-The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of C_MUL.
+The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of MUL_C.
 Runtime selection of the appropriate code path is then performed based on the result of `vsetvl` and computations of MUL_C = VLEN / (SEW × λ²).
 
 [#integrated-matrix-insns,reftext="Instructions (in alphabetical order)"]
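The multi-versioning pattern described in this hunk can be sketched as follows. All names here are hypothetical (the spec only prescribes the formula MUL_C = VLEN / (SEW × λ²)); in real code `vlen_bits` would be probed from the hardware, e.g. via a `vsetvl` result, rather than passed in:

```c
/* Sketch of runtime selection among code paths packaged in one
 * executable, each tuned for a specific MUL_C. Computes
 *   MUL_C = VLEN / (SEW * lambda^2)
 * once, then dispatches to the matching kernel. */
static unsigned compute_mul_c(unsigned vlen_bits, unsigned sew_bits,
                              unsigned lambda)
{
    return vlen_bits / (sew_bits * lambda * lambda);
}

/* Hypothetical kernels, one per supported MUL_C value. */
static void matmul_mul_c_1(void) { /* path tuned for MUL_C == 1 */ }
static void matmul_mul_c_4(void) { /* path tuned for MUL_C == 4 */ }

static void run_matmul(unsigned vlen_bits, unsigned sew_bits,
                       unsigned lambda)
{
    switch (compute_mul_c(vlen_bits, sew_bits, lambda)) {
    case 4:  matmul_mul_c_4(); break;
    default: matmul_mul_c_1(); break;
    }
}
```

The same binary thus runs unmodified on narrow and wide VLEN implementations, which is the "no recompilation" design goal stated earlier in the document.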
