* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0
* Separate order-preserving vs transposing tile loads/stores
* Fixes based on feedback from initial IME TG internal review
src/integrated-matrix.adoc (3 additions, 3 deletions)
@@ -12,7 +12,7 @@ Dedicated matrix-multiply accelerators often require new register state—separa
The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls, Zvvmttls) takes a different approach: it accelerates matrix multiplication using _only_ the 32 × VLEN architected vector registers already defined by the RISC-V "V" vector extension.
By interpreting groups of existing vector registers as two-dimensional matrix tiles, the Zvvm family of Integrated Matrix extensions delivers high arithmetic density without introducing any new architected state.
-We focus, in particular, on the computation of C ← A × B^T^ + C, where A (μ × λ), B (ν × λ) and C (μ × ν) are row-major matrix panels.
+We focus, in particular, on the computation of C ← A × B^T^ + C, where A (σ × λ), B (σ × λ) and C (σ × σ) are row-major matrix panels.
The extensions are designed to support implementations spanning a wide range of microarchitectures and performance points: from small, embedded in-order cores targeting low-power and area-constrained applications, to large, high-performance out-of-order implementations targeting HPC and AI workloads.
A key design goal is that the same binary executes correctly—and achieves near-peak arithmetic throughput—across this entire range without recompilation.
@@ -1008,7 +1008,7 @@ Loads a 2D matrix tile from memory into the vector register group starting at `v
Let _linesize_ = λ × LMUL.
For each element index `i` in the body `[vstart, VL)` where the mask is enabled:
This instruction is the correct choice when A is stored in row-major order or when B is
stored in column-major order: in both cases the memory layout consists of _linesize_-element
@@ -1636,7 +1636,7 @@ The type suffix in a long-form multiply-accumulate intrinsic name directly encod
Either way, source code with IME intrinsics is tied to a specific combination of input/output types and value of MUL_C.
Although it is possible to write more general assembly code, it is common industry practice to favor coding with compiler intrinsics.
-The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of C_MUL.
+The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of MUL_C.
Runtime selection of the appropriate code path is then performed based on the result of `vsetvl` and computation of MUL_C = VLEN / (SEW × λ²).
[#integrated-matrix-insns,reftext="Instructions (in alphabetical order)"]