Commit 8e56759

unprivileged/integrated-matrix: Add arithmetic considerations section
Specify when intermediate rounding is permitted during the K_eff-deep accumulation of a matrix multiply-accumulate instruction.

For widening instructions (W=2, W=4), define the sub-dot-product as the W products of (SEW/W)-bit elements within one SEW-wide slot and note that each individual product is exact at SEW precision.

For floating-point: the implementation partitions the λ×LMUL sub-dot-products into groups of G (power-of-two, 1 ≤ G ≤ λ), accumulates each group at ≥ 2×SEW internal precision, then rounds once and adds to C. G is implementation-defined, allowing both systolic (G=1) and outer-product (G ≈ λ) datapaths. Bit-exact reproducibility across implementations is explicitly not guaranteed.

For integer: modular (wrapping) arithmetic makes the result uniquely defined regardless of accumulation order.
1 parent 735ba1d commit 8e56759

1 file changed: +57 -0 lines changed

src/integrated-matrix.adoc

Lines changed: 57 additions & 0 deletions
@@ -160,6 +160,63 @@ For integer multiply-accumulate instructions, `altfmt_A` and `altfmt_B` select w
| 1 | Unsigned
|===

==== Arithmetic considerations

Each multiply-accumulate instruction computes, for every output element C[m, n]:

C[m, n] ← C[m, n] + Σ_{k=0}^{K_eff−1} A[m, k] × B[k, n]

where K_eff = λ × W × LMUL.
This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted.
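As a rough illustration of the formula above, the following Python sketch evaluates one output element in exact rational arithmetic, which gives an unrounded baseline against which rounded results can be compared. The function names `k_eff` and `mac_exact`, and the treatment of λ (`lam`) and LMUL as plain integers, are assumptions made for this example and are not part of the specification.

```python
from fractions import Fraction

def k_eff(lam: int, w: int, lmul: int) -> int:
    """Accumulation depth K_eff = lam * W * LMUL (argument names are illustrative)."""
    return lam * w * lmul

def mac_exact(c, a_row, b_col):
    """Return C[m, n] + sum_k A[m, k] * B[k, n], computed without any rounding."""
    if len(a_row) != len(b_col):
        raise ValueError("A row and B column must both have K_eff elements")
    acc = Fraction(c)
    for a, b in zip(a_row, b_col):
        acc += Fraction(a) * Fraction(b)
    return acc

depth = k_eff(lam=4, w=2, lmul=1)        # 8 product terms
a_row = [0.5] * depth
b_col = [2.0] * depth
result = mac_exact(1.0, a_row, b_col)    # 1 + 8 * (0.5 * 2.0)
```

Here `result` equals 9 exactly, because every term is a rational number and nothing is rounded.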

===== Sub-dot-products

For widening instructions (W = 2 or W = 4), the W narrow input-element pairs packed within one accumulator-width (SEW-bit) position form a natural computational unit called a _sub-dot-product_: the W multiplications of (SEW÷W)-bit elements are performed and their products summed.

Because each individual product of two (SEW÷W)-bit values is _exact_ at SEW precision (the significand of the product has at most 2 × p~input~ bits, which fits in the wider SEW format), only the summation of the W products can introduce rounding error when the sub-dot-product is computed at SEW or wider precision.

There are K_eff ÷ W = λ × LMUL sub-dot-products per output element.

For non-widening instructions (W = 1), each product of two SEW-bit values is exact at 2 × SEW bits; a sub-dot-product consists of a single product term.
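The exactness claim for widening products can be checked numerically. The sketch below assumes, purely as an example, SEW = 32 and W = 2 with IEEE 754 binary16 inputs and a binary32 accumulator; the helper names are hypothetical. It multiplies a deterministic sample of finite binary16 values and confirms that rounding each product to binary32 leaves it unchanged.

```python
import struct

def f16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def round_to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Deterministic sample of finite, non-negative binary16 values
# (bit patterns below 0x7C00 exclude infinities and NaNs).
samples = [f16_from_bits(b) for b in range(0, 0x7C00, 97)]

def products_exact_in_f32(values) -> bool:
    """True if every pairwise product is representable in binary32 unrounded."""
    for a in values:
        for b in values:
            exact = a * b  # exact in binary64: at most 22 significand bits
            if round_to_f32(exact) != exact:
                return False
    return True

all_exact = products_exact_in_f32(samples)
```

For this sample `all_exact` is true: an 11-bit significand times an 11-bit significand needs at most 22 bits, which fits in the 24-bit significand of binary32.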

===== Accumulation and rounding model (floating-point)

An implementation partitions the λ × LMUL sub-dot-products for each output element into consecutive groups of G sub-dot-products.

* Within each group, the G partial results are accumulated at an internal precision of at least 2 × SEW bits, with no rounding to SEW precision inside the group.

* After each group, the accumulated partial sum is rounded to C's precision (SEW) using an _implementation-defined_ rounding mode and _added to the running value of C[m, n]_.
The rounding mode used for these intermediate additions is not required to match `frm`.

The value of G is _implementation-defined_ and may depend on SEW, W, λ, LMUL, and the microarchitecture.
It must satisfy:

* G is a power of two;
* 1 ≤ G ≤ λ.

The resulting number of rounding additions to C per output element is (λ × LMUL) ÷ G.

The final accumulation step, which adds the fully reduced dot-product to the original value of C[m, n], uses the dynamic rounding mode from `frm`.
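The grouping model above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not a description of any real datapath: SEW is taken to be 32, binary64 stands in for the wider internal precision, every rounding uses round-to-nearest-even (the specification leaves the intermediate mode implementation-defined), and the function `grouped_accumulate` and its parameters are hypothetical.

```python
import struct

def round_to_f32(x: float) -> float:
    """Round a binary64 value to binary32 (round-to-nearest-even, assumed mode)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def grouped_accumulate(c: float, sub_dots, g: int, lam: int):
    """Accumulate sub-dot-products into c in consecutive groups of g.

    Returns (new_c, rounding_additions).
    """
    if g < 1 or g > lam or g & (g - 1) != 0:
        raise ValueError("G must be a power of two with 1 <= G <= lam")
    if len(sub_dots) % g != 0:
        raise ValueError("this sketch assumes G divides the sub-dot-product count")
    additions = 0
    for start in range(0, len(sub_dots), g):
        partial = 0.0
        for s in sub_dots[start:start + g]:
            partial += s                   # kept in binary64, not rounded to binary32
        partial = round_to_f32(partial)    # one rounding per group
        c = round_to_f32(c + partial)      # rounding addition into C
        additions += 1
    return c, additions

# lam = 4 and LMUL = 1 give four sub-dot-products per output element.
sub_dots = [1.0, 2.0 ** -25, 2.0 ** -25, 2.0 ** -25]
c_g1, n_g1 = grouped_accumulate(0.0, sub_dots, g=1, lam=4)
c_g4, n_g4 = grouped_accumulate(0.0, sub_dots, g=4, lam=4)
```

With these inputs, G = 1 performs four rounding additions and yields 1.0, because each 2^-25^ term is below half a unit in the last place of 1.0 and is lost. G = 4 performs one rounding addition and yields 1 + 2^-23^, because the three small terms are summed before rounding. The two groupings therefore differ in the least-significant bit.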

[NOTE]
====
In a *systolic-array* datapath, G is typically 1: each sub-dot-product is rounded and added to C immediately, yielding λ × LMUL rounding additions per output element.

In an *outer-product* datapath, G is typically on the order of λ (e.g. λ, λ÷2, λ÷4, or λ÷8): multiple sub-dot-products are accumulated at extended internal precision before a single rounding and C addition, significantly reducing the number of expensive full-precision additions.

Software must not depend on a particular value of G.
====

Because G and the intermediate rounding mode are implementation-defined, two conforming implementations may produce floating-point results that differ in the least-significant bits for identical inputs.
Bit-exact reproducibility of floating-point matrix multiply-accumulate results across different implementations is therefore _not_ guaranteed.

Floating-point exception flags (inexact, overflow, underflow, invalid, etc.) are accumulated into `fflags`; the order in which individual exceptions are raised within a single instruction execution is implementation-defined.

===== Integer accumulation

For integer multiply-accumulate instructions, all intermediate results are reduced modulo 2^SEW^.
Because modular addition is both associative and commutative, the final result is uniquely defined regardless of the accumulation order or grouping factor.
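The order independence of wrapping accumulation can be demonstrated with a short Python sketch. SEW = 8, the seed, and the helper `wrapping_mac` are all choices made for this example; results are shown as unsigned SEW-bit patterns.

```python
import random

def wrapping_mac(c: int, products, sew: int) -> int:
    """Add products to c, reducing modulo 2**sew after every step."""
    mask = (1 << sew) - 1
    acc = c & mask
    for p in products:
        acc = (acc + p) & mask
    return acc

rng = random.Random(0)   # fixed seed so the run is repeatable
sew = 8
products = [rng.randrange(-(1 << 15), 1 << 15) for _ in range(16)]

reference = wrapping_mac(5, products, sew)

# Re-run the accumulation in 100 random orders and compare bit patterns.
shuffled = list(products)
order_independent = True
for _ in range(100):
    rng.shuffle(shuffled)
    if wrapping_mac(5, shuffled, sew) != reference:
        order_independent = False
```

Every ordering gives the same bit pattern, and that pattern also equals the full-precision sum reduced once at the end, `(5 + sum(products)) % 2**sew`.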

[#zvvmm,reftext=Matrix-multiplication instructions (integer)]
=== Zvvmm: Extension for matrix-multiplication on vector-registers interpreted as 2D integer matrix-tiles
