Commit 0a58e11

unprivileged/integrated-matrix: Add arithmetic considerations section
Specify when intermediate rounding is permitted during the K_eff-deep accumulation of a matrix multiply-accumulate instruction. For widening instructions (W=2, W=4), define the sub-dot-product as the W products of (SEW/W)-bit elements within one SEW-wide slot and note that each individual product is exact at SEW precision. For floating-point: the implementation partitions the λ×LMUL sub-dot-products into groups of G (power-of-two, 1 ≤ G ≤ λ), accumulates each group at ≥ 2×SEW internal precision, then rounds once and adds to C. G is implementation-defined, allowing both systolic (G=1) and outer-product (G ≈ λ) datapaths. Bit-exact reproducibility across implementations is explicitly not guaranteed. For integer: modular (wrapping) arithmetic makes the result uniquely defined regardless of accumulation order.
1 parent 6e4baa0 commit 0a58e11

1 file changed: +57 -0 lines changed

src/integrated-matrix.adoc

Lines changed: 57 additions & 0 deletions
@@ -165,6 +165,63 @@ For integer multiply-accumulate instructions, `altfmt_A` and `altfmt_B` select w

=== Storage formats

==== Arithmetic considerations

Each multiply-accumulate instruction computes, for every output element C[m, n]:

C[m, n] ← C[m, n] + Σ_{k=0}^{K_eff−1} A[m, k] × B[k, n]

where K_eff = λ × W × LMUL.
This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted.
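
As a baseline for the grouping discussion, the update rule above can be sketched as an exact reference model with no intermediate rounding at all. This is an illustrative sketch, not part of the specification: the helper names `k_eff` and `mac_reference` are hypothetical, matrices are plain nested lists, and λ, W, and LMUL are assumed to be positive integers.

```python
from fractions import Fraction

def k_eff(lam: int, w: int, lmul: int) -> int:
    """Accumulation depth K_eff = lambda * W * LMUL (integer LMUL assumed)."""
    return lam * w * lmul

def mac_reference(C, A, B):
    """Return C + A*B computed exactly with rationals (no rounding anywhere).

    A is M x K, B is K x N, C is M x N, all given as nested lists.
    """
    M, K, N = len(A), len(B), len(B[0])
    return [
        [
            Fraction(C[m][n])
            + sum(Fraction(A[m][k]) * Fraction(B[k][n]) for k in range(K))
            for n in range(N)
        ]
        for m in range(M)
    ]

# Hypothetical parameters: lambda = 2, W = 2, LMUL = 1 give K_eff = 4.
K = k_eff(2, 2, 1)
A = [[1, 2, 3, 4]]            # 1 x K_eff
B = [[1], [1], [1], [1]]      # K_eff x 1
result = mac_reference([[1]], A, B)
```

Any conforming floating-point result can be compared against this exact value to see how much error a particular grouping introduces.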

===== Sub-dot-products

For widening instructions (W = 2 or W = 4), the W narrow input-element pairs packed within one accumulator-width (SEW-bit) position form a natural computational unit called a _sub-dot-product_: the W multiplications of (SEW÷W)-bit elements are performed and their products summed.

Because each individual product of two (SEW÷W)-bit values is _exact_ at SEW precision (the significand of the product has at most 2 × p~input~ bits, which fits in the wider SEW format), any rounding in a sub-dot-product computed at SEW or wider precision arises only when the W products are summed.

There are K_eff ÷ W = λ × LMUL sub-dot-products per output element.

For non-widening instructions (W = 1), each product of two SEW-bit values is exact at 2 × SEW bits; a sub-dot-product consists of a single product term.
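
The exactness claims above can be spot-checked in software. The sketch below assumes, purely for illustration, a W = 2 case with IEEE binary16 inputs and a binary32 accumulator (SEW = 32); the helper names are hypothetical. It relies on Python floats being binary64, which holds every such product exactly, and tests whether rounding the product to binary32 changes it.

```python
import random
import struct

def f16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def round_to_f32(x: float) -> float:
    """Round a binary64 value to binary32 and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def product_exact_in_f32(a: float, b: float) -> bool:
    """True if a*b survives rounding to binary32 unchanged.

    The binary64 product is exact for the inputs used here (at most
    48 significand bits), so the comparison isolates the binary32 rounding.
    """
    prod = a * b
    return round_to_f32(prod) == prod

def random_finite_f16(rng: random.Random) -> float:
    """Draw a random finite binary16 value (skips infinities and NaNs)."""
    while True:
        bits = rng.getrandbits(16)
        if bits & 0x7C00 != 0x7C00:
            return f16_from_bits(bits)

rng = random.Random(0)
all_exact = all(
    product_exact_in_f32(random_finite_f16(rng), random_finite_f16(rng))
    for _ in range(10_000)
)

# Non-widening contrast: two binary32 inputs need more than binary32.
x = 1.0 + 2.0 ** -23                      # exactly representable in binary32
same_width_exact = product_exact_in_f32(x, x)
```

In this sketch `all_exact` comes out true while `same_width_exact` is false, matching the statement that a W = 1 product needs 2 × SEW bits to be exact.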

===== Accumulation and rounding model (floating-point)

An implementation partitions the λ × LMUL sub-dot-products for each output element into consecutive groups of G sub-dot-products.

* Within each group, the G partial results are accumulated at an internal precision wide enough that no rounding to SEW precision occurs inside the group.

* After each group, the accumulated partial sum is rounded to C's precision (SEW) using an _implementation-defined_ rounding mode and _added to the running value of C[m, n]_.
The rounding mode used for these intermediate additions is not required to match `frm`.

The value of G is _implementation-defined_ and may depend on SEW, W, λ, LMUL, and the microarchitecture.
It must satisfy:

* G is a power of two;
* 1 ≤ G ≤ λ.

The resulting number of rounding additions to C per output element is (λ × LMUL) ÷ G.

The final accumulation step—adding the fully reduced dot-product to the original value of C[m, n]—uses the dynamic rounding mode from `frm`.

[NOTE]
====
In a *systolic-array* datapath, G is typically 1: each sub-dot-product is rounded and added to C immediately, yielding λ × LMUL rounding additions per output element.

In an *outer-product* datapath, G is typically on the order of λ (e.g. λ, λ÷2, λ÷4, or λ÷8): multiple sub-dot-products are accumulated at extended internal precision before a single rounding and C addition, significantly reducing the number of expensive full-precision additions.

Software must not depend on a particular value of G.
====

Because G and the intermediate rounding mode are implementation-defined, two conforming implementations may produce floating-point results that differ in the least-significant bits for identical inputs.
Bit-exact reproducibility of floating-point matrix multiply-accumulate results across different implementations is therefore _not_ guaranteed.
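
The dependence on G can be made concrete with a small software model. The sketch below is an assumption-laden illustration, not the normative behaviour: it models each group sum exactly with rationals, rounds to a 24-bit significand (binary32-like, exponent range ignored) with round-to-nearest-even at every step, and uses hypothetical sub-dot-product values with λ = 4 and LMUL = 1. The names `round_sig` and `grouped_mac` are invented for this example.

```python
from fractions import Fraction

def round_sig(x: Fraction, p: int) -> Fraction:
    """Round x to p significant bits, ties to even (unbounded exponent)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    a = abs(x)
    # Find e with 2**e <= a < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    # round() on a Fraction rounds half to even and returns an int.
    return sign * round(a / ulp) * ulp

def grouped_mac(c, sdps, G: int, p: int = 24) -> Fraction:
    """Accumulate sub-dot-products into c in consecutive groups of G."""
    assert G >= 1 and G & (G - 1) == 0, "G must be a power of two"
    acc = Fraction(c)
    for i in range(0, len(sdps), G):
        partial = sum(sdps[i:i + G], Fraction(0))   # exact inside the group
        partial = round_sig(partial, p)             # round to C's precision
        acc = round_sig(acc + partial, p)           # rounding addition to C
    return acc

# Hypothetical sub-dot-product values for one output element.
sdps = [2**24, 1, 1, -(2**24)]
results = {G: grouped_mac(0, sdps, G) for G in (1, 2, 4)}
```

Under these assumptions the three groupings give 0 (G = 1), 1 (G = 2), and 2 (G = 4) for the same inputs; only G = 4 matches the exact sum, and all three would be conforming under the model described above.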

Floating-point exception flags (inexact, overflow, underflow, invalid, etc.) are accumulated into `fflags`; the order in which individual exceptions are raised within a single instruction execution is implementation-defined.

===== Integer accumulation

For integer multiply-accumulate instructions, all intermediate results are reduced modulo 2^SEW^.
Because modular addition is both associative and commutative, the final result is uniquely defined regardless of the accumulation order or grouping factor.
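
The order-independence of wrapping accumulation is easy to confirm in software. The following sketch assumes SEW = 32 and uses randomly generated signed products as stand-ins for real operands; `int_mac_wrapping` is a hypothetical helper, and results are shown in their unsigned representation modulo 2^32.

```python
import random

def int_mac_wrapping(c: int, products, sew: int) -> int:
    """Add products to c one at a time, reducing modulo 2**sew at each step."""
    mask = (1 << sew) - 1
    acc = c & mask
    for p in products:
        acc = (acc + p) & mask   # & on a negative int yields the wrapped value
    return acc

rng = random.Random(1)
# Hypothetical signed 16-bit x 16-bit products.
products = [
    rng.randrange(-(1 << 15), 1 << 15) * rng.randrange(-(1 << 15), 1 << 15)
    for _ in range(16)
]
shuffled = products[:]
rng.shuffle(shuffled)

in_order = int_mac_wrapping(5, products, 32)
reordered = int_mac_wrapping(5, shuffled, 32)
# Reducing only once at the end gives the same residue.
one_shot = (5 + sum(products)) & 0xFFFFFFFF
```

All three values agree, whatever order or grouping is used, which is why no grouping factor needs to be specified for the integer case.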

[#zvvmm,reftext=Matrix-multiplication instructions (integer)]
=== Zvvmm: Extension for matrix-multiplication on vector-registers interpreted as 2D integer matrix-tiles
