Specify when intermediate rounding is permitted during the K_eff-deep
accumulation of a matrix multiply-accumulate instruction.
For widening instructions (W=2, W=4), define the sub-dot-product as
the W products of (SEW/W)-bit elements within one SEW-wide slot and
note that each individual product is exact at SEW precision.
For floating-point: the implementation partitions the λ×LMUL
sub-dot-products into groups of G (power-of-two, 1 ≤ G ≤ λ),
accumulates each group at ≥ 2×SEW internal precision, then rounds
once and adds to C. G is implementation-defined, allowing both
systolic (G=1) and outer-product (G ≈ λ) datapaths. Bit-exact
reproducibility across implementations is explicitly not guaranteed.
For integer: modular (wrapping) arithmetic makes the result uniquely
defined regardless of accumulation order.
This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted.
===== Sub-dot-products
For widening instructions (W = 2 or W = 4), the W narrow input-element pairs packed within one accumulator-width (SEW-bit) position form a natural computational unit called a _sub-dot-product_: the W multiplications of (SEW÷W)-bit elements are performed and their products summed.
Because each individual product of two (SEW÷W)-bit values is _exact_ at SEW precision (the significand of the product has at most 2 × p~input~ bits, which fits in the wider SEW format), the sub-dot-product can be computed at SEW or wider precision with rounding confined to the summation of the W exact products.
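
This exactness is easy to check empirically. The sketch below is illustrative only: it uses Python's binary16 support in `struct` to stand in for (SEW÷W)-bit inputs, with double precision as the wider format, and confirms that the product of two binary16 values, recomputed in exact rational arithmetic, matches the double-precision product bit for bit.

```python
import struct
from fractions import Fraction

def fp16(x):
    # Round to the nearest IEEE-754 binary16 value.
    return struct.unpack('<e', struct.pack('<e', x))[0]

# A binary16 significand has p = 11 bits, so a product needs at most
# 22 bits and is exactly representable in double (53 bits) -- and in
# binary32 (24 bits), the natural SEW format for binary16 inputs.
for x, y in [(0.1, 3.7), (1 / 3, 65504.0), (6.1e-5, 2.5)]:
    a, b = fp16(x), fp16(y)
    # Exact rational arithmetic agrees with the double product.
    assert Fraction(a) * Fraction(b) == Fraction(a * b)
```
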
There are K_eff ÷ W = λ × LMUL sub-dot-products per output element.
For non-widening instructions (W = 1), each product of two SEW-bit values is exact at 2 × SEW bits; a sub-dot-product consists of a single product term.
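
As a concrete, non-normative illustration, consider a widening integer instruction with SEW = 16 and W = 2: each sub-dot-product sums the two products of signed 8-bit elements that share one 16-bit accumulator slot, and each individual product fits exactly in 16 bits. The helper below is a hypothetical sketch, not part of the specification:

```python
def sub_dot_product(a_pair, b_pair):
    # One W = 2 sub-dot-product: two products of (SEW/W)-bit elements
    # summed within a single SEW-wide slot. With signed 8-bit inputs,
    # each product is exact at 16 bits (|p| <= 128 * 128 = 16384).
    (a0, a1), (b0, b1) = a_pair, b_pair
    return a0 * b0 + a1 * b1

# A row of K_eff = 4 8-bit elements yields K_eff / W = 2 sub-dot-products.
a = [3, -5, 7, 2]
b = [10, 4, -1, 8]
sdps = [sub_dot_product((a[i], a[i + 1]), (b[i], b[i + 1])) for i in (0, 2)]
# sdps == [3*10 + (-5)*4, 7*(-1) + 2*8] == [10, 9]
```
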
===== Accumulation and rounding model (floating-point)
An implementation partitions the λ × LMUL sub-dot-products for each output element into consecutive groups of G sub-dot-products.
* Within each group, the G partial results are accumulated at wide internal precision (at least 2 × SEW), so that no rounding to SEW precision occurs inside a group.
* After each group, the accumulated partial sum is rounded to C's precision (SEW) using an _implementation-defined_ rounding mode and _added to the running value of C[m, n]_.
The rounding mode used for these intermediate additions is not required to match `frm`.
The value of G is _implementation-defined_ and may depend on SEW, W, λ, LMUL, and the microarchitecture.
It must satisfy:
* G is a power of two;
* 1 ≤ G ≤ λ.
The resulting number of rounding additions to C per output element is (λ × LMUL) ÷ G.
The final accumulation step—adding the fully reduced dot-product to the original value of C[m, n]—uses the dynamic rounding mode from `frm`.
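
A minimal, non-normative reference model of this scheme might look as follows. The choices of binary16 as the SEW-precision format for C, Python doubles as the wide internal precision, and round-to-nearest-even as the implementation-defined intermediate rounding mode are all illustrative assumptions. It also makes concrete why two conforming values of G can yield different bits:

```python
import struct

def fp16(x):
    # Round a Python double to the nearest binary16 value; binary16
    # stands in for the SEW-precision accumulator format here, and
    # round-to-nearest-even for the implementation-defined rounding mode.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def grouped_accumulate(c, sub_dots, G):
    # Sum each group of G sub-dot-product values at wide (double)
    # internal precision, round the group sum once to SEW precision,
    # then add it to the running accumulator C at SEW precision.
    for i in range(0, len(sub_dots), G):
        group_sum = sum(sub_dots[i:i + G])  # no rounding inside a group
        c = fp16(c + fp16(group_sum))       # one rounding, then C update
    return c

# Four sub-dot-product values whose accumulation order is visible in fp16.
sds = [1.0, 2.0**-11, 2.0**-11, 2.0**-11]
systolic = grouped_accumulate(0.0, sds, G=1)  # round after every term -> 1.0
outer = grouped_accumulate(0.0, sds, G=4)     # one wide group -> 1.001953125
```

With G = 1, each tiny term is a half-ulp tie against 1.0 and rounds away; with G = 4, the terms survive in the wide internal sum. Both results are conforming.
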
[NOTE]
====
In a *systolic-array* datapath, G is typically 1: each sub-dot-product is rounded and added to C immediately, yielding λ × LMUL rounding additions per output element.
In an *outer-product* datapath, G is typically on the order of λ (e.g. λ, λ÷2, λ÷4, or λ÷8): multiple sub-dot-products are accumulated at extended internal precision before a single rounding and C addition, significantly reducing the number of expensive full-precision additions.
Software must not depend on a particular value of G.
213
+
====
Because G and the intermediate rounding mode are implementation-defined, two conforming implementations may produce floating-point results that differ in the least-significant bits for identical inputs.
Bit-exact reproducibility of floating-point matrix multiply-accumulate results across different implementations is therefore _not_ guaranteed.
Floating-point exception flags (inexact, overflow, underflow, invalid operation, etc.) are accumulated into `fflags`; the order in which individual exceptions are raised within a single instruction execution is implementation-defined.
===== Integer accumulation
For integer multiply-accumulate instructions, all intermediate results are reduced modulo 2^SEW^.
Because modular addition is both associative and commutative, the final result is uniquely defined regardless of the accumulation order or grouping factor.
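
A small non-normative sketch with SEW = 16 makes this order-independence concrete (the function name and values are illustrative):

```python
MASK = (1 << 16) - 1  # SEW = 16 in this sketch

def int_accumulate(c, products, G):
    # Accumulate product terms into C modulo 2**SEW in groups of G.
    # Associativity and commutativity of modular addition make the
    # result independent of the grouping factor.
    for i in range(0, len(products), G):
        c = (c + sum(products[i:i + G])) & MASK
    return c

products = [40000, 65535, 12345, 54321]
# Any G gives the same wrapped result: 172201 mod 65536 == 41129.
assert int_accumulate(0, products, 1) == int_accumulate(0, products, 4) == 41129
```
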