unprivileged/integrated-matrix: Address comments from 2nd IME TG internal review
- Clarify that W couples the arithmetic widening ratio and the
packing of input elements within each SEW-wide storage position;
document the non-widening (W=1) and widening (W>1) cases explicitly.
- MUL_C: add the requirement that lambda^2 must divide VLEN/SEW, and
that the C register group index must be a multiple of MUL_C.
- vtype: note that `altfmt_A`, `altfmt_B`, and `bs` sit outside both
the `vsetvli` and `vsetivli` immediate fields.
- Normalize terminology: use "non-widening" / "widening" consistently
in preference to "non-packing" / "double-packing" / etc., and refer
to narrow inputs as packed into SEW-wide storage positions.
- Remove the incorrect restriction that forbade mixed OFP8 inputs
(E4M3 x E5M2) on vfmmacc.vv: the encoding map already permits all
four OFP8 combinations, and the justification in the NOTE was
mathematically inconsistent with the same-format case.
- SAIL: change `let LD` to `var LD` in the four tile load/store
operations, since LD is reassigned within the body.
src/integrated-matrix.adoc
36 additions & 21 deletions
@@ -31,26 +31,42 @@ The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls, Zvvmttl
 The geometry of the multiplier and the tiles is defined by the new parameter `lambda` (λ) which is encoded in 3 bits in the `vtype` CSR, and vector operation parameters like the widening `W` of the multiplication encoded in the instruction, `LMUL`, `SEW` and `VLEN`.

+The Zvvm family uses the parameter `W` to describe both the arithmetic widening ratio
+and the corresponding packing of input elements within each `SEW`-wide storage position.
+
+When `W = 1`, the instruction is _non-widening_: each input element has width `SEW`,
+no packing occurs, and the accumulator C also has width `SEW`.
+
+When `W > 1`, the instruction is _widening_: each input element has width `SEW/W`,
+and `W` such narrow input elements are packed into each `SEW`-wide storage position.
+The accumulator C still has width `SEW`.
+
+Thus, "packing" refers to the storage layout of the inputs in the vector registers,
+whereas "widening" refers to the arithmetic relationship between the input element width
+and the accumulator element width. In this specification, these two views are coupled by
+the same parameter `W`.
+
 Matrix tiles are represented using the existing RISC-V V register file and its configuration state.
 The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are stored as follows:

 * The _accumulator_ C is stored in a vector register group with element width SEW.
 Its register group multiplier MUL_C is determined by the tile geometry:
 MUL_C = (VLEN / SEW) / λ^2^, where λ is the tile-layout parameter decoded from the `lambda[2:0]` field in `vtype`.
 The C register group may start at any vector register index that is MUL_C-aligned.
+λ^2^ must divide (VLEN / SEW) such that
 MUL_C ∈ {1, 2, 4, 8, 16}.
 If MUL_C = 16, the only allowed vector register indices are 0 and 16.
+In general, the vector register index for the C register group must be a multiple of MUL_C.

 * The _input matrices_ A and B are stored in vector register groups with element width determined by the instruction:
-equal to SEW for non-packing variants, SEW/2 for double-packing, SEW/4 for quad-packing, and SEW/8 for octo-packing variants.
-The effective K dimension of the multiply (K_eff) equals λ × W × LMUL, where W is 1 for non-packing,
-2 for double-packing, 4 for quad-packing, and 8 for octo-packing instructions.
+equal to SEW for non-widening variants (`W=1`), and equal to SEW/2, SEW/4, or SEW/8 for widening variants (`W=2`, `W=4`, or `W=8`), with the narrower input elements packed within each `SEW`-wide storage position.
+The effective K dimension of the multiply (K_eff) equals λ × W × LMUL, where W is 1 for non-widening instructions, 2 for 2× widening instructions, 4 for 4× widening instructions, and 8 for 8× widening instructions.
 LMUL scales the A and B register groups along the K dimension only and does not affect C.
 Only integer values of LMUL are supported by the Zvvm family of Integrated Matrix extensions: LMUL ∈ {1, 2, 4, 8}.
 Fractional LMUL settings (LMUL < 1) are reserved and shall raise an illegal-instruction exception when used with any IME instruction.

 [#ime-geometry-fig]
-.Geometry of matrix tiles and element ordering for 32 element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, C having the same SEW. (b) Widening case with A and B having half the SEW of C (double-packing).
+.Geometry of matrix tiles and element ordering for 32 element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, and C having the same SEW. (b) Widening case with A and B having half the SEW of C, with two narrow input elements packed into each SEW-wide storage position.
 image::png/ime-geometry.png[width=100%, align=center, alt="Diagram of matrix tile geometry and multiplier configuration parameters."]

 The K-dimension of the multiplication (shared inner dimension of A and B^T^) is determined by λ from `vtype`, scaled by a per-instruction widening factor W and further multiplied by LMUL:
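The accumulator sizing, register alignment, and K_eff rules stated in the hunk above can be sanity-checked with a short Python sketch. The helper names (`mul_c`, `valid_c_bases`, `k_eff`) are hypothetical and not part of the specification; the count of 32 vector registers is assumed from the base V extension.

```python
# Hypothetical helpers illustrating the rules quoted in the hunk above.

def mul_c(vlen, sew, lam):
    """Accumulator group multiplier: MUL_C = (VLEN / SEW) / lambda^2."""
    if vlen % sew != 0:
        raise ValueError("SEW must divide VLEN")
    elems = vlen // sew
    if elems % (lam * lam) != 0:
        raise ValueError("lambda^2 must divide VLEN / SEW")
    m = elems // (lam * lam)
    if m not in (1, 2, 4, 8, 16):
        raise ValueError("MUL_C must be one of 1, 2, 4, 8, 16")
    return m

def valid_c_bases(m, num_vregs=32):
    """Register indices that are multiples of MUL_C (32 registers assumed)."""
    return list(range(0, num_vregs, m))

def k_eff(lam, w, lmul):
    """Effective K dimension: lambda * W * LMUL."""
    if w not in (1, 2, 4, 8):
        raise ValueError("W must be 1, 2, 4, or 8")
    if lmul not in (1, 2, 4, 8):
        raise ValueError("only integer LMUL in {1, 2, 4, 8} is supported")
    return lam * w * lmul

if __name__ == "__main__":
    # 32 elements per register with lambda = 4, as in the geometry figure.
    m = mul_c(vlen=1024, sew=32, lam=4)
    print(m, valid_c_bases(m)[:4], k_eff(lam=4, w=2, lmul=1))
```

With 32 elements per register and λ=4 (the configuration used in the geometry figure), this yields MUL_C = 2, and a 2× widening instruction at LMUL=1 gives K_eff = 8.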
@@ -312,7 +328,7 @@ Their meaning depends on whether the instruction is a floating-point or integer
 The `altfmt_A` and `altfmt_B` bits are located at `vtype[XLEN-5]` and
 `vtype[XLEN-6]`, immediately below the `lambda[2:0]` field.
-These positions are outside the `vsetvli` immediate field and must be
+These positions are outside the `vsetvli` or `vsetivli` immediate field and must be
 configured via `vsetvl` (with the full `vtype` value in a register) or
 `vsetivli`.

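To illustrate why these bits cannot be reached through an immediate, the following Python sketch composes a `vtype` value for use with `vsetvl`. It is a sketch under stated assumptions: `ime_vtype` and `fits_in_immediate` are hypothetical helpers, XLEN is assumed to be 64, the placement of `lambda[2:0]` at `vtype[XLEN-2:XLEN-4]` is inferred from "immediately below the `lambda[2:0]` field", the `bs` position at `vtype[XLEN-7]` is taken from the next hunk header, and the immediate widths (11 bits for `vsetvli`, 10 bits for `vsetivli`) are assumed from the base V extension encoding.

```python
XLEN = 64  # assumption: RV64

def ime_vtype(base_vtype, lam_field, altfmt_a, altfmt_b, bs=0, xlen=XLEN):
    """Compose a full vtype value with the high-order IME fields set.

    base_vtype holds the standard low-order fields (vlmul, vsew, vta, vma).
    lam_field is the raw 3-bit lambda[2:0] encoding, not the decoded lambda.
    """
    if not 0 <= lam_field <= 7:
        raise ValueError("lambda[2:0] is a 3-bit field")
    for bit in (altfmt_a, altfmt_b, bs):
        if bit not in (0, 1):
            raise ValueError("altfmt_A, altfmt_B, and bs are single bits")
    value = base_vtype
    value |= lam_field << (xlen - 4)  # inferred: vtype[XLEN-2:XLEN-4]
    value |= altfmt_a << (xlen - 5)   # vtype[XLEN-5]
    value |= altfmt_b << (xlen - 6)   # vtype[XLEN-6]
    value |= bs << (xlen - 7)         # vtype[XLEN-7]
    return value

def fits_in_immediate(vtype_value, imm_bits):
    """True if the value is representable in an imm_bits-wide zero-extended immediate."""
    return 0 <= vtype_value < (1 << imm_bits)

if __name__ == "__main__":
    base = 0b010 << 3  # vsew = 010 (SEW=32), vlmul = 000 (LMUL=1)
    v = ime_vtype(base, lam_field=0, altfmt_a=1, altfmt_b=0)
    print(hex(v), fits_in_immediate(v, 11), fits_in_immediate(v, 10))
```

Any value with one of these bits set exceeds both immediate widths, so under these assumptions it has to be placed in a register and written with `vsetvl`.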
@@ -370,19 +386,18 @@ The `bs` bit is located at `vtype[XLEN-7]`, immediately below `altfmt_B`.
 Like `altfmt_A` and `altfmt_B`, this position is outside the `vsetvli`
 immediate field.

-
 === Storage formats

 ==== Element packing in input tiles

-For non-widening multiply-accumulate instructions (W=1), the input elements
-A and B have the same width as the accumulator (SEW) and no packing occurs.
+For non-widening multiply-accumulate instructions (`W=1`), the input elements
+A and B have the same width as the accumulator (`SEW`), and no packing occurs.

-For widening instructions (W=2 or W=4), each input register group holds
-elements at the effective input element width EEW = SEW ÷ W. Because
-tile load instructions always transfer data at SEW granularity, every loaded
-SEW-bit position contains W contiguous narrow elements that the
-multiply-accumulate instruction consumes as a sub-dot-product.
+For widening multiply-accumulate instructions (`W>1`), each input register group holds
+elements at the effective input element width `EEW = SEW ÷ W`. Because tile load
+instructions always transfer data at `SEW` granularity, every loaded `SEW`-bit storage
+position contains `W` contiguous packed narrow elements that the multiply-accumulate
+instruction consumes as a sub-dot-product.

 [#ime-tile-widening-fig]
 .Element distribution and tile geometry example for L=32, SEW wide elements (left), two SEW/2 wide elements packed per SEW (middle), and four SEW/4 wide elements per SEW (right). Packing/widening by W increases the effective K dimension of the tile by a factor of W.
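The packing described in this hunk can be sketched in Python as follows. This is an illustration under assumptions the hunk does not state: the helper names are hypothetical, narrow elements are treated as unsigned integers, and element 0 is assumed to occupy the least-significant bits of each `SEW`-wide storage position.

```python
def pack(elems, sew, w):
    """Pack W narrow elements (EEW = SEW / W bits each) into one SEW-wide position."""
    if len(elems) != w:
        raise ValueError("expected exactly W narrow elements")
    eew = sew // w
    mask = (1 << eew) - 1
    word = 0
    for i, e in enumerate(elems):
        # Assumed order: element i sits at bit offset i * EEW.
        word |= (e & mask) << (i * eew)
    return word

def unpack(word, sew, w):
    """Recover the W narrow elements from one SEW-wide storage position."""
    eew = sew // w
    mask = (1 << eew) - 1
    return [(word >> (i * eew)) & mask for i in range(w)]

def sub_dot(a_word, b_word, sew, w):
    """Sub-dot-product of two packed positions, accumulated at full width."""
    a = unpack(a_word, sew, w)
    b = unpack(b_word, sew, w)
    return sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    a = pack([1, 2, 3, 4], sew=32, w=4)
    b = pack([5, 6, 7, 8], sew=32, w=4)
    print(hex(a), sub_dot(a, b, sew=32, w=4))
```

Each pair of packed positions contributes W products to the accumulator, which is why the effective K dimension grows by a factor of W.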
@@ -557,10 +572,10 @@ Likewise, for 8-bit inputs, `altfmt_A` and `altfmt_B` independently select
 E4M3 or E5M2. All four combinations are covered by the same OFP8
 subextension for the given output format.

-NOTE: Mixed OFP8 inputs (E4M3 × E5M2) are only permitted with widening
-instructions (`vfwmmacc.vv`, `vfqmmacc.vv`), not with `vfmmacc.vv`
-(OFP8 → OFP8), because the exact product (up to 7 significand bits) exceeds
-the OFP8 output precision (p ≤ 4).
+// NOTE: Mixed OFP8 inputs (E4M3 × E5M2) are only permitted with widening
+// instructions (`vfwmmacc.vv`, `vfqmmacc.vv`), not with `vfmmacc.vv`
+// (OFP8 → OFP8), because the exact product (up to 7 significand bits) exceeds
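One way to read the commit message's remark that the removed NOTE was inconsistent with the same-format case is to count exact-product significand bits for every OFP8 pairing. The Python sketch below does this under the assumption that E4M3 and E5M2 have 4 and 3 significand bits respectively (including the implicit leading bit); the names are hypothetical.

```python
# Assumed significand precision, including the implicit leading bit.
OFP8_PRECISION = {"E4M3": 4, "E5M2": 3}

def exact_product_bits(fmt_a, fmt_b):
    """Bits needed to hold the exact product of two maximal significands."""
    max_a = (1 << OFP8_PRECISION[fmt_a]) - 1
    max_b = (1 << OFP8_PRECISION[fmt_b]) - 1
    return (max_a * max_b).bit_length()

if __name__ == "__main__":
    for a in OFP8_PRECISION:
        for b in OFP8_PRECISION:
            print(a, "x", b, "->", exact_product_bits(a, b), "bits")
```

Under this count the mixed pairing needs 7 bits, matching the figure in the NOTE, while E4M3 × E4M3 needs 8 and E5M2 × E5M2 needs 6. Every pairing therefore exceeds p ≤ 4, so the argument would not single out the mixed case.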