Commit 41fec04

joseemoreira authored and ptomsich committed
unprivileged/integrated-matrix: Address comments from 2nd IME TG internal review
- Clarify that W couples the arithmetic widening ratio and the packing of input elements within each SEW-wide storage position; document the non-widening (W=1) and widening (W>1) cases explicitly.
- MUL_C: add the requirement that lambda^2 must divide VLEN/SEW, and that the C register group index must be a multiple of MUL_C.
- vtype: note that `altfmt_A`, `altfmt_B`, and `bs` sit outside both the `vsetvli` and `vsetivli` immediate fields.
- Normalize terminology: use "non-widening" / "widening" consistently in preference to "non-packing" / "double-packing" / etc., and refer to narrow inputs as packed into SEW-wide storage positions.
- Remove the incorrect restriction that forbade mixed OFP8 inputs (E4M3 x E5M2) on vfmmacc.vv: the encoding map already permits all four OFP8 combinations, and the justification in the NOTE was mathematically inconsistent with the same-format case.
- SAIL: change `let LD` to `var LD` in the four tile load/store operations, since LD is reassigned within the body.
1 parent 40c9c6f commit 41fec04

1 file changed

Lines changed: 36 additions & 21 deletions

src/integrated-matrix.adoc

@@ -31,26 +31,42 @@ The Zvvm family of Integrated Matrix extensions (Zvvmm, Zvvfmm, Zvvmtls, Zvvmttl
 
 The geometry of the multiplier and the tiles is defined by the new parameter `lambda` (λ) which is encoded in 3 bits in the `vtype` CSR, and vector operation parameters like the widening `W` of the multiplication encoded in the instruction, `LMUL`, `SEW` and `VLEN`.
 
+The Zvvm family uses the parameter `W` to describe both the arithmetic widening ratio
+and the corresponding packing of input elements within each `SEW`-wide storage position.
+
+When `W = 1`, the instruction is _non-widening_: each input element has width `SEW`,
+no packing occurs, and the accumulator C also has width `SEW`.
+
+When `W > 1`, the instruction is _widening_: each input element has width `SEW/W`,
+and `W` such narrow input elements are packed into each `SEW`-wide storage position.
+The accumulator C still has width `SEW`.
+
+Thus, "packing" refers to the storage layout of the inputs in the vector registers,
+whereas "widening" refers to the arithmetic relationship between the input element width
+and the accumulator element width. In this specification, these two views are coupled by
+the same parameter `W`.
+
 Matrix tiles are represented using the existing RISC-V V register file and its configuration state.
 The three matrices in the multiply-accumulate operation C ← A × B^T^ + C are stored as follows:
 
 * The _accumulator_ C is stored in a vector register group with element width SEW.
 Its register group multiplier MUL_C is determined by the tile geometry:
 MUL_C = (VLEN / SEW) / λ^2^, where λ is the tile-layout parameter decoded from the `lambda[2:0]` field in `vtype`.
 The C register group may start at any vector register index that is MUL_C-aligned.
+λ^2^ must divide (VLEN / SEW) such that
 MUL_C ∈ {1, 2, 4, 8, 16}.
 If MUL_C = 16, the only allowed vector register indices are 0 and 16.
+In general, the vector register index for the C register group must be a multiple of MUL_C.
 
 * The _input matrices_ A and B are stored in vector register groups with element width determined by the instruction:
-equal to SEW for non-packing variants, SEW/2 for double-packing, SEW/4 for quad-packing, and SEW/8 for octo-packing variants.
-The effective K dimension of the multiply (K_eff) equals λ × W × LMUL, where W is 1 for non-packing,
-2 for double-packing, 4 for quad-packing, and 8 for octo-packing instructions.
+equal to SEW for non-widening variants (`W=1`), and equal to SEW/2, SEW/4, or SEW/8 for widening variants (`W=2`, `W=4`, or `W=8`), with the narrower input elements packed within each `SEW`-wide storage position.
+The effective K dimension of the multiply (K_eff) equals λ × W × LMUL, where W is 1 for non-widening instructions, 2 for 2× widening instructions, 4 for 4× widening instructions, and 8 for 8× widening instructions.
 LMUL scales the A and B register groups along the K dimension only and does not affect C.
 Only integer values of LMUL are supported by the Zvvm family of Integrated Matrix extensions: LMUL ∈ {1, 2, 4, 8}.
 Fractional LMUL settings (LMUL < 1) are reserved and shall raise an illegal-instruction exception when used with any IME instruction.
 
 [#ime-geometry-fig]
-.Geometry of matrix tiles and element ordering for 32 element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, C having the same SEW. (b) Widening case with A and B having half the SEW of C (double-packing).
+.Geometry of matrix tiles and element ordering for 32 element vector registers and λ=4. VRs are interpreted as 2D tiles. Vector element indices show the tile element order. (a) Non-widening case with A, B, and C having the same SEW. (b) Widening case with A and B having half the SEW of C, with two narrow input elements packed into each SEW-wide storage position.
 image::png/ime-geometry.png[width=100%, align=center, alt="Diagram of matrix tile geometry and multiplier configuration parameters."]
 
 The K-dimension of the multiplication (shared inner dimension of A and B^T^) is determined by λ from `vtype`, scaled by a per-instruction widening factor W and further multiplied by LMUL:
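
The geometry rules in the hunk above (MUL_C, its divisibility requirement, C register alignment, and K_eff) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, `lam` is taken to be the already-decoded λ value (the decoding of `lambda[2:0]` is not shown in this diff), and the register file is assumed to have the 32 vector registers of the base V extension.

```python
# Illustrative sketch only; function names are hypothetical, not part of the spec.

def mul_c(vlen: int, sew: int, lam: int) -> int:
    """MUL_C = (VLEN / SEW) / lambda^2, with the divisibility rule from the diff."""
    if vlen % sew != 0:
        raise ValueError("SEW must divide VLEN")
    elems_per_vreg = vlen // sew
    if elems_per_vreg % (lam * lam) != 0:
        raise ValueError("lambda^2 must divide VLEN/SEW")
    result = elems_per_vreg // (lam * lam)
    if result not in (1, 2, 4, 8, 16):
        raise ValueError("MUL_C must be one of 1, 2, 4, 8, 16")
    return result


def c_index_is_legal(vreg: int, mul_c_value: int) -> bool:
    """The C register group index must be a multiple of MUL_C (32 registers assumed)."""
    return 0 <= vreg < 32 and vreg % mul_c_value == 0


def k_eff(lam: int, w: int, lmul: int) -> int:
    """K_eff = lambda * W * LMUL, for W in {1, 2, 4, 8} and integer LMUL only."""
    if w not in (1, 2, 4, 8):
        raise ValueError("W must be 1, 2, 4, or 8")
    if lmul not in (1, 2, 4, 8):
        raise ValueError("fractional LMUL is reserved; LMUL must be 1, 2, 4, or 8")
    return lam * w * lmul


if __name__ == "__main__":
    # Figure configuration: 32 elements per vector register, lambda = 4.
    m = mul_c(vlen=1024, sew=32, lam=4)
    print("MUL_C =", m)
    print("K_eff (W=2, LMUL=1) =", k_eff(4, 2, 1))
    print("legal C indices for MUL_C=16:",
          [r for r in range(32) if c_index_is_legal(r, 16)])
```

For the figure's configuration (32 elements per register, λ=4) the sketch yields MUL_C = 2, and for MUL_C = 16 it reproduces the two legal indices 0 and 16 named in the text.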
@@ -312,7 +328,7 @@ Their meaning depends on whether the instruction is a floating-point or integer
 
 The `altfmt_A` and `altfmt_B` bits are located at `vtype[XLEN-5]` and
 `vtype[XLEN-6]`, immediately below the `lambda[2:0]` field.
-These positions are outside the `vsetvli` immediate field and must be
+These positions are outside the `vsetvli` or `vsetivli` immediate field and must be
 configured via `vsetvl` (with the full `vtype` value in a register) or
 `vsetivli`.
 
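
To make the bit positions concrete, the following Python sketch composes the upper `vtype` bits and checks them against the immediate widths of `vsetvli` (11 bits, `zimm[10:0]`) and `vsetivli` (10 bits, `zimm[9:0]`) from the base V extension. It is a sketch under assumptions: the function names are hypothetical, `lambda[2:0]` is placed at `vtype[XLEN-2:XLEN-4]` by inference from "immediately below" (with `vill` at `vtype[XLEN-1]`), and the `bs` position `vtype[XLEN-7]` is the one quoted in the next hunk header.

```python
# Illustrative sketch only; names are hypothetical and the lambda field
# position is inferred from the text, not quoted from it.

VSETVLI_ZIMM_BITS = 11   # vsetvli encodes vtype[10:0] as an immediate
VSETIVLI_ZIMM_BITS = 10  # vsetivli encodes vtype[9:0] as an immediate


def ime_vtype_bits(xlen: int, lam_field: int, altfmt_a: int,
                   altfmt_b: int, bs: int) -> int:
    """Place the IME fields in the upper bits of a vtype value."""
    if not 0 <= lam_field < 8:
        raise ValueError("lambda[2:0] is a 3-bit field")
    for bit in (altfmt_a, altfmt_b, bs):
        if bit not in (0, 1):
            raise ValueError("altfmt_A, altfmt_B and bs are single bits")
    return ((lam_field << (xlen - 4))    # assumed: vtype[XLEN-2:XLEN-4]
            | (altfmt_a << (xlen - 5))   # vtype[XLEN-5]
            | (altfmt_b << (xlen - 6))   # vtype[XLEN-6]
            | (bs << (xlen - 7)))        # vtype[XLEN-7]


def reachable_by_immediate(vtype_value: int, zimm_bits: int) -> bool:
    """True if every set bit of vtype_value fits in a zimm of the given width."""
    return (vtype_value >> zimm_bits) == 0


if __name__ == "__main__":
    value = ime_vtype_bits(xlen=64, lam_field=0b101, altfmt_a=1, altfmt_b=0, bs=1)
    print(hex(value))
    print(reachable_by_immediate(value, VSETVLI_ZIMM_BITS))
    print(reachable_by_immediate(value, VSETIVLI_ZIMM_BITS))
```

For both XLEN=32 and XLEN=64 the lowest of these positions (`XLEN-7`) lies well above bit 10, so a value with any of these fields set has to be supplied in a register.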
@@ -370,19 +386,18 @@ The `bs` bit is located at `vtype[XLEN-7]`, immediately below `altfmt_B`.
 Like `altfmt_A` and `altfmt_B`, this position is outside the `vsetvli`
 immediate field.
 
-
 === Storage formats
 
 ==== Element packing in input tiles
 
-For non-widening multiply-accumulate instructions (W=1), the input elements
-A and B have the same width as the accumulator (SEW) and no packing occurs.
+For non-widening multiply-accumulate instructions (`W=1`), the input elements
+A and B have the same width as the accumulator (`SEW`), and no packing occurs.
 
-For widening instructions (W=2 or W=4), each input register group holds
-elements at the effective input element width EEW = SEW ÷ W. Because
-tile load instructions always transfer data at SEW granularity, every loaded
-SEW-bit position contains W contiguous narrow elements that the
-multiply-accumulate instruction consumes as a sub-dot-product.
+For widening multiply-accumulate instructions (`W>1`), each input register group holds
+elements at the effective input element width `EEW = SEW ÷ W`. Because tile load
+instructions always transfer data at `SEW` granularity, every loaded `SEW`-bit storage
+position contains `W` contiguous packed narrow elements that the multiply-accumulate
+instruction consumes as a sub-dot-product.
 
 [#ime-tile-widening-fig]
 .Element distribution and tile geometry example for L=32, SEW wide elements (left), two SEW/2 wide elements packed per SEW (middle), and four SEW/4 wide elements per SEW (right). Packing/widening by W increases the effective K dimension of the tile by a factor of W.
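
The packing described in this hunk can be illustrated with a small Python model of one `SEW`-wide storage position. It is a sketch under assumptions that the diff does not state: elements are treated as unsigned integers, narrow element 0 is placed in the least-significant bits of the position, and all function names are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the diff): unsigned elements,
# narrow element 0 in the least-significant bits, hypothetical function names.

def pack(elems: list[int], sew: int, w: int) -> int:
    """Pack W narrow elements of width EEW = SEW / W into one SEW-bit position."""
    if len(elems) != w:
        raise ValueError("exactly W narrow elements are packed per position")
    eew = sew // w
    mask = (1 << eew) - 1
    word = 0
    for i, e in enumerate(elems):
        word |= (e & mask) << (i * eew)
    return word


def unpack(word: int, sew: int, w: int) -> list[int]:
    """Recover the W narrow elements from one SEW-bit position."""
    eew = sew // w
    mask = (1 << eew) - 1
    return [(word >> (i * eew)) & mask for i in range(w)]


def sub_dot_product(a_word: int, b_word: int, sew: int, w: int) -> int:
    """Consume one SEW-bit position of A and of B as a W-term dot product."""
    a = unpack(a_word, sew, w)
    b = unpack(b_word, sew, w)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = pack([1, 2, 3, 4], sew=32, w=4)
    b = pack([5, 6, 7, 8], sew=32, w=4)
    print(hex(a), hex(b), sub_dot_product(a, b, sew=32, w=4))
```

With `w=1` the model degenerates to the non-widening case: one element per position and a single product.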
@@ -557,10 +572,10 @@ Likewise, for 8-bit inputs, `altfmt_A` and `altfmt_B` independently select
 E4M3 or E5M2. All four combinations are covered by the same OFP8
 subextension for the given output format.
 
-NOTE: Mixed OFP8 inputs (E4M3 × E5M2) are only permitted with widening
-instructions (`vfwmmacc.vv`, `vfqmmacc.vv`), not with `vfmmacc.vv`
-(OFP8 → OFP8), because the exact product (up to 7 significand bits) exceeds
-the OFP8 output precision (p ≤ 4).
+// NOTE: Mixed OFP8 inputs (E4M3 × E5M2) are only permitted with widening
+// instructions (`vfwmmacc.vv`, `vfqmmacc.vv`), not with `vfmmacc.vv`
+// (OFP8 → OFP8), because the exact product (up to 7 significand bits) exceeds
+// the OFP8 output precision (p ≤ 4).
 
 ===== 16-bit inputs (IEEE binary16, BFloat16)
 
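
The NOTE commented out in the hunk above argued from significand widths. A short Python sketch of that arithmetic, assuming the usual OFP8 precisions (E4M3: 3 stored mantissa bits plus the implicit bit, p = 4; E5M2: 2 stored bits plus the implicit bit, p = 3), may help show why the commit message calls the argument inconsistent with the same-format case. The names below are illustrative only.

```python
from itertools import product

# Significand precision p (stored mantissa bits + 1 implicit bit).
PRECISION = {"E4M3": 4, "E5M2": 3}


def exact_product_bits(fmt_a: str, fmt_b: str) -> int:
    """Bits needed for the exact product of two maximal significands."""
    max_a = (1 << PRECISION[fmt_a]) - 1
    max_b = (1 << PRECISION[fmt_b]) - 1
    return (max_a * max_b).bit_length()


if __name__ == "__main__":
    for fmt_a, fmt_b in product(PRECISION, repeat=2):
        print(f"{fmt_a} x {fmt_b}: {exact_product_bits(fmt_a, fmt_b)} bits")
```

Under these assumptions the mixed product needs up to 7 bits, as the NOTE said, but E4M3 × E4M3 needs up to 8 and E5M2 × E5M2 up to 6. All of these exceed p ≤ 4, so the same reasoning would also have excluded the same-format combinations.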
@@ -3204,7 +3219,7 @@ let eff_lambda : int =
 }
 };
 
-let LD : int = unsigned(X(rs2));
+var LD : int = unsigned(X(rs2));
 let vm_val = read_vmask(num_elem, vm, zvreg);
 let linesize : int = eff_lambda * LMUL;
 
@@ -3302,7 +3317,7 @@ let eff_lambda : int =
 }
 };
 
-let LD : int = unsigned(X(rs2));
+var LD : int = unsigned(X(rs2));
 let vm_val = read_vmask(num_elem, vm, zvreg);
 let linesize : int = eff_lambda * LMUL;
 
@@ -3404,7 +3419,7 @@ let eff_lambda : int =
 }
 };
 
-let LD : int = unsigned(X(rs2));
+var LD : int = unsigned(X(rs2));
 let vm_val = read_vmask(num_elem, vm, zvreg);
 let linesize : int = eff_lambda * LMUL;
 
@@ -3504,7 +3519,7 @@ let eff_lambda : int =
 }
 };
 
-let LD : int = unsigned(X(rs2));
+var LD : int = unsigned(X(rs2));
 let vm_val = read_vmask(num_elem, vm, zvreg);
 let linesize : int = eff_lambda * LMUL;
 