Clarify tail handling #5
Introduce N_tile_max = M_tile to the tile geometry formulas and add a new "C tile tail policy" section that specifies the behaviour of C elements beyond the active N_tile columns when VL is partial. The key property is that tail elements are never read or written by multiply-accumulate instructions, so tail-undisturbed (vta=0) is achieved by write-skip rather than read-merge. Implementations are therefore not required to read the tail portion of the C register group, which benefits outer-product engines and register-renaming machines.
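The difference between read-merge and write-skip can be sketched with a toy model in Python. This is an illustration only: the flat list standing in for the C register group, the `n_active` parameter, and both function names are assumptions made for this sketch and do not come from the specification's pseudocode.

```python
def update_read_merge(c, n_active, compute):
    """Read the whole destination, then write back a merged copy."""
    old = list(c)  # reads every element, tail included
    return [compute(i, old[i]) if i < n_active else old[i]
            for i in range(len(old))]


def update_write_skip(c, n_active, compute):
    """Update only the active positions; tail positions are never accessed."""
    for i in range(n_active):  # the loop never reaches the tail
        c[i] = compute(i, c[i])
    return c


def accumulate(i, acc):
    """Hypothetical stand-in for one multiply-accumulate step."""
    return acc + 1


c_tile = [10, 20, 30, 40, 50, 60]
merged = update_read_merge(c_tile, 4, accumulate)
skipped = update_write_skip(list(c_tile), 4, accumulate)
```

Both functions leave the last two elements at 50 and 60, so the architectural result is the same. Only the write-skip variant gets there without reading the tail.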
I think there is still some confusion here. I am referring to the note about the C tile tail policy: For IME multiply-accumulate instructions, undisturbed behavior requires no read: the instruction leaves tail element positions untouched, which permits hardware implementations (including outer-product engines and register-renaming machines) to process only the N_tile active columns of the C tile without referencing the tail portion of the C register group. That is certainly true for the output registers that lie entirely in the tail. But because we allow VL to be a multiple of λ (not λ²), the tail can begin in the middle of one of the output registers. The tail fraction of that register needs to be read and written back, either unmodified or as all "1"s. If the micro-architecture supports segmenting a register into chunks of λ elements, then yes, we can avoid reading those elements, but from the architectural perspective the register was read, modified, and written.
I would have hoped that this wording captures exactly that detail: the tail portion of the C register group (i.e., all registers that fall entirely into the tail) is untouched and doesn't even have to be read. Any suggestions on how we can capture this better?
Hi Philipp, I think you have it, just add "(i.e., all registers that fall entirely into the tail)". I didn't see that in the text. Is "tail-portion of the C register group" well-established terminology that does not require the parenthetical clarification? For me, "tail-portion" (without the clarification) would mean the entire tail, including the partial register.
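The distinction discussed in this thread, between registers that fall entirely into the tail and the one register the tail boundary may cut through, can be sketched as follows. The model is a simplification: it treats the C register group as a flat run of elements, and `classify_registers`, `elems_per_reg`, and `tail_start` are hypothetical names. The example values (λ = 4, λ² elements per register) are assumptions chosen only to show a boundary that is a multiple of λ but not of λ².

```python
def classify_registers(num_regs, elems_per_reg, tail_start):
    """Label each register of a group by how the tail boundary cuts it."""
    labels = []
    for r in range(num_regs):
        first = r * elems_per_reg      # index of the register's first element
        end = first + elems_per_reg    # one past its last element
        if end <= tail_start:
            labels.append("body")      # holds no tail elements
        elif first >= tail_start:
            labels.append("tail")      # entirely in the tail
        else:
            labels.append("partial")   # tail begins inside this register
    return labels


lam = 4                                # assumed value of λ for the example
labels = classify_registers(num_regs=4, elems_per_reg=lam * lam,
                            tail_start=5 * lam)
```

Under these assumptions the result is `["body", "partial", "tail", "tail"]`. The two `"tail"` registers are the ones that need not be read. The `"partial"` register is the case raised above, where the register as a whole is architecturally read, modified, and written.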
…l policy with base V

Bring vmtl.v, vmts.v, vmttl.v, vmtts.v into alignment with the standard vector load/store element-status semantics (sec-inactive-defs):

* Add a four-category element-status summary (active / inactive / tail / prestart) to the shared Instructions section, with explicit vma and vta references and a cross-link to <<sec-inactive-defs>>.
* Fix "active element index i in [0, VL)" to "element index i in the body [vstart, VL) where the mask is enabled" throughout.
* Extend each instruction's Description to cover the vma=1 (inactive may overwrite with 1s) and vta=1 (tail may overwrite with 1s) cases; stores now explicitly state that inactive and tail elements are not written to memory and do not raise exceptions.
* Update load pseudocode comments to distinguish inactive (body, mask=0) from tail (unreachable by the loop) and name the governing vma/vta policies.
* Remove init_masked_source from vmts.v and vmtts.v pseudocode; replace with read_vmask + vm_val[i] to match the loads and regular vector stores.
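The four-category element-status model named in this commit message can be sketched as a small classifier. It follows the base V definitions (prestart below vstart, tail at or beyond VL, body in between, split by the mask). The function name and the `mask_enabled` parameter are hypothetical and not taken from the spec pseudocode.

```python
def element_status(i, vstart, vl, mask_enabled):
    """Classify element index i as prestart, tail, inactive, or active."""
    if i < vstart:
        return "prestart"              # before the first body element
    if i >= vl:
        return "tail"                  # at or beyond VL
    # Body element: vstart <= i < vl. The mask decides active vs. inactive.
    return "active" if mask_enabled else "inactive"


mask = [1, 0, 1, 0, 1, 1, 1, 1]        # example mask bits, one per element
statuses = [element_status(i, vstart=1, vl=5, mask_enabled=bool(mask[i]))
            for i in range(len(mask))]
```

With vstart = 1 and VL = 5, element 0 is prestart, elements 1 to 4 form the body (active where the mask bit is set, inactive otherwise), and elements 5 to 7 are tail regardless of their mask bits.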
0537887 to 72d87ca
This PR clarifies element-status and tail-policy semantics for the Integrated Matrix Extension, bringing them into alignment with the base RISC-V V specification.
Define C tile tail policy
When VL selects fewer than the maximum number of columns (N_tile < N_tile_max), the inactive columns of the C accumulator tile are tail elements. Unlike ordinary vector instructions, where tail-undisturbed requires a read-merge, IME multiply-accumulate instructions achieve undisturbed behaviour by write-skip: tail element positions are simply never written, so no read of the tail portion of the C register group is required in either the vta=0 or vta=1 case. This removes a significant burden from outer-product engines and register-renaming implementations. The tile geometry table gains an explicit N_tile_max = M_tile definition to anchor the tail boundary.
Align tile load/store masking and tail policy with base V
The tile load/store instruction descriptions (vmtl.v, vmts.v, vmttl.v, vmtts.v) previously described only the vm=0 "not written" case and used imprecise terminology. This commit rewrites the relevant prose and pseudocode to match the four-category element-status model from the base V spec: active elements access memory normally; inactive elements (body, mask disabled) follow the vma policy; tail elements (index ≥ VL) follow the vta policy; prestart elements are untouched. The init_masked_source call, which had no equivalent in regular vector stores and could spuriously raise an illegal-instruction exception, is removed from the store pseudocode and replaced with the same read_vmask + direct mask-bit check used by the loads.
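How the vma and vta policies act on a load, and why a store never touches memory for inactive or tail elements, can be sketched with a toy model. It is not the spec pseudocode: `model_load`, `model_store`, the flat unit-stride memory list, and the 8-bit `ALL_ONES` value are assumptions for illustration. An agnostic policy permits either leaving the element undisturbed or writing all 1s. This model always writes all 1s so that the difference is visible.

```python
ALL_ONES = 0xFF  # assumes 8-bit elements for this sketch


def model_load(vd, mem, vstart, vl, mask, vma, vta):
    """Return the destination after a masked load (toy unit-stride model)."""
    out = list(vd)
    for i in range(len(out)):
        if i < vstart:
            continue                   # prestart: untouched
        if i >= vl:
            if vta:
                out[i] = ALL_ONES      # tail, agnostic: modelled as all 1s
        elif mask[i]:
            out[i] = mem[i]            # active: loaded from memory
        elif vma:
            out[i] = ALL_ONES          # inactive, agnostic: modelled as all 1s
    return out


def model_store(vs, mem, vstart, vl, mask):
    """Return memory after a masked store; only active elements are written."""
    out = list(mem)
    for i in range(vstart, vl):        # the loop never reaches the tail
        if mask[i]:
            out[i] = vs[i]
    return out


mask = [1, 1, 0, 1, 1, 1]
loaded = model_load([0] * 6, [1, 2, 3, 4, 5, 6], vstart=1, vl=4,
                    mask=mask, vma=0, vta=1)
stored = model_store([9] * 6, [1, 2, 3, 4, 5, 6], vstart=1, vl=4, mask=mask)
```

In the load example, element 2 is inactive and stays 0 because vma=0, while tail elements 4 and 5 become all 1s because vta=1. In the store example, only elements 1 and 3 reach memory.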