Clarify tail handling #5

Merged
ptomsich merged 2 commits into riscv:integrated-matrix-extension from ptomsich:ptomsich/clarify-tail-handling
Mar 1, 2026

@ptomsich
Collaborator

@ptomsich ptomsich commented Mar 1, 2026

This PR clarifies element-status and tail-policy semantics for the Integrated Matrix Extension, bringing them into alignment with the base RISC-V V specification.

Define C tile tail policy

When VL selects fewer than the maximum number of columns (N_tile < N_tile_max), the inactive columns of the C accumulator tile are tail elements. Unlike ordinary vector instructions — where
tail-undisturbed requires a read-merge — IME multiply-accumulate instructions achieve undisturbed behaviour by write-skip: tail element positions are simply never written, so no read of the tail
portion of the C register group is required in either the vta=0 or vta=1 case. This removes a significant burden from outer-product engines and register-renaming implementations. The tile geometry
table gains an explicit N_tile_max = M_tile definition to anchor the tail boundary.
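
The write-skip idea can be illustrated with a toy element-level model. This is only a sketch: the flat one-entry-per-column layout and the names `mac_write_skip` and `n_tile` are assumptions made for illustration, not definitions from the IME specification, and register boundaries are ignored entirely.

```python
# Toy model of write-skip on a C tile (illustrative assumptions only):
# the tile is a flat list with one entry per column, and n_tile is the
# number of active columns selected by VL.

def mac_write_skip(c, a, b, n_tile):
    """Accumulate a[j] * b[j] into the first n_tile columns of c, in place.

    Columns at index >= n_tile are tail elements. The loop never reaches
    them, so they are neither read nor written and keep their contents.
    Returns the list of column indices that were accessed.
    """
    accessed = []
    for j in range(n_tile):
        accessed.append(j)       # only active columns are read ...
        c[j] += a[j] * b[j]      # ... and only active columns are written
    return accessed


c_tile = [10, 20, 30, 40]
accessed = mac_write_skip(c_tile, [1, 2, 3, 4], [5, 6, 7, 8], n_tile=2)
print(c_tile)    # columns 2 and 3 are unchanged
print(accessed)  # only columns 0 and 1 were touched
```

Leaving the tail untouched is one valid outcome under both vta=0 (undisturbed) and vta=1 (agnostic), which is why the same loop serves both settings in this model.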

Align tile load/store masking and tail policy with base V

The tile load/store instruction descriptions (vmtl.v, vmts.v, vmttl.v, vmtts.v) previously described only the vm=0 "not written" case and used imprecise terminology. This commit rewrites the
relevant prose and pseudocode to match the four-category element-status model from the base V spec: active elements access memory normally; inactive elements (body, mask disabled) follow the vma
policy; tail elements (index ≥ VL) follow the vta policy; prestart elements are untouched. The init_masked_source call — which had no equivalent in regular vector stores and could spuriously raise
an illegal-instruction exception — is removed from the store pseudocode, replaced with the same read_vmask + direct mask-bit check used by the loads.
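
The four-category model described above can be written as a small classifier. This is a minimal sketch that assumes `vstart <= vl`, uses the base V convention that vm=1 means unmasked, and represents the mask as a plain Python list; the function name `element_status` is hypothetical.

```python
# Classify an element index using the base V element-status categories.
# Assumes vstart <= vl; `mask` is a list of 0/1 mask bits (illustrative).

def element_status(i, vstart, vl, vm, mask):
    if i < vstart:
        return "prestart"    # untouched by the instruction
    if i >= vl:
        return "tail"        # governed by the vta policy
    if vm == 1 or mask[i]:
        return "active"      # accesses memory normally
    return "inactive"        # body element, mask disabled: vma policy


mask_bits = [1, 1, 0, 1, 0, 1, 1, 1]
statuses = [element_status(i, vstart=1, vl=5, vm=0, mask=mask_bits)
            for i in range(8)]
print(statuses)
```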

Introduce N_tile_max = M_tile to the tile geometry formulas and add a
new "C tile tail policy" section that specifies the behaviour of C
elements beyond the active N_tile columns when VL is partial.

The key property is that tail elements are never read or written by
multiply-accumulate instructions, so tail-undisturbed (vta=0) is
achieved by write-skip rather than read-merge.  Implementations are
therefore not required to read the tail portion of the C register group,
which benefits outer-product engines and register-renaming machines.
@ptomsich ptomsich requested review from efocht and joseemoreira March 1, 2026 10:32
@joseemoreira
Collaborator

I think there is still some confusion here. I am referring to the note about the C tile tail policy:

For IME multiply-accumulate instructions, undisturbed behavior requires no read — the instruction leaves tail element positions untouched — which permits hardware implementations (including outer-product engines and register-renaming machines) to process only the N_tile active columns of the C tile without referencing the tail portion of the C register group.

That is certainly true for the output registers that are entirely in the tail. But because we allow VL to be a multiple of λ (not λ²), the tail can begin in the middle of one of the output registers. That particular fraction of the tail needs to be read/written back, either unmodified or all "1"s. If the micro-architecture supports segmenting a register into chunks of λ elements, then yes, we can avoid reading those elements, but from the architecture perspective the register was read/modified/written.
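
The distinction between registers entirely in the tail and the one register the tail may split can be sketched as follows. The model assumes, purely for illustration, that VL counts C elements in register order and that each output register holds `elems_per_reg` elements (for example λ²); both the parameter and the function name `classify_c_registers` are hypothetical.

```python
# Partition the registers of a C register group by where the tail begins.
# Assumption (illustrative): elements are numbered in register order and
# each register holds elems_per_reg of them.

def classify_c_registers(vl, elems_per_reg, num_regs):
    fully_body, straddling, entirely_tail = [], [], []
    for r in range(num_regs):
        first = r * elems_per_reg        # first element index in register r
        end = first + elems_per_reg      # one past the last element index
        if end <= vl:
            fully_body.append(r)         # every element is in the body
        elif first >= vl:
            entirely_tail.append(r)      # never needs to be read or written
        else:
            straddling.append(r)         # tail starts inside this register
    return fully_body, straddling, entirely_tail


# lambda = 4, so a register holds 16 elements; VL = 24 is a multiple of
# lambda but not of lambda squared.
body_regs, split_regs, tail_regs = classify_c_registers(
    vl=24, elems_per_reg=16, num_regs=4)
print(body_regs, split_regs, tail_regs)
```

With these numbers, register 1 is the straddling case discussed above: architecturally it is read, modified, and written back, while registers 2 and 3 lie entirely in the tail.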

@ptomsich
Collaborator Author

ptomsich commented Mar 1, 2026

without referencing the tail portion of the C register group

I would have hoped that this wording captures exactly that detail: the tail portion of the C register group (i.e., all registers that fall entirely into the tail) is untouched and doesn't even have to be read.

Any suggestions on how we can capture this better?

@joseemoreira
Collaborator

Hi Philipp, I think you have it; just add "(i.e., all registers that fall entirely into the tail)". I didn't see that in the text. Is "tail-portion of the C register group" well-established terminology that does not require the parenthetical clarification? To me, "tail-portion" (without the clarification) would mean the entire tail, including the partial register.

…l policy with base V

Bring vmtl.v, vmts.v, vmttl.v, vmtts.v into alignment with the
standard vector load/store element-status semantics (sec-inactive-defs):

* Add a four-category element-status summary (active / inactive / tail /
  prestart) to the shared Instructions section, with explicit vma and
  vta references and a cross-link to <<sec-inactive-defs>>.
* Fix "active element index i in [0, VL)" to "element index i in the
  body [vstart, VL) where the mask is enabled" throughout.
* Extend each instruction's Description to cover the vma=1 (inactive
  may overwrite with 1s) and vta=1 (tail may overwrite with 1s) cases;
  stores now explicitly state that inactive and tail elements are not
  written to memory and do not raise exceptions.
* Update load pseudocode comments to distinguish inactive (body,
  mask=0) from tail (unreachable by the loop) and name the governing
  vma/vta policies.
* Remove init_masked_source from vmts.v and vmtts.v pseudocode;
  replace with read_vmask + vm_val[i] to match the loads and regular
  vector stores.
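
The load and store behaviour listed above can be modelled at element level. This is a hedged sketch, not the specification pseudocode: unit-stride addressing, 8-bit elements, the dict-based memory, and the names `tile_store` and `tile_load` are all assumptions. For agnostic policies the model picks the "overwrite with 1s" outcome to make the policy visible, although leaving the element undisturbed is equally permitted.

```python
ALL_ONES = 0xFF  # assumes 8-bit elements for illustration


def tile_store(mem, base, src, vstart, vl, vm, mask):
    """Store body elements whose mask is enabled; skip everything else."""
    written = []
    for i in range(vstart, vl):      # prestart and tail never enter the loop
        if vm == 1 or mask[i]:       # direct mask-bit check
            mem[base + i] = src[i]
            written.append(i)
    return written                   # inactive elements caused no access


def tile_load(mem, base, dst, vstart, vl, vm, mask, vma, vta):
    """Load active elements; apply vma to inactive and vta to tail."""
    for i in range(len(dst)):
        if i < vstart:
            continue                 # prestart: untouched
        if i >= vl:
            if vta == 1:
                dst[i] = ALL_ONES    # tail-agnostic: one permitted outcome
        elif vm == 1 or mask[i]:
            dst[i] = mem[base + i]   # active: normal memory access
        elif vma == 1:
            dst[i] = ALL_ONES        # mask-agnostic: one permitted outcome


memory = {}
stored = tile_store(memory, 100, [7, 8, 9, 10, 11, 12],
                    vstart=0, vl=4, vm=0, mask=[1, 0, 1, 1, 1, 1])

source_mem = {100 + i: i + 1 for i in range(6)}
dest = [0] * 6
tile_load(source_mem, 100, dest, vstart=1, vl=4, vm=0,
          mask=[1, 1, 0, 1, 1, 1], vma=1, vta=0)
print(stored, memory, dest)
```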
@ptomsich ptomsich force-pushed the ptomsich/clarify-tail-handling branch from 0537887 to 72d87ca Compare March 1, 2026 21:01
@ptomsich ptomsich merged commit 70d62cd into riscv:integrated-matrix-extension Mar 1, 2026
3 checks passed