-
Notifications
You must be signed in to change notification settings - Fork 32
[AIEX] Minimize Stackslots by Decomposition and only keeping defined parts of the Stack #754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: stuckmann.subregs.always
Are you sure you want to change the base?
[AIEX] Minimize Stackslots by Decomposition and only keeping defined parts of the Stack #754
Conversation
Move opcode-specific switch statements from AIE2InstrInfo and AIE2PInstrInfo to a generic implementation in AIEBaseInstrInfo using MachineMemOperand validation. This provides a more maintainable approach that automatically works with any spill instruction without needing explicit opcode lists.
Add tests covering: - Basic spill decomposition (DM -> CM when only CML live) - 3D iterator spill decomposition - Subreg spill optimization - Crash regression tests - LaneBitmask pollution prevention - Orphan load crash prevention
Decompose composite spill instructions (e.g., DM spills) into smaller subreg spills (e.g., CM spills) when only some subregs are live. This reduces stack usage by eliminating spills of undefined subregs. For example, if a DM register is spilled but only CML is live, we replace the 256-byte DM spill with a 128-byte CM spill. The pass uses a two-phase approach: 1. Analysis Phase: Collect spill/reload info, compute liveness as LaneBitmask 2. Transformation Phase: Decompose instructions, keeping same FI Enabled by default, can be disabled with -aie-stack-minimize=false.
Add tests covering: - Q register spill size preservation (don't increase slot size) - Register renaming with shared FI (offset-based tracking) - Mixed partial/full usage scenarios - Non-overlapping lane separation - Overlapping byte range merging - Neutral cases (no optimization expected) - FI regression crash prevention
Minimize stack slot sizes by analyzing which subregs of spilled registers are actually used. If only a subset of subregs is accessed, we recreate the stack slots with smaller sizes. For example, if a 2048-bit DM register is spilled but only the 1024-bit CML subreg is used, we shrink the stack slot from 256 bytes to 128 bytes. When multiple non-overlapping registers (based on LaneBitmask) are spilled to the same offset, the pass creates separate stack slots for each group. This pass runs after AIESpillDecomposition and completes the stack optimization pipeline enabled by -aie-stack-minimize (default: true).
8c6120d to
8908b9e
Compare
| FrameIndex = MI.getOperand(1).getIndex(); | ||
| return MI.getOperand(0).getReg(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope that stack loads have exactly one MMO which is a Load, and I would feel comforted if that were asserted.
I'm also not sure where these functions are used and whether it requires completeness in identifying all stack loads, whether stack loads from user stack objects always have their MMO's set correctly.
I'm also convinced that there are quintillion combinations of stack use, of which only a fraction is tested in practice and that we should hence be cautious. It is great to have this unified, granting us the possibility to concentrate all sanity checking in one place.
| // Note that LDB path does not support SPILL instructions | ||
|
|
||
| case AIE2P::LDA_dms_lda_spill: | ||
| case AIE2P::ST_dms_sts_spill: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you call these intrinsics in your commit comment
| # RUN: llc -mtriple=aie2p --run-pass=none -o - -verify-machineinstrs %s | FileCheck %s | ||
|
|
||
| --- | ||
| name: _Z11conv2d_bf16ILh1EL5act_t0E8bfloat16S1_S1_Lb0ELb0ELb1ELb0ELb0EEvPT1_PT2_PT3_R18conv2d_bf16_paramsS7_S7_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use a readable name
|
Just a convention, 'Adding baseline tests' would label a commit adding unit tests that you expect to change. |
| } | ||
|
|
||
| void AIE2PassConfig::addPostRewrite() { | ||
| if (EnableStackMinimization) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, but checking anyway, don't we have an AIEBasePassConfig where we can add this?
| SpillPoint() = default; | ||
|
|
||
| /// Construct from a spill/reload instruction. | ||
| SpillPoint(MachineInstr &MI, int FI, bool IsStore, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: IsStore can easily be derived from MI
|
|
||
| DenseMap<int, FrameIndexSpillInfo> FIInfoMap; | ||
|
|
||
| /// Collect all spill/reload instructions, compute per-subreg liveness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you invented "spill point" as a short hand for spill/reload instruction, and I like it.
|
|
||
| /// Emit a single subreg spill or reload instruction. | ||
| void emitDecomposedInstr(const SpillPoint &SP, const SubRegInfo &SI, | ||
| bool IsStore) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like transformSpillPoint()
| } | ||
|
|
||
| /// Build a SubRegInfo from expansion info. | ||
| static SubRegInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SubRegInfo constructor?
| /// \return True if any instructions were transformed. | ||
| bool transformFunction(); | ||
|
|
||
| /// Compute live registers at \p MI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please detail before or after MI.
| const AIEBaseInstrInfo &TII, | ||
| const MachineRegisterInfo &MRI) | ||
| : MI(&MI), FI(FI), IsStore(IsStore), | ||
| StoreFlags(MI.getOperand(0).isKill() ? RegState::Kill : 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to do your own liveness analysis in the right direction. Could we use that instead of isKill() ?
|
|
||
| void AIESpillDecomposition::emitDecomposedInstr(const SpillPoint &SP, | ||
| const SubRegInfo &SI, | ||
| bool IsStore) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't SP.IsStore sufficient?
| continue; | ||
|
|
||
| LivePhysRegs LiveRegs(*TRI); | ||
| computeLiveness(LiveRegs, MBB, MI, IsStore); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quadratic && complicated.
You could make two traversals, one for the stores in one direction, one for the loads in the other direction and
maintain the liveness information on these traversals that collect them. Then we don't need to do a liveness traversal for every spillpoint.
Add test that verifies stack objects are ordered by alignment to minimize padding between objects with different alignment requirements.
Add orderFrameObjects hook to sort stack objects by alignment (highest first), reducing wasted stack space from padding between objects with different alignment requirements. AIE has large alignment requirements for vector registers (up to 64 bytes for BM registers).
c0ff255 to
8ff443f
Compare
| bool Emitted = false; | ||
| for (const SubRegInfo &SI : SP.SubRegs) { | ||
| const auto &Stored = StoredSubRegsPerOffset.lookup(SI.Offset); | ||
| if (Stored.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check: this is the main reason why we want to deal with all Stores first, which is why we have them in separate lists.
| const AIEBaseInstrInfo &TII, const MachineRegisterInfo &MRI); | ||
| }; | ||
|
|
||
| /// Per-FI tracking of spills and reloads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a perfect place to explain why we separate the spills and the reloads.
Also, like we discussed, mentioning the fact that the algorithm collects all live lanes over all store instructions relating to a particular FrameIndex.
| SmallVector<MachineInstr *, 16> ToErase; | ||
|
|
||
| for (auto &[FI, Info] : FIInfoMap) { | ||
| // Track which subregs were stored at each offset (local to transformation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what 'local to transformation means' It could be 'for this FrameIndex?'
| bool Changed = false; | ||
| do { | ||
| analyzeFunction(MF); | ||
| } while (transformFunction() && (Changed = true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please discuss rationale and convergence of this fixpoint loop. E.g. (just guessing) livein of successor blocks may change, maximum iteration count is loop depth + 1
| int StoreFI = -1; | ||
| int LoadFI = -1; | ||
| const bool IsStore = TII->isStoreToStackSlot(MI, StoreFI) != 0; | ||
| const bool IsLoad = !IsStore && TII->isLoadFromStackSlot(MI, LoadFI) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen this code before. Could we factor this by passing a lambda to a traversal function?
This PR introduces a stack optimization pipeline.
The core idea is to not spill undef subregisters.
subreg spills (e.g., 128-byte CM spills) when only some subregs are live. Uses liveness analysis with
LaneBitmask to determine which subregs actually need spilling.
subset is used, it recreates stack slots with smaller sizes. Also separates non-overlapping registers
sharing the same frame index into independent slots.
Minor Bug fixes along the way:
using MachineMemOperand validation.
Performance numbers: I am still looking why Frame Object ordering is screwing up.
P.S. I will provide combined performance numbers with #735 integrated, because this PR is a mitgation of some unwanted stack increases of 735.