Skip to content

Conversation

@F-Stuckmann
Copy link
Collaborator

@F-Stuckmann F-Stuckmann commented Jan 15, 2026

This PR introduces a stack optimization pipeline.
The core idea is to not spill undef subregisters.

  • AIESpillDecomposition – Decomposes composite spill instructions (e.g., 256-byte DM spills) into smaller
    subreg spills (e.g., 128-byte CM spills) when only some subregs are live. Uses liveness analysis with
    LaneBitmask to determine which subregs actually need spilling.
  • AIEStackSlotMinimization – Analyzes which subregs of spilled registers are actually accessed. When only a
    subset is used, it recreates stack slots with smaller sizes. Also separates non-overlapping registers
    sharing the same frame index into independent slots.
  • Frame Object Ordering - Added orderFrameObjects hook in AIEBaseFrameLowering to sort stack objects by alignment (highest first), reducing wasted padding.

Minor Bug fixes along the way:

  • Moved isLoadFromStackSlot/isStoreToStackSlot implementations from AIE2/AIE2P to a generic AIEBaseInstrInfo
    using MachineMemOperand validation.
  • Added missing intrinsics to AIE2P frame index lowering.

Performance numbers: I am still looking why Frame Object ordering is screwing up.

Core_Compute_Insn_Count_perf_sorted_reduced Core_PMSize_perf_sorted_reduced_absolute Core_StackSize_perf_sorted_reduced_absolute

P.S. I will provide combined performance numbers with #735 integrated, because this PR is a mitgation of some unwanted stack increases of 735.

Move opcode-specific switch statements from AIE2InstrInfo and AIE2PInstrInfo
to a generic implementation in AIEBaseInstrInfo using MachineMemOperand
validation. This provides a more maintainable approach that automatically
works with any spill instruction without needing explicit opcode lists.
Add tests covering:
- Basic spill decomposition (DM -> CM when only CML live)
- 3D iterator spill decomposition
- Subreg spill optimization
- Crash regression tests
- LaneBitmask pollution prevention
- Orphan load crash prevention
Decompose composite spill instructions (e.g., DM spills) into smaller subreg
spills (e.g., CM spills) when only some subregs are live. This reduces stack
usage by eliminating spills of undefined subregs.

For example, if a DM register is spilled but only CML is live, we replace
the 256-byte DM spill with a 128-byte CM spill.

The pass uses a two-phase approach:
1. Analysis Phase: Collect spill/reload info, compute liveness as LaneBitmask
2. Transformation Phase: Decompose instructions, keeping same FI

Enabled by default, can be disabled with -aie-stack-minimize=false.
Add tests covering:
- Q register spill size preservation (don't increase slot size)
- Register renaming with shared FI (offset-based tracking)
- Mixed partial/full usage scenarios
- Non-overlapping lane separation
- Overlapping byte range merging
- Neutral cases (no optimization expected)
- FI regression crash prevention
Minimize stack slot sizes by analyzing which subregs of spilled registers
are actually used. If only a subset of subregs is accessed, we recreate
the stack slots with smaller sizes.

For example, if a 2048-bit DM register is spilled but only the 1024-bit CML
subreg is used, we shrink the stack slot from 256 bytes to 128 bytes.

When multiple non-overlapping registers (based on LaneBitmask) are spilled
to the same offset, the pass creates separate stack slots for each group.

This pass runs after AIESpillDecomposition and completes the stack optimization
pipeline enabled by -aie-stack-minimize (default: true).
@F-Stuckmann F-Stuckmann changed the title [Z [AIEX] Stackslot reduction Jan 15, 2026
@F-Stuckmann F-Stuckmann changed the title [AIEX] Stackslot reduction [AIEX] Minimize Stackslots by Decomposition and only keeping defined parts of the Stack Jan 15, 2026
@F-Stuckmann F-Stuckmann changed the base branch from aie-public to stuckmann.subregs.always January 15, 2026 16:22
@F-Stuckmann F-Stuckmann force-pushed the stuckmann.subregs.always branch from 8c6120d to 8908b9e Compare January 15, 2026 16:28
@jorickert jorickert removed the request for review from ljfitz January 15, 2026 17:49
FrameIndex = MI.getOperand(1).getIndex();
return MI.getOperand(0).getReg();
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope that stack loads have exactly one MMO which is a Load, and I would feel comforted if that were asserted.
I'm also not sure where these functions are used and whether it requires completeness in identifying all stack loads, whether stack loads from user stack objects always have their MMO's set correctly.
I'm also convinced that there are quintillion combinations of stack use, of which only a fraction is tested in practice and that we should hence be cautious. It is great to have this unified, granting us the possibility to concentrate all sanity checking in one place.

// Note that LDB path does not support SPILL instructions

case AIE2P::LDA_dms_lda_spill:
case AIE2P::ST_dms_sts_spill:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you call these intrinsics in your commit comment

# RUN: llc -mtriple=aie2p --run-pass=none -o - -verify-machineinstrs %s | FileCheck %s

---
name: _Z11conv2d_bf16ILh1EL5act_t0E8bfloat16S1_S1_Lb0ELb0ELb1ELb0ELb0EEvPT1_PT2_PT3_R18conv2d_bf16_paramsS7_S7_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use a readable name

@martien-de-jong
Copy link
Collaborator

Just a convention, 'Adding baseline tests' would label a commit adding unit tests that you expect to change.
"Adding unit tests" would then add tests that you observed were missing, containing behaviour you want to keep.
As a reviewer, I usually just skim commits of the first category, since I will see the changes in the functional commits.

}

void AIE2PassConfig::addPostRewrite() {
if (EnableStackMinimization) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, but checking anyway, don't we have an AIEBasePassConfig where we can add this?

SpillPoint() = default;

/// Construct from a spill/reload instruction.
SpillPoint(MachineInstr &MI, int FI, bool IsStore,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IsStore can easily be derived from MI


DenseMap<int, FrameIndexSpillInfo> FIInfoMap;

/// Collect all spill/reload instructions, compute per-subreg liveness.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you invented "spill point" as a short hand for spill/reload instruction, and I like it.


/// Emit a single subreg spill or reload instruction.
void emitDecomposedInstr(const SpillPoint &SP, const SubRegInfo &SI,
bool IsStore) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like transformSpillPoint()

}

/// Build a SubRegInfo from expansion info.
static SubRegInfo
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubRegInfo constructor?

/// \return True if any instructions were transformed.
bool transformFunction();

/// Compute live registers at \p MI.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please detail before or after MI.

const AIEBaseInstrInfo &TII,
const MachineRegisterInfo &MRI)
: MI(&MI), FI(FI), IsStore(IsStore),
StoreFlags(MI.getOperand(0).isKill() ? RegState::Kill : 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to do your own liveness analysis in the right direction. Could we use that instead of isKill() ?


void AIESpillDecomposition::emitDecomposedInstr(const SpillPoint &SP,
const SubRegInfo &SI,
bool IsStore) const {
Copy link
Collaborator

@martien-de-jong martien-de-jong Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't SP.IsStore sufficient?

continue;

LivePhysRegs LiveRegs(*TRI);
computeLiveness(LiveRegs, MBB, MI, IsStore);
Copy link
Collaborator

@martien-de-jong martien-de-jong Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quadratic && complicated.
You could make two traversals, one for the stores in one direction, one for the loads in the other direction and
maintain the liveness information on these traversals that collect them. Then we don't need to do a liveness traversal for every spillpoint.

Add test that verifies stack objects are ordered by alignment to minimize
padding between objects with different alignment requirements.
Add orderFrameObjects hook to sort stack objects by alignment (highest first),
reducing wasted stack space from padding between objects with different
alignment requirements. AIE has large alignment requirements for vector
registers (up to 64 bytes for BM registers).
@F-Stuckmann F-Stuckmann force-pushed the stuckmann.stackslot.minimize.pr.ready branch from c0ff255 to 8ff443f Compare January 16, 2026 13:58
bool Emitted = false;
for (const SubRegInfo &SI : SP.SubRegs) {
const auto &Stored = StoredSubRegsPerOffset.lookup(SI.Offset);
if (Stored.empty())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check: this is the main reason why we want to deal with all Stores first, which is why we have them in separate lists.

const AIEBaseInstrInfo &TII, const MachineRegisterInfo &MRI);
};

/// Per-FI tracking of spills and reloads.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a perfect place to explain why we separate the spills and the reloads.
Also, like we discussed, mentioning the fact that the algorithm collects all live lanes over all store instructions relating to a particular FrameIndex.

SmallVector<MachineInstr *, 16> ToErase;

for (auto &[FI, Info] : FIInfoMap) {
// Track which subregs were stored at each offset (local to transformation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what 'local to transformation means' It could be 'for this FrameIndex?'

bool Changed = false;
do {
analyzeFunction(MF);
} while (transformFunction() && (Changed = true));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please discuss rationale and convergence of this fixpoint loop. E.g. (just guessing) livein of successor blocks may change, maximum iteration count is loop depth + 1

int StoreFI = -1;
int LoadFI = -1;
const bool IsStore = TII->isStoreToStackSlot(MI, StoreFI) != 0;
const bool IsLoad = !IsStore && TII->isLoadFromStackSlot(MI, LoadFI) != 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this code before. Could we factor this by passing a lambda to a traversal function?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants