Skip to content

[AIE] Improve Zero-overhead loop 112-bytes padding #440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 23, 2025

Conversation

abnikant
Copy link
Collaborator

@abnikant abnikant commented Apr 18, 2025

Avoid inserting extra nop bundles if there are enough bundles between ZOLSetup instructions and LoopEnd to create 112-bytes padding. It takes into account the bundles in the loop body.

  1. Added heuristic to check for available size between ZOLSetup last bundle and (LoopEnd - 1) - for the cases where number of loop bundles are >= LoopSetupDistance.
    In plan: More heuristics to be added 1) Avoid elongating if there are 112-bytes available between writing of (ls, le, and lc) and (loop end - 1), for the case where number of loop bundles < LoopSetupDistance. The EgdeLatency needs to be updated. 2) Elongate the bundles after deducting the available size.

QoR report :
Core_PMSize
Core_Insn_Count

@@ -435,6 +439,10 @@ struct AIEBaseInstrInfo : public TargetInstrInfo {
// setting LE Reg
bool isZOLSetupBundle(MachineBasicBlock::iterator MII) const;
bool isLastZOLSetupBundleInMBB(MachineBasicBlock::iterator MII) const;
virtual const AIE::Bundle<MachineInstr>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check: we return by value, so I guess we have the implementation of AIE::Bundle available here. Can we drop the forward declaration?

@abnikant
Copy link
Collaborator Author

abnikant commented Apr 22, 2025

Core_Insn_Count.csv
Core_PMSize.csv
Core_StackSize.csv
QoR data w.r.t. to original run with Peano.

if (IsZOLBody && ZOLBodyRegionsCount > 0) {
AlgnCandidates.emplace_back(MI);
ZOLBodyRegionsCount--;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're actually padding the bundles in the loop. I had thought you would just subtract the size of the body in units of full bundles from LoopSetupDistance, putting all padding in the preheader. That would make the algorithm logic much simpler. Am I missing something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am padding the loop bundles this way we can avoid inserting extra nop bundles in most of the cases. Consider the case where preheader size after writing to the ZOL setup instruction is 0, we will end of inserting some nop bundles if the loop size is not sufficient.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. But I'm still worried about the nops we insert in the gemm example, where the body was large enough.

@@ -12,50 +12,6 @@

--- |
define dso_local void @addStore(ptr addrspace(5) noalias nocapture writeonly %d, i32 noundef %n) local_unnamed_addr #0 {
; CHECK-LABEL: addStore:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit update_llc_test..

AlgnCandidates.emplace_back(MI);
ZOLBodyRegionsCount--;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, I think LoopSetupDistance and ZOLBodyRegionsCount are the same thing. We either have a preheader or a body of a ZOL, not both, and we have a number of bundles that should be expanded to 16 bytes. The body one computes that amount of bundles before hand, the preheader starts it at the last loop setup instruction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@abnikant abnikant force-pushed the aie2p.zol.distance branch from a0361fa to 3731b75 Compare April 23, 2025 09:22
@abnikant abnikant force-pushed the aie2p.zol.distance branch 3 times, most recently from 65c621d to f58c740 Compare April 23, 2025 12:11
if (Last == MBB.end())
return 0;

for (auto MI = MBB.begin(), End = MBB.end(); MI != End; ++MI) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for (const auto &MI: MBB)

return 0;

for (auto MI = MBB.begin(), End = MBB.end(); MI != End; ++MI) {
if (MI->isBundle() && isZOLSetupBundle(MI) &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between isZOLSetupBundle and isLastZOLSetupBundleInMBB?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be more ZOLSetUpBundles in a block, but only one is the last.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can streamline the code by running the loop in reverse and accumulating the region size for each bundle that we see. Then we can return the region size as soon as we hit a ZOLSetUpBundle. llvm::reverse(MBB) gives you the reverse range.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated the code to use loop in reverse, please check.

bool IsCall = false;
auto ZOLSupport = getZOLSupport();
assert(ZOLSupport);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is getAlignmentBoundaries used just with ZOL?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Perhaps move the assert with the if (IsZOLBody) below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that this is not used only in case of ZOL, but why this should assert even if it has no ZOL? it is a query and just returns pre-populated struct object value.

// must be a 112-byte gap (in PM address space) between writing to the ls,
// le, and lc registers and the LoopEnd instruction.
ZOLBundlesCount = getZOLBundlesCount(MBB) - 1;
if (ZOLBundlesCount < LoopSetupDist)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoopSetupDist and LoopSetupDistance are confusing. Could you please rename one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the name.


unsigned AIEBaseInstrInfo::getPostZOLRegionSizeFromZOLBody(
MachineBasicBlock &LoopMBB) const {
for (auto *Pred : LoopMBB.predecessors()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use AIELoopUtils::getLoopPredecessor here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this is used only when the predecessor is a loop or isSingleMBBLoop. This function is called from ZOL Body MBB and we want to get the pre-header MBB.

return 0;
}

unsigned AIEBaseInstrInfo::getPostZOLRegionSizeFromZOLBody(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that this Post is Pre, or even PreHeader.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is computing the number of bytes after the last ZOL setup instruction in the preheader, I guess name captures that, what is the name you suggest ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe getPostZOLSetupRegionSize + this comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, also moved into a lamda function.

@abnikant abnikant force-pushed the aie2p.zol.distance branch from f58c740 to 9106adc Compare April 23, 2025 14:28
@abnikant abnikant force-pushed the aie2p.zol.distance branch from 9106adc to 618aeae Compare April 23, 2025 14:30
@abnikant abnikant force-pushed the aie2p.zol.distance branch from 618aeae to 87a5279 Compare April 23, 2025 15:49
Copy link
Collaborator

@andcarminati andcarminati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great work @abnikant !

@abnikant abnikant merged commit b1ac573 into aie-public Apr 23, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants