-
Notifications
You must be signed in to change notification settings - Fork 25
[AIE] Improve Zero-overhead loop 112-bytes padding #440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4e6d300
to
8d72538
Compare
@@ -435,6 +439,10 @@ struct AIEBaseInstrInfo : public TargetInstrInfo { | |||
// setting LE Reg | |||
bool isZOLSetupBundle(MachineBasicBlock::iterator MII) const; | |||
bool isLastZOLSetupBundleInMBB(MachineBasicBlock::iterator MII) const; | |||
virtual const AIE::Bundle<MachineInstr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check: we return by value, so I guess we have the implementation of AIE::Bundle available here. Can we drop the forward declaration?
Core_Insn_Count.csv |
if (IsZOLBody && ZOLBodyRegionsCount > 0) { | ||
AlgnCandidates.emplace_back(MI); | ||
ZOLBodyRegionsCount--; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you're actually padding the bundles in the loop. I had thought you would just subtract the size of the body in units of full bundles from LoopSetupDistance, putting all padding in the preheader. That would make the algorithm logic much simpler. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am padding the loop bundles this way we can avoid inserting extra nop bundles in most of the cases. Consider the case where preheader size after writing to the ZOL setup instruction is 0, we will end of inserting some nop bundles if the loop size is not sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. But I'm still worried about the nops we insert in the gemm example, where the body was large enough.
@@ -12,50 +12,6 @@ | |||
|
|||
--- | | |||
define dso_local void @addStore(ptr addrspace(5) noalias nocapture writeonly %d, i32 noundef %n) local_unnamed_addr #0 { | |||
; CHECK-LABEL: addStore: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit update_llc_test..
AlgnCandidates.emplace_back(MI); | ||
ZOLBodyRegionsCount--; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, I think LoopSetupDistance and ZOLBodyRegionsCount are the same thing. We either have a preheader or a body of a ZOL, not both, and we have a number of bundles that should be expanded to 16 bytes. The body one computes that amount of bundles before hand, the preheader starts it at the last loop setup instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
a0361fa
to
3731b75
Compare
65c621d
to
f58c740
Compare
if (Last == MBB.end()) | ||
return 0; | ||
|
||
for (auto MI = MBB.begin(), End = MBB.end(); MI != End; ++MI) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for (const auto &MI: MBB)
return 0; | ||
|
||
for (auto MI = MBB.begin(), End = MBB.end(); MI != End; ++MI) { | ||
if (MI->isBundle() && isZOLSetupBundle(MI) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between isZOLSetupBundle
and isLastZOLSetupBundleInMBB
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be more ZOLSetUpBundles in a block, but only one is the last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can streamline the code by running the loop in reverse and accumulating the region size for each bundle that we see. Then we can return the region size as soon as we hit a ZOLSetUpBundle. llvm::reverse(MBB) gives you the reverse range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updated the code to use loop in reverse, please check.
bool IsCall = false; | ||
auto ZOLSupport = getZOLSupport(); | ||
assert(ZOLSupport); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is getAlignmentBoundaries
used just with ZOL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Perhaps move the assert with the if (IsZOLBody)
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that this is not used only in case of ZOL, but why this should assert even if it has no ZOL? it is a query and just returns pre-populated struct object value.
// must be a 112-byte gap (in PM address space) between writing to the ls, | ||
// le, and lc registers and the LoopEnd instruction. | ||
ZOLBundlesCount = getZOLBundlesCount(MBB) - 1; | ||
if (ZOLBundlesCount < LoopSetupDist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoopSetupDist
and LoopSetupDistance
are confusing. Could you please rename one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the name.
|
||
unsigned AIEBaseInstrInfo::getPostZOLRegionSizeFromZOLBody( | ||
MachineBasicBlock &LoopMBB) const { | ||
for (auto *Pred : LoopMBB.predecessors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use AIELoopUtils::getLoopPredecessor
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is used only when the predecessor is a loop or isSingleMBBLoop. This function is called from ZOL Body MBB and we want to get the pre-header MBB.
return 0; | ||
} | ||
|
||
unsigned AIEBaseInstrInfo::getPostZOLRegionSizeFromZOLBody( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that this Post
is Pre
, or even PreHeader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is computing the number of bytes after the last ZOL setup instruction in the preheader, I guess name captures that, what is the name you suggest ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe getPostZOLSetupRegionSize
+ this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, also moved into a lamda function.
f58c740
to
9106adc
Compare
9106adc
to
618aeae
Compare
618aeae
to
87a5279
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great work @abnikant !
Avoid inserting extra nop bundles if there are enough bundles between ZOLSetup instructions and LoopEnd to create 112-bytes padding. It takes into account the bundles in the loop body.
In plan: More heuristics to be added 1) Avoid elongating if there are 112-bytes available between writing of (ls, le, and lc) and (loop end - 1), for the case where number of loop bundles < LoopSetupDistance. The EgdeLatency needs to be updated. 2) Elongate the bundles after deducting the available size.
QoR report :

