WIP: Implement vectorized l2m on PPC #7909
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1037,7 +1037,63 @@ TR::Register *OMR::Power::TreeEvaluator::i2mEvaluator(TR::Node *node, TR::CodeGe | |
|
||
TR::Register *OMR::Power::TreeEvaluator::l2mEvaluator(TR::Node *node, TR::CodeGenerator *cg) | ||
{ | ||
return TR::TreeEvaluator::unImpOpEvaluator(node, cg); | ||
TR::Node *child = node->getFirstChild(); | ||
|
||
// In order to preserve the boolean array element order on little endian systems, we need to reverse the | ||
// byte/element order of the given input. Due to factors such as instruction availability, there are | ||
// three cases that each need to be handled differently: | ||
// 1.) The child node has refCount == 1 | ||
// 2.) The child node has refCount > 1 AND the target system is P9 or higher | ||
// 3.) The child node has refCount > 1 AND the target system is P8 or lower | ||
|
||
TR::Register *srcReg; | ||
bool reversed = false; | ||
|
||
// Case (1) | ||
if (cg->comp()->target().cpu.isLittleEndian() && child->getReferenceCount() == 1 && child->getRegister() == NULL) { | ||
srcReg = cg->allocateRegister(); | ||
TR::LoadStoreHandler::generateLoadNodeSequence(cg, srcReg, child, TR::InstOpCode::ldbrx, 8, true); | ||
reversed = true; | ||
} else | ||
srcReg = cg->evaluate(child); | ||
|
||
TR::Register *dstReg = cg->allocateRegister(TR_VRF); | ||
TR::Register *tmpReg = cg->allocateRegister(TR_VRF); | ||
|
||
node->setRegister(dstReg); | ||
|
||
// move to VRF | ||
midronij marked this conversation as resolved.
|
||
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrd, node, dstReg, srcReg); | ||
|
||
why not pre-request that the in-coming double-word is in the right byte order for both LE and BE? and, it can be done very cheaply without heavy lifting below.
This is a very good point.
Of course, we can also apply the optimization above if the load is available as the child and has reference count of 1 (as we often do).
@zl-wang please let us know what you think about treating it as an optimization (see above)?
yes, i agree with the conclusion that this is essentially a codegen optimization, peeking into the children trees in order to decide what best-performing instructions to generate.
@zl-wang @gita-omr I've implemented this optimization for the case where the refcount of the child Earlier we discussed the possibility of essentially un-commoning the Since it seems like the conditions in which the un-commoning approach would actually be used (refcount > 1, P8 and lower) are pretty narrow, is this something we want to pursue? Or alternatively, in the interest of getting these changes merged but still making sure we avoid that cumbersome multi-instruction sequence to manually reverse the element order of the boolean array, maybe it's something we want to add in a different PR?
I agree that |
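To illustrate the point discussed in this thread, here is a minimal standalone sketch (plain C++, not OMR code) showing that reversing the bytes while loading yields the same doubleword as loading first and reversing in a register afterwards. The helper names `loadNative`, `loadByteReversed`, and `reverseBytes` are hypothetical; they only model the effect of a byte-reversed load such as `ldbrx` versus an in-register reversal such as `xxbrd`, and say nothing about instruction cost.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reverse the eight bytes of a 64-bit value (models an in-register byte swap).
inline std::uint64_t reverseBytes(std::uint64_t v) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFF);  // append the lowest remaining byte of v
        v >>= 8;
    }
    return r;
}

// Load eight bytes from memory in the host's native byte order.
inline std::uint64_t loadNative(const unsigned char *p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Load eight bytes with their order reversed (models a byte-reversed load).
inline std::uint64_t loadByteReversed(const unsigned char *p) {
    unsigned char tmp[8];
    for (int i = 0; i < 8; ++i)
        tmp[i] = p[7 - i];
    return loadNative(tmp);
}
```

Both paths produce the same register contents, which is why the single-reference case can fold the reversal into the load while the shared-child case has to reverse after the fact.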
||
// Case (2) | ||
if (!reversed && cg->comp()->target().cpu.isLittleEndian() | ||
&& cg->comp()->target().cpu.isAtLeast(OMR_PROCESSOR_PPC_P9)) | ||
generateTrg1Src1Instruction(cg, TR::InstOpCode::xxbrd, node, dstReg, dstReg); | ||
|
||
// unpack byte-length elements to halfword-length elements | ||
generateTrg1Src1Instruction(cg, TR::InstOpCode::vupkhsb, node, dstReg, dstReg); | ||
|
||
// Case (3) | ||
if (!reversed && cg->comp()->target().cpu.isLittleEndian() | ||
&& !cg->comp()->target().cpu.isAtLeast(OMR_PROCESSOR_PPC_P9)) { | ||
generateTrg1ImmInstruction(cg, TR::InstOpCode::vspltisw, node, tmpReg, -16); | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::vrlw, node, dstReg, dstReg, tmpReg); | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::vadduwm, node, tmpReg, tmpReg, tmpReg); | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::vrld, node, dstReg, dstReg, tmpReg); | ||
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::xxpermdi, node, dstReg, dstReg, dstReg, 2); | ||
} | ||
|
||
// since OMR assumes that boolean values are represented as 0x00 for false and 0x01 for true, we can create an | ||
// all 0/1 mask by subtracting from 0: | ||
// 0-1 = -1 = 0xFF... | ||
// 0-0 = 0 | ||
generateTrg1ImmInstruction(cg, TR::InstOpCode::vspltisw, node, tmpReg, 0); | ||
generateTrg1Src2Instruction(cg, TR::InstOpCode::vsubuhm, node, dstReg, tmpReg, dstReg); | ||
|
||
cg->stopUsingRegister(tmpReg); | ||
cg->decReferenceCount(child); | ||
since srcReg is possibly not set on child node, decReferenceCount on child node doesn't provide the functionality of managing its liveness. so, you might need to do it here. |
||
|
||
return dstReg; | ||
} | ||
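The Case (3) sequence in the evaluator above can be checked with a small standalone model (plain C++, not OMR code). It is a sketch under stated assumptions: the 128-bit register is modelled as two 64-bit doublewords in big-endian element order, the rotate counts are treated as uniform across elements, and all helper names (`Vec`, `packHalfwords`, `reverseHalfwords`, and so on) are hypothetical. Splatting -16 gives 0xFFFFFFF0 per word, whose low 5 bits are 16; doubling it gives 0xFFFFFFE0, whose low 6 bits are 32.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 128-bit vector as two doublewords; element 0 is the most significant half.
using Vec = std::array<std::uint64_t, 2>;

inline std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

inline std::uint64_t rotl64(std::uint64_t x, unsigned n) {
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

// Pack eight halfwords into a Vec, halfword 0 in the most significant position.
inline Vec packHalfwords(const std::array<std::uint16_t, 8> &h) {
    Vec v{};
    for (int i = 0; i < 8; ++i)
        v[i / 4] = (v[i / 4] << 16) | h[i];
    return v;
}

// Rotate every 32-bit word left by n (models vrlw with a uniform count).
inline Vec rotateWords(Vec v, unsigned n) {
    for (auto &d : v) {
        std::uint32_t hi = rotl32(static_cast<std::uint32_t>(d >> 32), n);
        std::uint32_t lo = rotl32(static_cast<std::uint32_t>(d), n);
        d = (static_cast<std::uint64_t>(hi) << 32) | lo;
    }
    return v;
}

// Rotate every 64-bit doubleword left by n (models vrld with a uniform count).
inline Vec rotateDoublewords(Vec v, unsigned n) {
    for (auto &d : v)
        d = rotl64(d, n);
    return v;
}

// Reverse the order of the eight halfwords using the three-step sequence.
inline Vec reverseHalfwords(Vec v) {
    const std::uint32_t splat = static_cast<std::uint32_t>(-16);  // 0xFFFFFFF0
    v = rotateWords(v, splat & 31);          // swap halfwords within each word
    const std::uint32_t doubled = splat + splat;                  // 0xFFFFFFE0
    v = rotateDoublewords(v, doubled & 63);  // swap words within each doubleword
    return Vec{v[1], v[0]};                  // swap the two doublewords
}
```

Each step swaps neighbours at the next larger granularity (halfwords, words, doublewords), so three steps are enough to fully reverse eight elements.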
|
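The unpack-and-subtract step at the end of the evaluator can be modelled the same way. This is a hedged scalar sketch in plain C++, assuming (as the code comment states) that booleans are stored as 0x00 or 0x01; `unpackBytes` and `boolsToMask` are hypothetical names, and lane ordering is deliberately left out of the model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sign-extend eight bytes to eight 16-bit lanes (models the unpack step).
inline std::array<std::uint16_t, 8> unpackBytes(const std::array<std::int8_t, 8> &in) {
    std::array<std::uint16_t, 8> out{};
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = static_cast<std::uint16_t>(static_cast<std::int16_t>(in[i]));
    return out;
}

// Subtract each lane from zero modulo 2^16: 0 - 1 = 0xFFFF, 0 - 0 = 0x0000.
inline std::array<std::uint16_t, 8> boolsToMask(std::array<std::uint16_t, 8> lanes) {
    for (auto &lane : lanes)
        lane = static_cast<std::uint16_t>(0u - lane);
    return lanes;
}
```

The result is only an all-ones or all-zeros mask when every input byte is exactly 0 or 1; any other byte value would produce a lane that is neither.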
||
TR::Register *OMR::Power::TreeEvaluator::v2mEvaluator(TR::Node *node, TR::CodeGenerator *cg) | ||
|
Don't you need to test whether the child node is definitely a memory (load or store) operation? Could it already be an lregload, for example?