-
Notifications
You must be signed in to change notification settings - Fork 409
WIP: Implement vectorized l2m on PPC #7909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1107bfb
to
58fb965
Compare
|
||
// move to VRF | ||
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrd, node, dstReg, srcReg); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not pre-request that the in-coming double-word is in the right byte order for both LE and BE? and, it can be done very cheaply without heavy lifting below. ld
and ldbrx
instructions come to mind respectively for BE & LE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good point. *2m
and m2*
are a bit tricky opcodes. Essentially, they are needed for reading/storing a mask from/to a boolean array. We can think if we can combine them with the actual load/store from/to the array but I am sure a few details will need to be worked out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, we can also apply the optimization above if the load is available as the child and has reference count of 1 (as we often do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zl-wang please let us know what you think about treating it as an optimization (see above) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i agree with the conclusion that this is essentially an codegen optimization, peeking into the children trees in order to decide what best-performing instructions to generate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zl-wang @gita-omr I've implemented this optimization for the case where the refcount of the child lload
node is 1, and it seems to work without any issues. However, as I understand it, the fix is not quite as simple for the case where refcount is greater than 1, which is less likely to occur but certainly still something we need to take into account.
Earlier we discussed the possibility of essentially un-commoning the lload
node, and simply using ldbrx
to get the input boolean array without setting the register, but as I understand it, this is a somewhat risky approach to take, and there isn't any past precedent for it anywhere else in the codebase. As well, since there is a way to reverse byte order in a register on P9 and higher (using the xxbrd
instruction), the un-commoning approach would only really be relevant to P8 and below.
Since it seems like the conditions in which the un-comming approach would actually be used (refcount >1, P8 and lower) are pretty narrow, is this something we want to pursue? Or alternatively, in the interest of getting these changes merged but still making sure we avoid that cumbersome multi-instruction sequence to manually reverse the element order of the boolean array, maybe it's something we want to add in a different PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree thatl2m
opcode (and similar) are always used with the corresponding scalar load. It's very unlikely that the load is commoned but l2m
is not. So it's a very rare situation that, even if addressed, better to be handled in a separate PR.
b6f49ad
to
ad5dc2b
Compare
84e7eb9
to
9560324
Compare
32aff9c
to
50233d1
Compare
Signed-off-by: midronij <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise, it looks good to me.
// Case (1) | ||
if (cg->comp()->target().cpu.isLittleEndian() && child->getReferenceCount() == 1 && child->getRegister() == NULL) { | ||
srcReg = cg->allocateRegister(); | ||
TR::LoadStoreHandler::generateLoadNodeSequence(cg, srcReg, child, TR::InstOpCode::ldbrx, 8, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to test if the child node is definitely a memory (load or store) operation? could it be an lregload already, for example?
generateTrg1Src2Instruction(cg, TR::InstOpCode::vsubuhm, node, dstReg, tmpReg, dstReg); | ||
|
||
cg->stopUsingRegister(tmpReg); | ||
cg->decReferenceCount(child); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since srcReg is possibly not set on child node, decReferenceCount on child node doesn't provide the functionality of managing its liveness. so, you might need to do it here.
Implement PPC codegen for
l2m
(Long to Mask) on P8+. This operation accepts eight byte elements of a given boolean array (read from memory using a doubleword load) and converts it into a ShortVector mask with the corresponding boolean values.