Contributor

@midronij midronij commented Aug 26, 2025

Implement PPC codegen for l2m (Long to Mask) on P8+. This operation accepts eight one-byte elements of a given boolean array (read from memory using a doubleword load) and converts them into a ShortVector mask with the corresponding boolean values.
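As a rough illustration of the semantics described above, here is a scalar reference model in C++. It is only a sketch under assumptions not stated in the PR: the name `l2mReference` is hypothetical, element i of the boolean array is assumed to sit in byte i of the 64-bit input (least significant byte first), and a true lane is assumed to be represented as all ones (0xFFFF).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical reference model, not the PPC codegen itself.
// Assumption: boolean element i occupies bits [8*i, 8*i+7] of `packed`.
std::array<uint16_t, 8> l2mReference(uint64_t packed) {
    std::array<uint16_t, 8> mask{};
    for (unsigned i = 0; i < 8; ++i) {
        uint8_t b = static_cast<uint8_t>(packed >> (8 * i));
        assert(b <= 1); // boolean array elements are expected to be 0 or 1
        // Subtracting from zero turns 1 into 0xFFFF and leaves 0 as 0,
        // which is the per-halfword effect assumed for a mask lane here.
        mask[i] = static_cast<uint16_t>(0u - b);
    }
    return mask;
}
```

For example, an input whose bytes 0 and 2 are set yields all-ones lanes 0 and 2 and zero lanes everywhere else.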

@midronij midronij changed the title WIP: Implement PPC codegen for l2m Implement PPC codegen for l2m Sep 9, 2025
@midronij midronij force-pushed the l2m branch 2 times, most recently from 1107bfb to 58fb965 Compare September 9, 2025 19:59
@midronij midronij changed the title Implement PPC codegen for l2m WIP: Implement PPC codegen for l2m Sep 12, 2025
@midronij midronij changed the title WIP: Implement PPC codegen for l2m WIP: Implement vectorized l2m on PPC Sep 18, 2025
@midronij midronij changed the title WIP: Implement vectorized l2m on PPC Implement vectorized l2m on PPC Sep 23, 2025
@midronij
Contributor Author

@gita-omr @zl-wang could you please review when you have a chance?


// move to VRF
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrd, node, dstReg, srcReg);

Contributor

Why not require up front that the incoming doubleword is in the right byte order for both LE and BE? That can be done very cheaply, without the heavy lifting below: the ld and ldbrx instructions come to mind for BE and LE respectively.
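To make the byte-order point concrete, here is a small portable C++ model. It is a sketch only: the function names and the sample array are hypothetical, and the functions imitate the register contents a load would produce rather than emitting any PPC instruction. It shows that a byte-reversed load on a little-endian target gives the same register value as a plain load on a big-endian target, and that the same value can be reached by reversing bytes after a plain little-endian load.

```cpp
#include <cassert>
#include <cstdint>

// Sample boolean array used for illustration only.
static const uint8_t kSampleFlags[8] = {1, 0, 0, 0, 0, 0, 1, 1};

// Models a plain doubleword load on a little-endian target:
// bytes[0] ends up in the least significant byte.
uint64_t loadLittleEndian64(const uint8_t *bytes) {
    assert(bytes != nullptr);
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i) v = (v << 8) | bytes[i];
    return v;
}

// Models a plain load on big-endian, or a byte-reversed load on
// little-endian: bytes[0] ends up in the most significant byte.
uint64_t loadBigEndian64(const uint8_t *bytes) {
    assert(bytes != nullptr);
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | bytes[i];
    return v;
}

// Software equivalent of reversing the byte order inside a register.
uint64_t byteReverse64(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFF);
        v >>= 8;
    }
    return r;
}
```

Folding the reversal into the load avoids the separate in-register step entirely, which is the saving the comment points at.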

Contributor

This is a very good point. The *2m and m2* opcodes are a bit tricky. Essentially, they are needed for reading/storing a mask from/to a boolean array. We could consider combining them with the actual load/store from/to the array, but I am sure a few details would need to be worked out.

fyi: @ehsankianifar @BradleyWood

Contributor

Of course, we can also apply the optimization above if the load is available as the child and has a reference count of 1 (as we often do).

Contributor

@zl-wang please let us know what you think about treating it as an optimization (see above).

Contributor

Yes, I agree with the conclusion that this is essentially a codegen optimization: peeking into the child trees in order to decide which best-performing instructions to generate.

Contributor Author

@midronij midronij Oct 6, 2025

@zl-wang @gita-omr I've implemented this optimization for the case where the refcount of the child lload node is 1, and it seems to work without any issues. However, as I understand it, the fix is not quite as simple for the case where refcount is greater than 1, which is less likely to occur but certainly still something we need to take into account.

Earlier we discussed the possibility of essentially un-commoning the lload node and simply using ldbrx to get the input boolean array without setting the register, but as I understand it, this is a somewhat risky approach, and there is no precedent for it anywhere else in the codebase. Also, since there is a way to reverse byte order in a register on P9 and higher (using the xxbrd instruction), the un-commoning approach would only really be relevant to P8 and below.

Since the conditions under which the un-commoning approach would actually be used (refcount > 1, P8 and lower) seem pretty narrow, is this something we want to pursue? Or alternatively, in the interest of getting these changes merged while still avoiding that cumbersome multi-instruction sequence to manually reverse the element order of the boolean array, maybe it's something we want to add in a different PR?
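The cases discussed in this thread can be summarized as a small decision function. This is a hedged C++ sketch, not the OMR code generator API: `ByteOrderFix`, `ChildInfo`, and `chooseByteOrderFix` are hypothetical names, and the `isMemoryLoad` check is an added assumption about when folding the reversal into the load would be safe.

```cpp
#include <cassert>

// Hypothetical names throughout; this only models the choices discussed.
enum class ByteOrderFix {
    None,             // big-endian: a plain ld already yields the wanted order
    ByteReversedLoad, // fold the reversal into the load (ldbrx)
    RegisterByteSwap, // reverse in a register (xxbrd, P9 and higher)
    ManualSequence    // multi-instruction fallback (P8)
};

struct ChildInfo {
    bool isMemoryLoad;   // assumption: child is a load from memory, e.g. lload
    int  referenceCount; // remaining references to the child node
    bool hasRegister;    // child was already evaluated into a register
};

ByteOrderFix chooseByteOrderFix(bool littleEndian, bool p9OrLater, const ChildInfo &child) {
    assert(child.referenceCount >= 1);
    if (!littleEndian)
        return ByteOrderFix::None;
    // The load can only be replaced when nothing else consumes its result.
    if (child.isMemoryLoad && child.referenceCount == 1 && !child.hasRegister)
        return ByteOrderFix::ByteReversedLoad;
    return p9OrLater ? ByteOrderFix::RegisterByteSwap : ByteOrderFix::ManualSequence;
}
```

Under this model the manual sequence is reached only on little-endian P8 when the load cannot be folded, which matches the narrow case described in the comment.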

Contributor

I agree that the l2m opcode (and similar ones) is always used with the corresponding scalar load. It's very unlikely that the load is commoned but l2m is not. So it's a very rare situation that, even if addressed, is better handled in a separate PR.

@midronij midronij force-pushed the l2m branch 2 times, most recently from b6f49ad to ad5dc2b Compare September 26, 2025 14:23
@midronij midronij force-pushed the l2m branch 3 times, most recently from 84e7eb9 to 9560324 Compare October 9, 2025 18:52
@midronij midronij force-pushed the l2m branch 2 times, most recently from 32aff9c to 50233d1 Compare October 15, 2025 14:21
Contributor

@zl-wang zl-wang left a comment

otherwise, it looks good to me.

// Case (1)
if (cg->comp()->target().cpu.isLittleEndian() && child->getReferenceCount() == 1 && child->getRegister() == NULL) {
    srcReg = cg->allocateRegister();
    TR::LoadStoreHandler::generateLoadNodeSequence(cg, srcReg, child, TR::InstOpCode::ldbrx, 8, true);
Contributor

Don't you need to test that the child node is definitely a memory (load or store) operation? Could it already be an lregload, for example?

generateTrg1Src2Instruction(cg, TR::InstOpCode::vsubuhm, node, dstReg, tmpReg, dstReg);

cg->stopUsingRegister(tmpReg);
cg->decReferenceCount(child);
Contributor

Since srcReg is possibly not set on the child node, decReferenceCount on the child node does not manage its liveness, so you might need to do that here.

@midronij midronij changed the title Implement vectorized l2m on PPC WIP: Implement vectorized l2m on PPC Oct 16, 2025