Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions compiler/p/codegen/OMRCodeGenerator.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1791,6 +1791,9 @@ bool OMR::Power::CodeGenerator::getSupportsOpCodeForAutoSIMD(TR::CPU *cpu, TR::I
case TR::m2v:
// only P9 has splat byte immediate, otherwise it's too expensive
return cpu->isAtLeast(OMR_PROCESSOR_PPC_P9);
case TR::l2m:
TR_ASSERT_FATAL(et == TR::Int16, "Unsupported vector type %s for l2m (must be Int16)\n", et.toString());
return cpu->isAtLeast(OMR_PROCESSOR_PPC_P8);
default:
return false;
}
Expand Down
2 changes: 1 addition & 1 deletion compiler/p/codegen/OMRInstOpCode.enum
Original file line number Diff line number Diff line change
Expand Up @@ -962,7 +962,7 @@
// xvxexpsp, // VSX Vector Extract Exponent SP
// xvxsigdp, // VSX Vector Extract Significand DP
// xvxsigsp, // VSX Vector Extract Significand SP
// xxbrd, // VSX Vector Byte-Reverse Dword
xxbrd, // VSX Vector Byte-Reverse Dword
// xxbrh, // VSX Vector Byte-Reverse Hword
// xxbrw, // VSX Vector Byte-Reverse Word
xxbrq, // VSX Vector Byte-Reverse Qword
Expand Down
21 changes: 10 additions & 11 deletions compiler/p/codegen/OMRInstOpCodeProperties.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -10938,17 +10938,16 @@
/* PPCOpProp_SyncSideEffectFree, */
/* }, */

/* { */
/* .mnemonic = OMR::InstOpCode::xxbrd, */
/* .name = "xxbrd", */
/* .description = "VSX Vector Byte-Reverse Dword", */
/* .prefix = 0x00000000, */
/* .opcode = 0xF017076C, */
/* .format = FORMAT_UNKNOWN, */
/* .minimumALS = OMR_PROCESSOR_PPC_P9, */
/* .properties = PPCOpProp_IsVSX | */
/* PPCOpProp_SyncSideEffectFree, */
/* }, */
{
/* .mnemonic = */ OMR::InstOpCode::xxbrd,
/* .name = */ "xxbrd",
/* .description = "VSX Vector Byte-Reverse Dword", */
/* .prefix = */ 0x00000000,
/* .opcode = */ 0xF017076C,
/* .format = */ FORMAT_XT_XB,
/* .minimumALS = */ OMR_PROCESSOR_PPC_P9,
/* .properties = */ PPCOpProp_IsVSX | PPCOpProp_SyncSideEffectFree,
},

/* { */
/* .mnemonic = OMR::InstOpCode::xxbrh, */
Expand Down
58 changes: 57 additions & 1 deletion compiler/p/codegen/OMRTreeEvaluator.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1037,7 +1037,63 @@ TR::Register *OMR::Power::TreeEvaluator::i2mEvaluator(TR::Node *node, TR::CodeGe

TR::Register *OMR::Power::TreeEvaluator::l2mEvaluator(TR::Node *node, TR::CodeGenerator *cg)
{
return TR::TreeEvaluator::unImpOpEvaluator(node, cg);
TR::Node *child = node->getFirstChild();

// In order to preserve the boolean array element order on little endian systems, we need to reverse the
// byte/element order of the given input. Due to factors such as instruction availability, there are
// three cases that each need to be handled differently:
// 1.) The child node has refCount == 1
// 2.) The child node has refCount > 1 AND the target system is P9 or higher
// 3.) The child node has refCount > 1 AND the target system is P8 or lower

TR::Register *srcReg;
bool reversed = false;

// Case (1)
if (cg->comp()->target().cpu.isLittleEndian() && child->getReferenceCount() == 1 && child->getRegister() == NULL) {
srcReg = cg->allocateRegister();
TR::LoadStoreHandler::generateLoadNodeSequence(cg, srcReg, child, TR::InstOpCode::ldbrx, 8, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to test if the child node is definitely a memory (load or store) operation? could it be an lregload already, for example?

reversed = true;
} else
srcReg = cg->evaluate(child);

TR::Register *dstReg = cg->allocateRegister(TR_VRF);
TR::Register *tmpReg = cg->allocateRegister(TR_VRF);

node->setRegister(dstReg);

// move to VRF
generateTrg1Src1Instruction(cg, TR::InstOpCode::mtvsrd, node, dstReg, srcReg);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not pre-request that the in-coming double-word is in the right byte order for both LE and BE? and, it can be done very cheaply without heavy lifting below. ld and ldbrx instructions come to mind respectively for BE & LE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good point. *2m and m2* are a bit tricky opcodes. Essentially, they are needed for reading/storing a mask from/to a boolean array. We can think if we can combine them with the actual load/store from/to the array but I am sure a few details will need to be worked out.

fyi: @ehsankianifar @BradleyWood

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, we can also apply the optimization above if the load is available as the child and has reference count of 1 (as we often do).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zl-wang please let us know what you think about treating it as an optimization (see above) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i agree with the conclusion that this is essentially an codegen optimization, peeking into the children trees in order to decide what best-performing instructions to generate.

Copy link
Contributor Author

@midronij midronij Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zl-wang @gita-omr I've implemented this optimization for the case where the refcount of the child lload node is 1, and it seems to work without any issues. However, as I understand it, the fix is not quite as simple for the case where refcount is greater than 1, which is less likely to occur but certainly still something we need to take into account.

Earlier we discussed the possibility of essentially un-commoning the lload node, and simply using ldbrx to get the input boolean array without setting the register, but as I understand it, this is a somewhat risky approach to take, and there isn't any past precedent for it anywhere else in the codebase. As well, since there is a way to reverse byte order in a register on P9 and higher (using the xxbrd instruction), the un-commoning approach would only really be relevant to P8 and below.

Since it seems like the conditions in which the un-comming approach would actually be used (refcount >1, P8 and lower) are pretty narrow, is this something we want to pursue? Or alternatively, in the interest of getting these changes merged but still making sure we avoid that cumbersome multi-instruction sequence to manually reverse the element order of the boolean array, maybe it's something we want to add in a different PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree thatl2mopcode (and similar) are always used with the corresponding scalar load. It's very unlikely that the load is commoned but l2m is not. So it's a very rare situation that, even if addressed, better to be handled in a separate PR.

// Case (2)
if (!reversed && cg->comp()->target().cpu.isLittleEndian()
&& cg->comp()->target().cpu.isAtLeast(OMR_PROCESSOR_PPC_P9))
generateTrg1Src1Instruction(cg, TR::InstOpCode::xxbrd, node, dstReg, dstReg);

// unpack byte-length elements to halfword-length elements
generateTrg1Src1Instruction(cg, TR::InstOpCode::vupkhsb, node, dstReg, dstReg);

// Case (3)
if (!reversed && cg->comp()->target().cpu.isLittleEndian()
&& !cg->comp()->target().cpu.isAtLeast(OMR_PROCESSOR_PPC_P9)) {
generateTrg1ImmInstruction(cg, TR::InstOpCode::vspltisw, node, tmpReg, -16);
generateTrg1Src2Instruction(cg, TR::InstOpCode::vrlw, node, dstReg, dstReg, tmpReg);
generateTrg1Src2Instruction(cg, TR::InstOpCode::vadduwm, node, tmpReg, tmpReg, tmpReg);
generateTrg1Src2Instruction(cg, TR::InstOpCode::vrld, node, dstReg, dstReg, tmpReg);
generateTrg1Src2ImmInstruction(cg, TR::InstOpCode::xxpermdi, node, dstReg, dstReg, dstReg, 2);
}

// since OMR assumes that boolean values are represented as 0x00 for false and 0x01 for true, we can create an
// all 0/1 mask by subtracting from 0:
// 0-1 = -1 = 0xFF...
// 0-0 = 0
generateTrg1ImmInstruction(cg, TR::InstOpCode::vspltisw, node, tmpReg, 0);
generateTrg1Src2Instruction(cg, TR::InstOpCode::vsubuhm, node, dstReg, tmpReg, dstReg);

cg->stopUsingRegister(tmpReg);
cg->decReferenceCount(child);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since srcReg is possibly not set on child node, decReferenceCount on child node doesn't provide the functionality of managing its liveness. so, you might need to do it here.


return dstReg;
}

TR::Register *OMR::Power::TreeEvaluator::v2mEvaluator(TR::Node *node, TR::CodeGenerator *cg)
Expand Down