Arm64 SVE: Better optimise zero/allbits vectors #115566
base: main
Conversation
Fixes dotnet#114443

* IsVectorZero() should allow for all zero vectors and false masks that have been converted to vectors.
* IsVectorAllBitsSet() should allow for all bits set vectors and true masks that have been converted to vectors.
* IsMaskZero() should allow for false masks and all zero vectors that have been converted to masks.
* IsMaskAllBitsSet() should allow for true masks and all bits set vectors that have been converted to masks.

In addition:

* Fix up all the errors caused by these changes.
* Add a bunch of asmcheck tests.
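Roughly, the gentree.h change the first two bullets describe has this shape (a sketch only; the predicates for the existing constant-vector path, IsCnsVec() and AsVecCon()->IsAllBitsSet(), are assumptions about the surrounding code, while OperIsConvertMaskToVector() and OperIsTrueMask() appear in the diff quoted below):

```cpp
// Sketch: IsVectorAllBitsSet() extended so that a true mask wrapped in a
// ConvertMaskToVector node is also treated as an all-bits-set vector.
bool GenTree::IsVectorAllBitsSet() const
{
    // Existing path: an actual all-bits-set vector constant.
    if (IsCnsVec() && AsVecCon()->IsAllBitsSet())
    {
        return true;
    }

#if defined(TARGET_ARM64)
    // Can also be an all true mask that has been converted to a vector.
    if (OperIsConvertMaskToVector() && AsHWIntrinsic()->Op(1)->OperIsTrueMask())
    {
        return true;
    }
#endif

    return false;
}
```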
src/coreclr/jit/lowerarmarch.cpp (outdated)
{
    // When we are merging with zero, we can specialize
    // and avoid instantiating the vector constant.
    // Do this only if op1 was AllTrueMask
    MakeSrcContained(node, op3);
    if (op3->OperIsConvertMaskToVector())
Wonder if we can go as far as putting the OperIsConvertMaskToVector() check inside MakeSrcContained() (and also one for OperIsConvertVectorToMask()).
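Something like the following sketch (illustrative only; the real MakeSrcContained() contains more logic and its exact signature is an assumption here):

```cpp
// Sketch of the suggestion: let MakeSrcContained() itself recognise
// mask<->vector conversion wrappers instead of every caller checking first.
// The existing containment logic is elided; this is not the actual
// dotnet/runtime implementation.
void Lowering::MakeSrcContained(GenTree* parentNode, GenTree* childNode)
{
#if defined(TARGET_ARM64)
    if (childNode->OperIsConvertMaskToVector() || childNode->OperIsConvertVectorToMask())
    {
        // Contain (or look through) the conversion so the consumer can use
        // the underlying mask/vector operand directly.
    }
#endif

    // ... existing containment checks and childNode->SetContained() ...
}
```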
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Currently there are some issues around non-faulting LoadVectors, and I need to check I've not created any code size regressions.
src/coreclr/jit/gentree.h (outdated)

#if defined(TARGET_ARM64)
// Can also be an all true mask that has been converted to a vector.
if (OperIsConvertMaskToVector() && AsHWIntrinsic()->Op(1)->OperIsTrueMask())
This doesn't feel correct. We have CnsVec and we have CnsMsk. A ConvertMaskToVector(CnsMsk) should just be folded to CnsVec instead; and vice versa.
Right now, we're essentially treating these two differently:

Sve.ConditionalSelect(Sve.CreateTrueMaskInt32(), a, b);
Sve.ConditionalSelect(Vector<int>.AllBitsSet, a, b);

where the second will be imported with a ConvertVectorToMask() on arg1. Are you suggesting the AllBitsSet and convert should be folded into a CreateTrueMaskInt32()? When do you suggest that should be done?
Sve.CreateTrueMaskInt32() should rather likely be imported directly as GT_CNS_VEC instead; that way it has the same normalized representation and can participate in various IsConstant checks, constant folding, and other optimizations without special handling.

We have both GT_CNS_VEC for TYP_SIMD constants and GT_CNS_MSK for TYP_MASK constants. We already have general purpose handling that recognizes ConvertMaskToVector(GT_CNS_MSK) and ConvertVectorToMask(GT_CNS_VEC) here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L32269-L32324

So provided you're importing it as a constant (because that is what it creates), everything should be fine. You can then optimize codegen for GT_CNS_MSK to generate PTRUE/PFALSE in the scenarios that need it, just as we optimize GT_CNS_VEC to special-case certain constants (i.e. using mvni for AllBitsSet and movi for Zero): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/codegenarm64.cpp#L2268-L2340
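As a standalone illustration of why that folding is sound (this is not JIT code, just a model of the lane semantics):

```cpp
// Standalone illustration of the constant folding being described: a constant
// mask expands lane-by-lane into an all-bits/zero vector constant, and that
// vector folds back to the same mask.
#include <cstdint>
#include <cstdio>

int main()
{
    // A hypothetical 4-lane mask constant: lanes 0 and 2 are active.
    bool mask[4] = {true, false, true, false};

    // ConvertMaskToVector(GT_CNS_MSK) -> GT_CNS_VEC:
    // each active lane becomes AllBitsSet, each inactive lane becomes Zero.
    uint32_t vec[4];
    for (int i = 0; i < 4; i++)
    {
        vec[i] = mask[i] ? 0xFFFFFFFFu : 0u;
    }

    // ConvertVectorToMask(GT_CNS_VEC) -> GT_CNS_MSK:
    // here a lane is treated as active if it is nonzero; the exact lane
    // predicate used by the JIT may differ, but for all-bits/zero lanes the
    // round trip is exact.
    bool roundTripped[4];
    for (int i = 0; i < 4; i++)
    {
        roundTripped[i] = (vec[i] != 0);
    }

    for (int i = 0; i < 4; i++)
    {
        printf("lane %d: vec=0x%08X active=%d\n", i, (unsigned)vec[i], (int)roundTripped[i]);
    }
    return 0;
}
```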
> Sve.CreateTrueMaskInt32() should rather likely be imported directly as GT_CNS_VEC instead; that way it has the same normalized representation and can participate in various IsConstant checks, constant folding, and other optimizations without special handling.

Did you mean GT_CNS_MSK here? (otherwise it would get imported as GT_CNS_VEC plus a ConvertVectorToMask)
> Did you mean GT_CNS_MSK here?
No.
All SVE.CreateTrueMaskInt32-like intrinsics map to a managed signature that returns Vector<T> and so are imported as nodes producing a TYP_SIMD. This is what the HW_Flag_ReturnsPerElementMask flag ends up doing. Some other intrinsics then take a parameter as a mask, which is what HW_Flag_ExplicitMaskedOperation is for. This causes that parameter to be wrapped in ConvertVectorToMask, as all nodes at this point in time are TYP_SIMD due to the managed signature taking Vector<T>.

This means that even though the underlying instruction for CreateTrueMaskInt32 may produce a mask, the IR we import today is (rough pseudo-code):

GT_HWINTRINSIC ConvertMaskToVector TYP_SIMD
+----- GT_HWINTRINSIC CreateTrueMaskInt32 TYP_MASK

This can be simplified to just directly producing a GT_CNS_VEC that is AllBitsSet.

The existing constant folding support for ConvertVectorToMask(GT_CNS_VEC) is then what allows this to later become a GT_CNS_MSK if something actually needs to consume it as a mask (like NI_Sve_TestFirstTrue). The inverse is true as well for anything that produces a constant mask: if the actual consumer wants a vector, it is handled by ConvertMaskToVector(GT_CNS_MSK).

There is likewise folding for ConvertVectorToMask(ConvertMaskToVector(x)) and ConvertMaskToVector(ConvertVectorToMask(x)) to just consume x directly, which ensures that non-constants can also be directly consumed without unnecessary conversions.
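In tree terms, those round-trip folds amount to something like the sketch below (the helper name and the Op(1) operand positions are assumptions for illustration, not the actual dotnet/runtime code; on arm64 the conversion intrinsics may carry additional operands):

```cpp
// Sketch only: remove redundant mask<->vector conversions so the original
// node is consumed directly.
GenTree* TryFoldRedundantMaskConversion(GenTree* node)
{
    // ConvertVectorToMask(ConvertMaskToVector(x)) => x
    if (node->OperIsConvertVectorToMask())
    {
        GenTree* op = node->AsHWIntrinsic()->Op(1);
        if (op->OperIsConvertMaskToVector())
        {
            return op->AsHWIntrinsic()->Op(1); // consume the TYP_MASK node directly
        }
    }

    // ConvertMaskToVector(ConvertVectorToMask(x)) => x
    if (node->OperIsConvertMaskToVector())
    {
        GenTree* op = node->AsHWIntrinsic()->Op(1);
        if (op->OperIsConvertVectorToMask())
        {
            return op->AsHWIntrinsic()->Op(1); // consume the TYP_SIMD node directly
        }
    }

    return node;
}
```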
This split between the managed signature (which takes Vector<T> and therefore expects TYP_SIMD inputs/returns) and the underlying native instruction (which expects TYP_MASK inputs/returns) is why on x64 we have both NI_AVX512_* and NI_EVEX_* definitions. For example, we define both:
HARDWARE_INTRINSIC(AVX512F, BlendVariable, 64, 3, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_InvalidNodeId)
HARDWARE_INTRINSIC(EVEX, BlendVariableMask, -1, 3, {INS_vpblendmb, INS_vpblendmb, INS_vpblendmw, INS_vpblendmw, INS_vpblendmd, INS_vpblendmd, INS_vpblendmq, INS_vpblendmq, INS_vblendmps, INS_vblendmpd}, HW_Category_SimpleSIMD, HW_Flag_EmbBroadcastCompatible)
The former maps to the managed signature and is marked InvalidNodeId specifically because it doesn't map to the underlying instruction and so isn't representative of what we want to track in the actual IR, only of what we want to map from the managed API surface as exposed via IL.

The latter maps to the instruction and so is what we want to actually track for the IR. It directly takes/returns the TYP_MASK nodes, which are only available to the JIT. The relevant folding, value numbering, and other support then ensures everything works cleanly and is optimized to the efficient codegen we expect.
> All SVE.CreateTrueMaskInt32-like intrinsics map to a managed signature that returns Vector<T> and so are imported as nodes producing a TYP_SIMD.

Yes, understood now. Agreed this seems to be the better way to do it, but I suspect it is going to be tricky to get this working in all the existing code we've already got.
> but I suspect it is going to be tricky to get this working in all the existing code we've already got.
Actually, this seems straightforward. Seems to work fine for zero mask. Asmdiff is only showing reductions. Still need to do true mask and fix up all the tests.