
Arm64 SVE: Better optimise zero/allbits vectors #115566

Draft: wants to merge 12 commits into main

Conversation

@a74nh (Contributor) commented May 14, 2025:

Fixes #114443

  • IsVectorZero() should allow for all zero vectors and false masks that have been converted to vectors.
  • IsVectorAllBitsSet() should allow for all bits set vectors and true masks that have been converted to vectors.
  • IsMaskZero() should allow for false masks and all zero vectors that have been converted to masks.
  • IsMaskAllBitsSet() should allow for true masks and all bits set vectors that have been converted to masks.
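
As a rough illustration of the IsVectorZero() bullet above (a sketch only, not the actual gentree.h change; OperIsFalseMask is an assumed helper name for recognizing an all-false mask constant):

// Sketch: a vector counts as "zero" if it is an all-zero GT_CNS_VEC, or (on arm64)
// a ConvertMaskToVector whose operand is an all-false mask.
bool IsVectorZeroSketch(GenTree* node)
{
    if (node->IsCnsVec() && node->AsVecCon()->IsZero())
    {
        return true;
    }
#if defined(TARGET_ARM64)
    // Mirrors the true-mask case for IsVectorAllBitsSet() visible in the diff further down.
    if (node->OperIsConvertMaskToVector() && node->AsHWIntrinsic()->Op(1)->OperIsFalseMask())
    {
        return true;
    }
#endif
    return false;
}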

In addition:

  • Fix up all the errors caused by these changes.
  • Add a bunch of asmcheck tests

@dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) on May 14, 2025
@github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 14, 2025
{
// When we are merging with zero, we can specialize
// and avoid instantiating the vector constant.
// Do this only if op1 was AllTrueMask
MakeSrcContained(node, op3);
if (op3->OperIsConvertMaskToVector())
a74nh (Contributor, Author) commented:

I wonder if we can go as far as putting the OperIsConvertMaskToVector() check inside MakeSrcContained() (and also adding one for OperIsConvertVectorToMask()).

Contributor commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@a74nh added the arch-arm64 and arm-sve (Work related to arm64 SVE/SVE2 support) labels on May 14, 2025
a74nh (Contributor, Author) commented May 14, 2025:

Currently there are some issues around non-faulting LoadVectors, and I need to check that I've not created any code size regressions.


#if defined(TARGET_ARM64)
// Can also be an all true mask that has been converted to a vector.
if (OperIsConvertMaskToVector() && AsHWIntrinsic()->Op(1)->OperIsTrueMask())
Member commented:

This doesn't feel correct. We have CnsVec and we have CnsMask.

A ConvertMaskToVector(CnsMsk) should just be folded to CnsVec instead; and vice-versa.

a74nh (Contributor, Author) commented:

Right now, we're essentially treating these two differently:
Sve.ConditionalSelect(Sve.CreateTrueMaskInt32(), a, b);
Sve.ConditionalSelect(Vector<int>.AllBitsSet, a, b);

The second will be imported with a ConvertVectorToMask() on arg1. Are you suggesting the AllBitsSet and the convert should be folded into a CreateTrueMaskInt32()?

When do you suggest that should be done?

Member commented:

Sve.CreateTrueMaskInt32() should rather likely be imported directly as GT_CNS_VEC instead, that way it has the same normalized representation and can participate in various IsConstant checks, constant folding, and other optimizations without special handling.

We have both GT_CNS_VEC for TYP_SIMD constants and GT_CNS_MSK for TYP_MASK constants. We already have general purpose handling that recognizes ConvertMaskToVector(GT_CNS_MSK) and ConvertVectorToMask(GT_CNS_VEC) here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L32269-L32324

So provided you're importing it as a constant (because that is what it creates) then everything should be fine. You can then optimize codegen for GT_CNS_MSK to generate PTRUE/PFALSE in the scenarios that need it, just as we optimize GT_CNS_VEC to special case certain constants as well (i.e. using mvni for AllBitsSet and movi for Zero): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/codegenarm64.cpp#L2268-L2340
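
A minimal sketch of that import-time normalization (my assumption about the shape, not code from this PR; the isAllPattern flag and the function/parameter names are illustrative, while gtNewAllBitsSetConNode is the existing helper for building an AllBitsSet GT_CNS_VEC):

// Sketch: normalize an all-true CreateTrueMask* call to an AllBitsSet vector
// constant at import time, so it shares the same constant representation as
// Vector<T>.AllBitsSet and participates in the usual constant folding.
GenTree* impImportTrueMaskSketch(Compiler* comp, var_types retType, bool isAllPattern)
{
    if (isAllPattern)
    {
        return comp->gtNewAllBitsSetConNode(retType);
    }

    // Non-"all" mask patterns still need the real CreateTrueMask intrinsic node;
    // returning nullptr here means the caller falls back to the default import path.
    return nullptr;
}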

a74nh (Contributor, Author) commented:

> Sve.CreateTrueMaskInt32() should rather likely be imported directly as GT_CNS_VEC instead, that way it has the same normalized representation and can participate in various IsConstant checks, constant folding, and other optimizations without special handling.

Did you mean GT_CNS_MSK here? (otherwise it would get imported as GT_CNS_VEC plus a ConvertVectorToMask)

@tannergooding (Member) commented May 14, 2025:

> Did you mean GT_CNS_MSK here?

No.

All the Sve.CreateTrueMask* APIs map to a managed signature that returns Vector<T> and so are imported as nodes producing TYP_SIMD. This is what the HW_Flag_ReturnsPerElementMask flag ends up doing. Some other intrinsics then take a parameter as a mask, which is what HW_Flag_ExplicitMaskedOperation is for. This causes that parameter to be wrapped in ConvertVectorToMask, as all nodes at this point in time are TYP_SIMD due to the managed signature taking Vector<T>.

This means even though the underlying instruction for CreateTrueMaskInt32 may produce a mask, the IR we import today is (rough pseudo-code):

GT_HWINTRINSIC    ConvertMaskToVector    TYP_SIMD
+-----  GT_HWINTRINSIC    CreateTrueMaskInt32    TYP_MASK

This can be simplified to just directly producing a GT_CNS_VEC that is AllBitsSet.

The existing constant folding support for ConvertVectorToMask(GT_CNS_VEC) is then what allows this to later become a GT_CNS_MSK if something actually needs to consume it as a mask (like NI_Sve_TestFirstTrue). The inverse holds as well: anything that produces a constant mask whose consumer wants a vector is handled by the ConvertMaskToVector(GT_CNS_MSK) folding.

There is likewise folding for ConvertVectorToMask(ConvertMaskToVector(x)) and ConvertMaskToVector(ConvertVectorToMask(x)) to just consume x directly which ensures that non-constants can also be directly consumed without unnecessary conversions.
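
Roughly, the folds described above have this shape (a sketch, not the gentree.cpp code linked earlier; IsCnsMsk is assumed by analogy with the existing IsCnsVec, and MakeCnsVecFromMask/MakeCnsMskFromVec stand in for the real constant-evaluation helpers):

// Placeholder helpers standing in for the real constant-evaluation routines.
GenTree* MakeCnsVecFromMask(GenTree* cnsMsk);
GenTree* MakeCnsMskFromVec(GenTree* cnsVec);

// Pseudo-C++ for the constant and round-trip folds on the two conversion nodes.
GenTree* FoldMaskVectorConversionSketch(GenTreeHWIntrinsic* node)
{
    GenTree* op1 = node->Op(1);

    if (node->OperIsConvertMaskToVector())
    {
        if (op1->IsCnsMsk())                   // ConvertMaskToVector(GT_CNS_MSK) -> GT_CNS_VEC
            return MakeCnsVecFromMask(op1);
        if (op1->OperIsConvertVectorToMask())  // ConvertMaskToVector(ConvertVectorToMask(x)) -> x
            return op1->AsHWIntrinsic()->Op(1);
    }
    else if (node->OperIsConvertVectorToMask())
    {
        if (op1->IsCnsVec())                   // ConvertVectorToMask(GT_CNS_VEC) -> GT_CNS_MSK
            return MakeCnsMskFromVec(op1);
        if (op1->OperIsConvertMaskToVector())  // ConvertVectorToMask(ConvertMaskToVector(x)) -> x
            return op1->AsHWIntrinsic()->Op(1);
    }

    return node; // no fold
}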


This split between the managed signature (which takes Vector<T> and therefore expects TYP_SIMD inputs/returns) and the underlying native instruction (which expects TYP_MASK inputs/returns) is why on x64 we have both NI_AVX512_* and NI_EVEX_* definitions. For example, we define both:

HARDWARE_INTRINSIC(AVX512F,         BlendVariable,                              64,              3,     {INS_invalid,           INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid,            INS_invalid},           HW_Category_SimpleSIMD,             HW_Flag_InvalidNodeId)

HARDWARE_INTRINSIC(EVEX,            BlendVariableMask,                          -1,              3,     {INS_vpblendmb,         INS_vpblendmb,          INS_vpblendmw,          INS_vpblendmw,          INS_vpblendmd,          INS_vpblendmd,          INS_vpblendmq,          INS_vpblendmq,          INS_vblendmps,          INS_vblendmpd},         HW_Category_SimpleSIMD,             HW_Flag_EmbBroadcastCompatible)

The former maps to the managed signature and is marked InvalidNodeId specifically because it doesn't map to the underlying instruction and so isn't representative of what we want to track in the actual IR, only of what we want to map from the managed API surface as exposed via IL.

The latter maps to the instruction and so is what we want to actually track for the IR. It directly takes/returns the TYP_MASK nodes which are only available to the JIT. The relevant folding, value numbering, and other support then ensures everything works cleanly and is optimized to the efficient codegen we expect.
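
So a managed call like Avx512F.BlendVariable(left, right, mask) ends up tracked in IR roughly as below (same pseudo-code style as the earlier dump; this is my sketch of the shape, not output from the JIT), with the vector-typed mask argument wrapped for the EVEX node:

GT_HWINTRINSIC    BlendVariableMask    TYP_SIMD
+-----  left      TYP_SIMD
+-----  right     TYP_SIMD
+-----  GT_HWINTRINSIC    ConvertVectorToMask    TYP_MASK
        +-----  mask      TYP_SIMD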

a74nh (Contributor, Author) commented:

> All the Sve.CreateTrueMask* APIs map to a managed signature that returns Vector<T> and so are imported as nodes producing TYP_SIMD.

Yes, understood now. Agreed this seems to be the better way to do it, but I suspect it is going to be tricky to get this working in all the existing code we've already got.

a74nh (Contributor, Author) commented:

> but I suspect it is going to be tricky to get this working in all the existing code we've already got.

Actually, this seems straightforward. It seems to work fine for the zero mask, and asmdiff is only showing reductions. Still need to do the true mask and fix up all the tests.

Labels
arch-arm64 · area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) · arm-sve (Work related to arm64 SVE/SVE2 support) · community-contribution (Indicates that the PR has been added by a community member)

Successfully merging this pull request may close these issues.

ARM64-SVE: ConditionalSelect IsVectorZero check will never trigger for Op1
2 participants