Add CPUID for AvxVnniInt8 and AvxVnniInt16 #113956

Open
wants to merge 25 commits into
base: main

Conversation

khushal1996
Member

@khushal1996 khushal1996 commented Mar 27, 2025

This PR adds CPUID support for the AVX-VNNI-INT8 and AVX-VNNI-INT16 ISAs.

Design

image
image

The changes enable the two ISAs when either:

  1. Avx10.2 is enabled, or
  2. the CPUID bits for both ISAs are enabled.

This follows the discussion in API proposal #112586.
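As a rough illustration of the enablement rule, here is a minimal sketch. It is not the PR's code: the `CpuFeatures` struct and both function names are hypothetical, and it assumes one reading of the rule, namely that each ISA lights up on its own CPUID bit or on AVX10.2.

```cpp
// Hypothetical snapshot of the relevant feature bits; not a runtime type.
struct CpuFeatures
{
    bool avx10v2;      // AVX10.2 is reported
    bool avxVnniInt8;  // AVX-VNNI-INT8 CPUID bit is set
    bool avxVnniInt16; // AVX-VNNI-INT16 CPUID bit is set
};

// Assumption: an ISA is usable when AVX10.2 is enabled or its own bit is set.
constexpr bool IsAvxVnniInt8Enabled(CpuFeatures f)
{
    return f.avx10v2 || f.avxVnniInt8;
}

constexpr bool IsAvxVnniInt16Enabled(CpuFeatures f)
{
    return f.avx10v2 || f.avxVnniInt16;
}
```

Under this reading, a machine that reports only the INT16 bit would enable AvxVnniInt16 but not AvxVnniInt8.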

Testing

Note 1: Emitter unit tests were not run, since they were added and verified along with the AVX10.2 PR #111209.

Note 2: SuperPMI results are not accurate, since we are adding a new CPUID, which leads to a new jiteeversionguid. Even after changing the jiteeversion manually, the SuperPMI run shows errors and failures based on the old mch files; these can be ignored.

Run JIT subtree with AVXVNNIINT* enabled / disabled


AVXVNNIINT* Enabled
image

AVXVNNIINT* disabled
image

@khushal1996
Member Author

@tannergooding This is the first of the two PRs needed for the AVX VNNI INT* API introduction #112586.

Comment on lines 815 to 818
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT8
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT8_V512
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT16
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT16_V512
Member

Any reason we're not adding the APIs at the same time? They look like they should be generally table driven, so it should be a minimal change on top...

Member Author

We can do that too. For all other ISAs, we generally did CPUID and API introduction as separate PRs. Also, it becomes easier to run superpmi once the CPUID PR goes in. Let me know what you'd prefer.

Member

> For all other ISAs, we generally did CPUID and API introduction as separate PRs.

For some of the others, like AVX10.2, we've done it incrementally because of the number of APIs and total work required.

That is, checking in the CPUID support first allowed a reduction of conflicts and parallelization of adding a large number of intrinsic APIs across several PRs.

In this case, there's only a very small number of APIs that are likely entirely table driven, so there's little to no risk of conflicts or additional churn.

Doing it all at once lets us build confidence that the CPUID checks and the end-to-end story are correct, since the change is self-contained and the CPUID and other tests can be added at the same time.

> Also, it becomes easier to run superpmi once the CPUID PR goes in

There's not much need to run SPMI for net new intrinsics that nothing is using yet; we're going to get zero diffs.

Member Author

Okay. Thanks for the review. I will switch this PR to add everything together and then update you.

@khushal1996 khushal1996 force-pushed the kcm-avxvnniint8-cpuid branch from 141d643 to 98fc970 Compare April 14, 2025 21:40
@khushal1996
Member Author

@tannergooding @saucecontrol I have added the CPUID, API surface, JIT handling and template tests here.

@tannergooding tannergooding self-requested a review April 14, 2025 21:44
@tannergooding tannergooding self-assigned this Apr 14, 2025
@khushal1996
Member Author

@tannergooding this PR is in good shape now. CI failures look unrelated. Can you help review this PR?

Comment on lines +103 to +111
bool emitter::IsAVXVNNIINT8Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT8_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT8_INSTRUCTION);
}

bool emitter::IsAVXVNNIINT16Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT16_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT16_INSTRUCTION);
}
Member

What is the likelihood that these ISAs are provided separately in the real world? Is it something we want to support in the JIT today?

With the simplification I'm doing in #115983, I'd like to identify places where we can simplify the JIT and overall support without meaningfully affecting real-world scenarios.


For example, #115983 is collapsing X86BASE+SSE+SSE2 into just X86BASE because they are required to be provided together and form the baseline. It is collapsing AVX512F+BW+CD+DQ+VL into simply AVX512 for similar reasons.

I expect we could also collapse AVX2+FMA+BMI1+BMI2 into a similar joined set since no hardware has ever existed that provided any of these independently and allowing individual light-up isn't meaningful for the JIT. Allowing for AVX to be standalone from the rest is beneficial, however.

I would expect that allowing AVXVNNI to be standalone is similarly beneficial. However, I'd expect that in practice AVXVNNIINT8/16 will always be provided together, and we might be able to represent these in the JIT as a single ISA.

It's also unclear, given the change in unification strategy for Avx10 and it now requiring V512 support, whether we will ever actually see hardware that doesn't provide them together. We do want the managed API surface to be correct: we still want AvxVnniInt8 and AvxVnniInt16 to exist and expose the actual CPUID bits. These questions are more about whether we can simplify the JIT support, the R2R and NAOT checks, etc. to be a little more practical given expected real-world use cases.

Comment on lines +343 to +345
// Evex versions of AvxVnniInt8 and AvxVnniInt16 will be supported
// with Avx10.2 ISA.
return emitComp->compOpportunisticallyDependsOn(InstructionSet_AVX10v2);
Member

@tannergooding tannergooding May 28, 2025

Is this correct?

Will there be hardware where EVEX is supported and AvxVnniInt8/16 are supported, but EVEX encoding of AvxVnniInt8/16 is not supported?

I'd expect that the nuance is much like Gfni or Vpclmul, where if EVEX is supported, then the EVEX encodings of all EVEX-encodable instructions must be supported.

Member Author

This was kept because pre-AVX10.2 hardware only has VEX support for AVXVNNIINT8/16. Per the manual, AVXVNNIINT8/16 only have VEX encodings, whereas AVX10.2 introduces the same instructions with EVEX support. Hence, when we enable AVXVNNIINT8/16 on pre-AVX10.2 hardware, we should expect to see only VEX support.

AVXVNNIINT16
image
image

AVX10.2
image
image

CPUID
image

I will be adding CPUID detection for AVX10_VNNI_INT to enable the same ISAs. But at this point, I think it is better to have both VEX and EVEX support.

Member

> This was kept due to the fact that in the PRE-AVX10.2 hardware, we only had VEX support for AVXVNNIINT8/16. If we look at the manual, AVXVNNIINT8/16 only have VEX whereas AVX10.2 introduces same instructions with EVEX support. Hence, here, when we are enabling AVXVNNIINT8/16, we should expect that in PRE-AVX10.2 hardware, we will see only VEX support.

Right, but what I'm interested in is if we expect any hardware where EVEX is supported and AVXVNNIINT8/16 is supported, but EVEX encoded AVXVNNIINT8/16 is not supported.

If we expect this to be non-existent, but technically allowed, then I think we can just configure the cpufeature checks to account for it and report the feature as unavailable in the unlikely scenario that occurs. This would allow us to simplify the JIT support while still supporting the primary real world scenarios.

If we expect it to actually occur in the wild, particularly if we know of real CPUs that will be in such a setup, then the more verbose handling is justified.

Member Author

Per the manual, we can have EVEX support without support for EVEX-encoded AVXVNNIINT8/16. I will let you know once I confirm which of these scenarios can actually occur in real hardware.

Member Author

@khushal1996 khushal1996 May 28, 2025

@tannergooding Okay. Here is what we know so far:

  • AVXVNNIINT8/16 cannot be merged because they are present on different machines.
    image

  • There will not be a case, prior to Avx10.2, where EVEX is supported and AVXVNNIINT8/16 are also present.

  • With AVX10.2, we will have a new CPUID AVX10_VNNI_INT which can enable these ISAs. So with AVX10.2, we will have EVEX versions of these ISAs supported.

This means that we cannot combine the checks for these ISAs into a single ISA, since their VEX versions can be available independently prior to AVX10.2.

With AVX10.2, we would enable them with the AVX10_VNNI_INT CPUID bit and handle EVEX support as it is currently configured in HasEvexEncoding.
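The encoding choice described here can be sketched roughly as follows. This is an illustration only, not the JIT's `HasEvexEncoding`: `Encoding`, `VnniIntSupport`, and `PickEncoding` are hypothetical names, and the sketch assumes that EVEX forms exist only with AVX10.2, while VEX forms require the ISA's own CPUID bit.

```cpp
enum class Encoding
{
    None,
    Vex,
    Evex
};

// Hypothetical summary of what the hardware reports for one of the two ISAs.
struct VnniIntSupport
{
    bool vexIsa;  // AVXVNNIINT8 or AVXVNNIINT16 CPUID bit (VEX forms)
    bool avx10v2; // AVX10.2, which introduces the EVEX forms
};

// needsEvex: the operation requires an EVEX-only feature (for example V512).
constexpr Encoding PickEncoding(VnniIntSupport s, bool needsEvex)
{
    return (s.avx10v2 && (needsEvex || !s.vexIsa)) ? Encoding::Evex
           : (s.vexIsa && !needsEvex)              ? Encoding::Vex
                                                   : Encoding::None;
}
```

On pre-AVX10.2 hardware this only ever yields `Vex` or `None`, which matches the expectation that such machines see only VEX support.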

Let me know if this makes sense.

Member

> With AVX10.2, we will have a new CPUID AVX10_VNNI_INT which can enable these ISAs. So with AVX10.2, we will have EVEX versions of these ISAs supported.

Does this mean you can have AVX10.1 + AVX_VNNI_INT?


The general support matrix here is fairly confusing and the end goal of my question chain here is to try and simplify the number of switches and complex paths we need to support in the JIT, NAOT startup, and R2R image header.

For example, while there is hardware such as Knights Landing which supports AVX512 F+CD without also supporting BW+DQ+VL, we made the decision that any AVX512 support required all of F+BW+CD+DQ+VL for any JIT lightup. Cutting out this piece of hardware ended up being worth it to reduce the complexity of the JIT, the test matrix, the performance pitfalls users could have hit, etc. Correspondingly, it is allowing #115983 to occur, which is removing nearly 2800 lines of code from the runtime just by combining InstructionSet_AVX512* for those sets into simply InstructionSet_AVX512.
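The all-or-nothing rule for AVX512 can be pictured as a mask check. The sketch below is illustrative only: the flag names and values are made up for the example (they are not CPUID bit positions), and `LightUpAvx512` is a hypothetical helper, not a runtime function.

```cpp
#include <cstdint>

// Hypothetical flags; the values are illustrative, not CPUID bit positions.
enum Avx512Part : std::uint32_t
{
    Avx512_F  = 1u << 0,
    Avx512_BW = 1u << 1,
    Avx512_CD = 1u << 2,
    Avx512_DQ = 1u << 3,
    Avx512_VL = 1u << 4,
};

constexpr std::uint32_t kAvx512Required = Avx512_F | Avx512_BW | Avx512_CD | Avx512_DQ | Avx512_VL;

// Light up only when every required subset is reported together.
constexpr bool LightUpAvx512(std::uint32_t reported)
{
    return (reported & kAvx512Required) == kAvx512Required;
}
```

With this shape, a machine reporting only F+CD gets no AVX512 lightup at all, and a single combined flag is enough downstream.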

So, I am namely interested in if a similar simplification of AvxVnniInt8/16 can occur given that we have 4 total CPUID bits here, 3 of which can indicate AvxVnniInt8 support and a different 3 of which can indicate AvxVnniInt16 support.

This might just be combining InstructionSet_AvxVnniInt8 and InstructionSet_AvxVnniInt8_V512 into a singular InstructionSet_AvxVnniInt8, because the V512 support is implied by AVX512 support. We could guarantee this in the JIT by not providing AvxVnniInt8/16 support if hardware was encountered that provided the V128/V256 support but which didn't provide the V512 support despite EVEX being supported (i.e. we disable the support if some hardware was encountered with AVXVNNIINT8+AVX512 but which did not provide AVX10_VNNI_INT or AVX10.2).
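A minimal sketch of that guard, under stated assumptions: `VnniInt8Bits` and `ReportAvxVnniInt8` are hypothetical names, and the rule encoded is only the one described above (treat the ISA as unavailable when AVXVNNIINT8 and AVX512 are both present but neither AVX10_VNNI_INT nor AVX10.2 is).

```cpp
// Hypothetical view of the four inputs relevant to AvxVnniInt8.
struct VnniInt8Bits
{
    bool avxVnniInt8;  // VEX-encoded AVXVNNIINT8 CPUID bit
    bool avx512;       // EVEX (AVX512) is supported
    bool avx10VnniInt; // AVX10_VNNI_INT CPUID bit
    bool avx10v2;      // AVX10.2
};

// Report one combined ISA. If EVEX exists, the EVEX forms must exist too;
// otherwise the whole ISA is reported as unavailable.
constexpr bool ReportAvxVnniInt8(VnniInt8Bits b)
{
    return (b.avx10VnniInt || b.avx10v2) || (b.avxVnniInt8 && !b.avx512);
}
```

The trade-off is that the unusual AVXVNNIINT8+AVX512 machine loses the VEX forms as well, in exchange for not tracking a separate V512 instruction set.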

It might also mean deciding that the nuance of Sierra Forest/Grand Ridge supporting AvxVnniInt8 but not AvxVnniInt16, or the inverse case for Clearwater Forest, isn't significant enough, and that the JIT support should require both to light up either. As a hypothetical, we might say this because Avx10 is supposed to be "unifying" moving forward, so future hardware is expected to always provide them together, and the hardware that did exist without both isn't "significant enough" or might cause usability issues/performance pitfalls for developers (not saying this is the case, just speaking hypothetically).

It's not clear given the above what is the "best" decision for .NET 10. We can always revisit things in the future, but if we can simplify the JIT support and test matrix that is all the better. Having to encode and track that AvxVnniInt8 can be enabled by AVXVNNIINT8 or AVX10.2 or AVX10_VNNI_INT is not going to be easy and isn't entirely being handled at the moment, so we need to plot out the support matrix we want to actually support.

@@ -10008,7 +10049,8 @@ void emitter::emitIns_SIMD_R_R_R_A(instruction ins,
                                    GenTreeIndir* indir,
                                    insOpts instOptions)
 {
-    assert(IsFMAInstruction(ins) || IsPermuteVar2xInstruction(ins) || IsAVXVNNIInstruction(ins));
+    assert(IsFMAInstruction(ins) || IsPermuteVar2xInstruction(ins) || IsAVXVNNIInstruction(ins) ||
Member

I think we could collapse IsAVXVNNIInstruction(ins) || IsAVXVNNIINT8Instruction(ins) || IsAVXVNNIINT16Instruction(ins) down into some IsAvxVnniFamilyInstruction(ins), given the places that are checking them.
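A sketch of what such a helper might look like. The enum below is a hypothetical stand-in for the JIT's instruction list (the real list, names, and ordering differ), and the predicates are written as free functions rather than `emitter` members to keep the example self-contained.

```cpp
// Hypothetical, shortened instruction enum for illustration only.
enum instruction
{
    INS_add,

    INS_vpdpbusd, // AVX-VNNI
    INS_vpdpwssd,

    INS_vpdpbssd, // AVX-VNNI-INT8
    INS_vpdpbuud,

    INS_vpdpwsud, // AVX-VNNI-INT16
    INS_vpdpwuud,

    INS_FIRST_AVXVNNI_INSTRUCTION      = INS_vpdpbusd,
    INS_LAST_AVXVNNI_INSTRUCTION       = INS_vpdpwssd,
    INS_FIRST_AVXVNNIINT8_INSTRUCTION  = INS_vpdpbssd,
    INS_LAST_AVXVNNIINT8_INSTRUCTION   = INS_vpdpbuud,
    INS_FIRST_AVXVNNIINT16_INSTRUCTION = INS_vpdpwsud,
    INS_LAST_AVXVNNIINT16_INSTRUCTION  = INS_vpdpwuud,
};

constexpr bool IsAVXVNNIInstruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNI_INSTRUCTION) && (ins <= INS_LAST_AVXVNNI_INSTRUCTION);
}

constexpr bool IsAVXVNNIINT8Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT8_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT8_INSTRUCTION);
}

constexpr bool IsAVXVNNIINT16Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT16_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT16_INSTRUCTION);
}

// One check for call sites that accept any of the three VNNI groups.
constexpr bool IsAvxVnniFamilyInstruction(instruction ins)
{
    return IsAVXVNNIInstruction(ins) || IsAVXVNNIINT8Instruction(ins) || IsAVXVNNIINT16Instruction(ins);
}
```

An assert such as the one in `emitIns_SIMD_R_R_R_A` could then name the family once instead of listing each predicate.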

Labels
community-contribution, needs-area-label
3 participants