Add CPUID for AvxVnniInt8 and AvxVnniInt16 #113956
@tannergooding This is the first of the 2 PRs needed for the AVX VNNI INT* API introduction, #112586.
src/coreclr/jit/hwintrinsic.cpp
Outdated
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT8
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT8_V512
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT16
{ NI_Illegal, NI_Illegal }, // AVXVNNIINT16_V512
Any reason we're not adding the APIs at the same time? They look like they should be generally table driven, so it should be a minimal change on top...
We can do that too. For all other ISAs, we generally did CPUID and API introduction as separate PRs. Also, it becomes easier to run superpmi once the CPUID PR goes in. Let me know what you'd prefer.
> For all other ISAs, we generally did CPUID and API introduction as separate PRs.
For some of the others, like AVX10.2, we've done it incrementally because of the number of APIs and total work required.
That is, checking in the CPUID support first allowed a reduction of conflicts and parallelization of adding a large number of intrinsic APIs across several PRs.
In this case, there's only a very small number of APIs that are likely entirely table driven, so there's little to no risk of conflicts or additional churn.
Doing it all at once lets us build confidence that the CPUID checks and the end-to-end story are correct, since the change is self-contained and allows adding the CPUID and other tests at the same time.
> Also, it becomes easier to run superpmi once the CPUID PR goes in
There's not much need to run SPMI for net new intrinsics that nothing is using yet; we're going to get zero diffs.
Okay. Thanks for the review. I will switch this PR to add everything together and then update you.
src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt
@tannergooding @saucecontrol I have added the CPUID, API surface, JIT handling and template tests here.
@tannergooding this PR is in good shape now. CI failures look unrelated. Can you help review this PR?
Co-authored-by: Tanner Gooding <[email protected]>
bool emitter::IsAVXVNNIINT8Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT8_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT8_INSTRUCTION);
}

bool emitter::IsAVXVNNIINT16Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT16_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT16_INSTRUCTION);
}
What is the likelihood that these ISAs are provided separately in the real world? Is it something we want to support in the JIT today?
With the simplification I'm doing in: #115983, I'd basically like to consider places that are unlikely to be impactful to real world scenarios where we can simplify the JIT and overall support.
For example, #115983 is collapsing X86BASE+SSE+SSE2 into just X86BASE because they are required to be provided together and form the baseline. It is collapsing AVX512F+BW+CD+DQ+VL into simply AVX512 for similar reasons.
I expect we could also collapse AVX2+FMA+BMI1+BMI2 into a similar joined set since no hardware has ever existed that provided any of these independently and allowing individual light-up isn't meaningful for the JIT. Allowing for AVX to be standalone from the rest is beneficial, however.
I would expect that allowing AVXVNNI to be standalone is similarly beneficial. However, I'd expect that in practice AVXVNNIINT8/16 will always be provided together and we might be able to represent these in the JIT as a single ISA.
It's also unclear, given the change in unification strategy for Avx10 and it now requiring V512 support, if we will ever actually see hardware that doesn't provide them together. -- We do want the managed API surface to be correct; we still want AvxVnniInt8 and AvxVnniInt16 to exist and expose the actual CPUID bits. These questions are more about whether we can simplify the JIT support, the R2R and NAOT checks, etc. to be a little more practical given expected real world use cases.
// Evex versions of AvxVnniInt8 and AvxVnniInt16 will be supported
// with Avx10.2 ISA.
return emitComp->compOpportunisticallyDependsOn(InstructionSet_AVX10v2);
Is this correct?
Will there be hardware where EVEX is supported and AvxVnniInt8/16 are supported, but EVEX encoding of AvxVnniInt8/16 is not supported?
I'd expect that the nuance is much like Gfni or Vpclmul, where if EVEX is supported, then the EVEX encodings of all EVEX encodable instructions must be supported.
This was kept due to the fact that in the PRE-AVX10.2 hardware, we only had VEX support for AVXVNNIINT8/16. If we look at the manual, AVXVNNIINT8/16 only have VEX whereas AVX10.2 introduces the same instructions with EVEX support. Hence, here, when we are enabling AVXVNNIINT8/16, we should expect that in PRE-AVX10.2 hardware, we will see only VEX support.
I will be adding CPUID detection for AVX10_VNNI_INT to enable the same ISAs. But at this point, I think it is better to have both VEX and EVEX support.
> This was kept due to the fact that in the PRE-AVX10.2 hardware, we only had VEX support for AVXVNNIINT8/16. If we look at the manual, AVXVNNIINT8/16 only have VEX whereas AVX10.2 introduces same instructions with EVEX support. Hence, here, when we are enabling AVXVNNIINT8/16, we should expect that in PRE-AVX10.2 hardware, we will see only VEX support.
Right, but what I'm interested in is if we expect any hardware where EVEX is supported and AVXVNNIINT8/16 is supported, but EVEX encoded AVXVNNIINT8/16 is not supported.
If we expect this to be non-existent, but technically allowed, then I think we can just configure the cpufeature checks to account for it and report the feature as unavailable in the unlikely scenario that occurs. This would allow us to simplify the JIT support while still supporting the primary real world scenarios.
If we expect it to actually occur in the wild, particularly if we know of real CPUs that will be in such a setup, then the more verbose handling is justified.
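The first option above, handling the inconsistent combination in the cpufeature checks instead of in the JIT, might look roughly like the following. This is a minimal sketch under stated assumptions, not the runtime's actual detection code: `DetectedCpu`, `ReportedIsas`, and `ResolveAvxVnniInt` are hypothetical names, and the fields stand in for whatever CPUID-derived state the runtime really tracks.

```cpp
// Hypothetical snapshot of what CPUID detection found on the machine.
struct DetectedCpu
{
    bool evexUsable;     // EVEX encoding is available at all
    bool avxVnniInt8;    // AVX-VNNI-INT8 instructions are available in some form
    bool avxVnniInt16;   // AVX-VNNI-INT16 instructions are available in some form
    bool evexAvxVnniInt; // EVEX-encoded forms are available (assumed to come with AVX10.2)
};

// Hypothetical result: what gets reported to the JIT.
struct ReportedIsas
{
    bool avxVnniInt8;
    bool avxVnniInt16;
};

// If EVEX is usable but the EVEX-encoded forms are missing, report both ISAs as
// unavailable so the JIT never has to reason about that combination.
inline ReportedIsas ResolveAvxVnniInt(const DetectedCpu& cpu)
{
    const bool consistent = !cpu.evexUsable || cpu.evexAvxVnniInt;

    ReportedIsas reported;
    reported.avxVnniInt8  = cpu.avxVnniInt8 && consistent;
    reported.avxVnniInt16 = cpu.avxVnniInt16 && consistent;
    return reported;
}
```

With this shape, VEX-only hardware keeps its light-up, and only the "EVEX without EVEX-encoded VNNI INT" machine loses it.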
So as per the manual, we can have EVEX but no support for EVEX encoded AVXVNNIINT8/16. But I will let you know once I confirm which of these scenarios we can actually have in real hardware.
@tannergooding Okay. So we know the following things so far:
- AVXVNNIINT8/16 cannot be merged because they are present on different machines.
- There will not be a case where EVEX is supported but AVXVNNIINT8/16 are present (prior to Avx10.2).
- With AVX10.2, we will have a new CPUID, AVX10_VNNI_INT, which can enable these ISAs. So with AVX10.2, we will have EVEX versions of these ISAs supported.
This means that we cannot combine the checks for these ISAs into a single ISA, since their VEX versions prior to AVX10.2 can be available independently.
With AVX10.2 we would enable them with the AVX10_VNNI_INT CPUID bit and handle EVEX support as it is configured right now in HasEvexEncoding.
Let me know if this makes sense.
> With AVX10.2, we will have a new CPUID AVX10_VNNI_INT which can enable these ISAs. So with AVX10.2, we will have EVEX versions of these ISAs supported.
Does this mean you can have AVX10.1 + AVX_VNNI_INT?
The general support matrix here is fairly confusing and the end goal of my question chain here is to try and simplify the number of switches and complex paths we need to support in the JIT, NAOT startup, and R2R image header.
For example, while there is hardware such as Knight's Landing which supports AVX512 F+CD without also supporting BW+DQ+VL, we made the decision that any AVX512 support required all of F+BW+CD+DQ+VL for any JIT lightup. Cutting out this piece of hardware ended up being worth it to reduce the complexity of the JIT, the test matrix, the performance pitfalls users could have hit, etc. Correspondingly, it is allowing #115983 to occur which is removing nearly 2800 lines of code from the runtime just by combining InstructionSet_AVX512*_ for those sets into simply InstructionSet_AVX512_.
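The all-or-nothing AVX512 rule described here can be sketched as a single mask comparison. This is only an illustration: the `Avx512Subset` flags and `JitLightsUpAvx512` are hypothetical names, and the bit positions are arbitrary and do not correspond to real CPUID leaves.

```cpp
#include <cstdint>

// Hypothetical feature flags; bit positions are arbitrary, not real CPUID bits.
enum Avx512Subset : uint32_t
{
    Avx512_F  = 1u << 0,
    Avx512_BW = 1u << 1,
    Avx512_CD = 1u << 2,
    Avx512_DQ = 1u << 3,
    Avx512_VL = 1u << 4
};

// Light up AVX512 only when every required subset is present, so a machine
// that reports just F+CD is treated as having no AVX512 support at all.
inline bool JitLightsUpAvx512(uint32_t detected)
{
    const uint32_t required = Avx512_F | Avx512_BW | Avx512_CD | Avx512_DQ | Avx512_VL;
    return (detected & required) == required;
}
```

Under this rule the F+CD-only configuration mentioned above simply fails the check, which is what allows the individual subsets to be represented as one instruction set.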
So, I am namely interested in if a similar simplification of AvxVnniInt8/16 can occur given that we have 4 total CPUID bits here, 3 of which can indicate AvxVnniInt8 support and a different 3 of which can indicate AvxVnniInt16 support.
This might just be combining InstructionSet_AvxVnniInt8 and InstructionSet_AvxVnniInt8_V512 into a singular InstructionSet_AvxVnniInt8, because the V512 support is implied by AVX512 support. We could guarantee this in the JIT by not providing AvxVnniInt8/16 support if hardware was encountered that provided the V128/V256 support but which didn't provide the V512 support despite EVEX being supported (i.e. we disable the support if some hardware was encountered with AVXVNNIINT8+AVX512 but which did not provide AVX10_VNNI_INT or AVX10.2).
It might also be saying that the nuance of Sierra Forest/Grand Ridge supporting AvxVnniInt8 but not AvxVnniInt16, or the inverse case for Clearwater Forest, isn't significant enough and the JIT support should require both to light up either. As a hypothetical, we might say this because Avx10 is supposed to be "unifying" moving forward so future hardware is expected to always be providing them together and the hardware that did exist without both isn't "significant enough" or might cause usability issues/performance pitfalls for developers (not saying this is the case, just speaking hypothetically).
It's not clear given the above what is the "best" decision for .NET 10. We can always revisit things in the future, but if we can simplify the JIT support and test matrix that is all the better. Having to encode and track that AvxVnniInt8 can be enabled by AVXVNNIINT8 or AVX10.2 or AVX10_VNNI_INT is not going to be easy and isn't entirely being handled at the moment, so we need to plot out the support matrix we want to actually support.
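To make the support matrix in this comment concrete, here is a minimal sketch of the "3 of 4 bits" relationship: each ISA is considered available if its own bit, AVX10_VNNI_INT, or AVX10.2 is reported. The `CpuidBit` flags and both helper functions are hypothetical, the bit positions are arbitrary, and the sketch deliberately ignores the VEX versus EVEX encoding question.

```cpp
#include <cstdint>

// Hypothetical flags for the four CPUID-derived inputs named in the discussion.
// Bit positions are arbitrary and do not correspond to real CPUID leaves.
enum CpuidBit : uint32_t
{
    Cpuid_AVXVNNIINT8    = 1u << 0,
    Cpuid_AVXVNNIINT16   = 1u << 1,
    Cpuid_AVX10_VNNI_INT = 1u << 2,
    Cpuid_AVX10v2        = 1u << 3
};

// AvxVnniInt8 is indicated by its own bit, AVX10_VNNI_INT, or AVX10.2.
inline bool SupportsAvxVnniInt8(uint32_t bits)
{
    return (bits & (Cpuid_AVXVNNIINT8 | Cpuid_AVX10_VNNI_INT | Cpuid_AVX10v2)) != 0;
}

// AvxVnniInt16 is indicated by a different set of three bits.
inline bool SupportsAvxVnniInt16(uint32_t bits)
{
    return (bits & (Cpuid_AVXVNNIINT16 | Cpuid_AVX10_VNNI_INT | Cpuid_AVX10v2)) != 0;
}
```

Even in this reduced form, the same answer has to be computed consistently in the JIT, at NAOT startup, and for the R2R image header, which is the tracking cost the comment is weighing.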
@@ -10008,7 +10049,8 @@ void emitter::emitIns_SIMD_R_R_R_A(instruction ins,
 GenTreeIndir* indir,
 insOpts instOptions)
 {
-assert(IsFMAInstruction(ins) || IsPermuteVar2xInstruction(ins) || IsAVXVNNIInstruction(ins));
+assert(IsFMAInstruction(ins) || IsPermuteVar2xInstruction(ins) || IsAVXVNNIInstruction(ins) ||
I think we could collapse IsAVXVNNIInstruction(ins) || IsAVXVNNIINT8Instruction(ins) || IsAVXVNNIINT16Instruction(ins) down into some IsAvxVnniFamilyInstruction(ins) given the places that are checking them.
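A minimal sketch of what such a combined predicate could look like follows. The `instruction` enum here is a heavily reduced, hypothetical stand-in for the JIT's real one (the enumerator ordering and the range markers are assumptions), and `IsAvxVnniFamilyInstruction` is only the name suggested in the comment, not an existing emitter method.

```cpp
#include <cstdint>

// Hypothetical, heavily reduced stand-in for the JIT's `instruction` enum.
enum instruction : uint32_t
{
    INS_add, // unrelated instruction, outside every VNNI range

    INS_vpdpbusd, // AVX-VNNI
    INS_vpdpwssd,
    INS_vpdpbssd, // AVX-VNNI-INT8
    INS_vpdpbuud,
    INS_vpdpwsud, // AVX-VNNI-INT16
    INS_vpdpwuud,

    // Range markers, assumed to bracket each group as in the predicates above.
    INS_FIRST_AVXVNNI_INSTRUCTION      = INS_vpdpbusd,
    INS_LAST_AVXVNNI_INSTRUCTION       = INS_vpdpwssd,
    INS_FIRST_AVXVNNIINT8_INSTRUCTION  = INS_vpdpbssd,
    INS_LAST_AVXVNNIINT8_INSTRUCTION   = INS_vpdpbuud,
    INS_FIRST_AVXVNNIINT16_INSTRUCTION = INS_vpdpwsud,
    INS_LAST_AVXVNNIINT16_INSTRUCTION  = INS_vpdpwuud
};

inline bool IsAVXVNNIInstruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNI_INSTRUCTION) && (ins <= INS_LAST_AVXVNNI_INSTRUCTION);
}

inline bool IsAVXVNNIINT8Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT8_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT8_INSTRUCTION);
}

inline bool IsAVXVNNIINT16Instruction(instruction ins)
{
    return (ins >= INS_FIRST_AVXVNNIINT16_INSTRUCTION) && (ins <= INS_LAST_AVXVNNIINT16_INSTRUCTION);
}

// One predicate for call sites that do not care which VNNI variant they have.
inline bool IsAvxVnniFamilyInstruction(instruction ins)
{
    return IsAVXVNNIInstruction(ins) || IsAVXVNNIINT8Instruction(ins) || IsAVXVNNIINT16Instruction(ins);
}
```

An assert such as the one in `emitIns_SIMD_R_R_R_A` could then test `IsAvxVnniFamilyInstruction(ins)` once instead of chaining three separate checks.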
This PR adds support for CPUID for AVX-VNNI-INT8 & AVX-VNNI-INT16 ISAs
Design
The changes are made in a way to enable the 2 ISAs when Avx10.2 is enabled or
This is w.r.t the discussions done in API proposal #112586
Testing
Note1: Emitter unit tests were not run since they are added and verified along with AVX10.2 PR #111209
Note2: Superpmi results are not accurate since we are adding a new CPUID and it leads to a new jiteeversionguid. Even after changing the jiteeversion manually, superpmi run shows errors and failures based on the old mch files which can be ignored.
Run JIT subtree with AVXVNNIINT* enabled / disabled
AVXVNNIINT* Enabled

AVXVNNIINT* disabled
