The CoreCLR runtime has support for several varieties of hardware intrinsics, and various ways to compile code which uses them. This support varies by target processor, and the code produced depends on how the jit compiler is invoked. This document describes the various behaviors of intrinsics in the runtime, and concludes with implications for developers working on the runtime and libraries portions of the runtime.
Acronym | Definition |
---|---|
AOT | Ahead of time. In this document, it refers to compiling code before the process launches and saving it into a file for later use. |
Most hardware intrinsics support is tied to the use of various Vector apis. There are 4 major api surfaces that are supported by the runtime:
- The fixed length float vectors: `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI, and interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be one of the following types (`System.Byte`, `System.SByte`, `System.Int16`, `System.UInt16`, `System.Int32`, `System.UInt32`, `System.Int64`, `System.UInt64`, `System.Single`, and `System.Double`), and allows use of integer or floating-point data within a vector. The length and alignment of `Vector<T>` is unknown to the developer at compile time (although discoverable at runtime by using the `Vector<T>.Count` api), and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations if the `Vector.IsHardwareAccelerated` api returns true.
- `Vector64<T>`, `Vector128<T>`, and `Vector256<T>` represent fixed-sized vectors that closely resemble the fixed-sized vectors available in C++. These structures can be used in any code that runs, but very few features are supported directly on these types other than creation. They are used primarily in the processor specific hardware intrinsics apis.
- Processor specific hardware intrinsics apis such as `System.Runtime.Intrinsics.X86.Ssse3`. These apis map directly to individual instructions or short instruction sequences that are specific to a particular hardware instruction. These apis are only useable on hardware that supports the particular instruction. See https://github.com/dotnet/designs/blob/master/accepted/2018/platform-intrinsics.md for the design of these.
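As a rough illustration of the `Vector<T>` surface described above, the following sketch sums an array in chunks of `Vector<int>.Count` elements and handles the remainder with scalar code. The `VectorSum` class and its `Sum` method are hypothetical names chosen for this example; only the `System.Numerics` apis are real. The same source is expected to run whether or not the hardware accelerates the vector operations.

```csharp
using System;
using System.Numerics;

public static class VectorSum
{
    // Hypothetical helper: sums the input using Vector<int> where possible.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int i = 0;
        int width = Vector<int>.Count; // only known at runtime
        Vector<int> acc = Vector<int>.Zero;

        // Process full vector-sized chunks.
        for (; i <= values.Length - width; i += width)
        {
            acc += new Vector<int>(values.Slice(i, width));
        }

        // Add the lanes of the accumulator together.
        int total = Vector.Dot(acc, Vector<int>.One);

        // Scalar loop for the elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }

        return total;
    }
}
```

Because the chunk width is read from `Vector<int>.Count`, the loop does not assume any particular vector size.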
There are 3 models for use of intrinsics apis.
- Usage of `Vector2`, `Vector3`, `Vector4`, and `Vector<T>`. For these, it's always safe to just use the types. The jit will generate code that is as optimal as it can for the logic, and will do so unconditionally.
- Usage of `Vector64<T>`, `Vector128<T>`, and `Vector256<T>`. These types may be used unconditionally, but are only truly useful when also using the platform specific hardware intrinsics apis.
- Usage of platform intrinsics apis. All usage of these apis should be wrapped in an `IsSupported` check of the appropriate kind. Then, within the `IsSupported` check the platform specific api may be used. If multiple instruction sets are used, then the application developer must have a check for each instruction set that is used.
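The third model can be sketched as follows. This is a minimal example, assuming a hypothetical `BitHelpers` class; it uses two instruction sets (`Popcnt` and `Lzcnt`) and therefore checks both before touching either api, with a portable fallback that produces the same result.

```csharp
using System.Runtime.Intrinsics.X86;

public static class BitHelpers
{
    // Hypothetical helper: returns the number of set bits plus the number of leading zero bits.
    public static int PopCountPlusLeadingZeros(uint value)
    {
        // Both instruction sets are used below, so both are checked here.
        if (Popcnt.IsSupported && Lzcnt.IsSupported)
        {
            return (int)(Popcnt.PopCount(value) + Lzcnt.LeadingZeroCount(value));
        }

        // Portable fallback with identical behavior.
        int pop = 0;
        for (uint v = value; v != 0; v &= v - 1)
        {
            pop++; // clears the lowest set bit on each iteration
        }

        int leadingZeros = 0;
        for (uint mask = 0x80000000u; mask != 0 && (value & mask) == 0; mask >>= 1)
        {
            leadingZeros++;
        }

        return pop + leadingZeros;
    }
}
```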
Hardware intrinsics have dramatic impacts on codegen, and the codegen of these hardware intrinsics is dependent on the ISA available for the target machine when the code is compiled.
If the code is compiled at runtime by the JIT, then the JIT will generate the best code it can based on the current processor's ISA. This use of hardware intrinsics is independent of jit compilation tier. `MethodImplOptions.AggressiveOptimization` may be used to bypass compilation of tier 0 code and always produce tier 1 code for the method. In addition, the current policy of the runtime is that `MethodImplOptions.AggressiveOptimization` may also be used to bypass compilation of code as R2R code, although that may change in the future.
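For reference, the attribute is applied per method. The sketch below assumes a hypothetical `HotPath` class; per the description above, the marked method would skip tier 0 and, under the current policy, would not be precompiled as R2R code.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class HotPath
{
    // Hypothetical method: asks the runtime to always produce fully optimized code for it.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static float Dot(Vector4 a, Vector4 b)
    {
        return Vector4.Dot(a, b);
    }
}
```

The trade-off is that such a method is always jitted at runtime, so it is best reserved for code where the better codegen is worth the jit cost.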
For AOT compilation, the situation is far more complex. This is due to the following principles of how our AOT compilation model works.
1. AOT compilation must never under any circumstance change the semantic behavior of code except for changes in performance.
2. If AOT code is generated, it should be used unless there is an overriding reason to avoid using it.
3. It must be exceedingly difficult to misuse the AOT compilation tool to violate principle 1.
There are 2 different implementations of AOT compilation under development at this time: the crossgen1 model (which is currently supported on all platforms and architectures), and the crossgen2 model, which is under active development. Any developer wishing to use hardware intrinsics in the runtime or libraries should be aware of the restrictions imposed by the crossgen1 model. Crossgen2, which we expect will replace crossgen1 at some point in the future, has strictly fewer restrictions.
### Code written in System.Private.CoreLib.dll
- Any code which uses `Vector<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_SIMD_NGEN_DISALLOWED`)
- Code which uses Sse and Sse2 platform hardware intrinsics is always generated as it would be at jit time.
- Code which uses Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, and Lzcnt instruction sets will be generated, but the associated IsSupported check will be a runtime check. See `FilterNamedIntrinsicMethodAttribs` for details on how this is done.
- Code which uses other instruction sets will be generated as if the processor does not support that instruction set. (For instance, a usage of Avx2.IsSupported in CoreLib will generate native code where it unconditionally returns false, and then if and when tiered compilation occurs, the function may be rejitted and have code where the property returns true.)
- Non-platform intrinsics which require more hardware support than the minimum supported hardware capability will not take advantage of that capability. In particular, this affects the code generated for `Vector2/3/4.Dot`, `Math.Round`, and `MathF.Round`. See `FilterNamedIntrinsicMethodAttribs` for details. MethodImplOptions.AggressiveOptimization may be used to disable precompilation of this sub-par code.
The rules here provide the following characteristics.
- Some platform specific hardware intrinsics can be used in CoreLib without encountering a startup time penalty
- Some uses of platform specific hardware intrinsics will force the compiler to be unable to AOT compile the code. However, if care is taken to only use intrinsics from the Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, or Lzcnt instruction sets, then the code may be AOT compiled. Preventing AOT compilation may cause a startup time penalty for important scenarios.
- Use of `Vector<T>` causes runtime jit and startup time concerns because it is never precompiled. Current analysis indicates this is acceptable, but it is a perennial concern for applications with tight startup time requirements.
- AOT generated code which could take advantage of more advanced hardware support experiences a performance penalty until rejitted. (If a customer chooses to disable tiered compilation, then customer code may always run slowly).
- Any use of a platform intrinsic in the codebase MUST be wrapped with a call to the associated IsSupported property. This wrapping MUST be done within the same function that uses the hardware intrinsic, and MUST NOT be in a wrapper function unless it is one of the intrinsics that are enabled by default for crossgen compilation of System.Private.CoreLib (See list above in the implementation rules section).
- A single function that uses platform intrinsics must behave identically regardless of whether IsSupported returns true or not. This rule is required because code inside of an IsSupported check that calls a helper function cannot assume that the helper function will itself see the same IsSupported check return true. This is due to the impact of tiered compilation on code execution within the process.
- Excessive use of intrinsics may cause startup performance problems due to additional jitting, or may not achieve desired performance characteristics due to suboptimal codegen.
ACCEPTABLE Code
```csharp
using System.Runtime.Intrinsics.X86;

public class BitOperations
{
    public static int PopCount(uint value)
    {
        if (Avx2.IsSupported)
        {
            // Some series of Avx2 instructions that performs the popcount operation.
        }
        else
            return FallbackPath(value);
    }

    // Portable bit-twiddling popcount, used when Avx2 is not available.
    private static int FallbackPath(uint value)
    {
        const uint c1 = 0x_55555555u;
        const uint c2 = 0x_33333333u;
        const uint c3 = 0x_0F0F0F0Fu;
        const uint c4 = 0x_01010101u;

        value -= (value >> 1) & c1;
        value = (value & c2) + ((value >> 2) & c2);
        value = (((value + (value >> 4)) & c3) * c4) >> 24;

        return (int)value;
    }
}
```
UNACCEPTABLE code
```csharp
using System.Runtime.Intrinsics.X86;

public class BitOperations
{
    public static int PopCount(uint value)
    {
        if (Avx2.IsSupported)
            return UseAvx2(value);
        else
            return FallbackPath(value);
    }

    private static int FallbackPath(uint value)
    {
        const uint c1 = 0x_55555555u;
        const uint c2 = 0x_33333333u;
        const uint c3 = 0x_0F0F0F0Fu;
        const uint c4 = 0x_01010101u;

        value -= (value >> 1) & c1;
        value = (value & c2) + ((value >> 2) & c2);
        value = (((value + (value >> 4)) & c3) * c4) >> 24;

        return (int)value;
    }

    private static int UseAvx2(uint value)
    {
        // THIS IS A BUG!!!!!
        // Some series of Avx2 instructions that performs the popcount operation.
        //
        // The bug here is triggered by the presence of tiered compilation and R2R. The R2R version
        // of this method may be compiled as if the Avx2 feature is not available, and is not reliably rejitted
        // at the same time as the PopCount function.
        //
        // As a special note, on the x86 and x64 platforms, this generally unsafe pattern may be used
        // with the Sse, Sse2, Sse3, Ssse3, Sse41 and Sse42 instruction sets as those instruction sets
        // are treated specially by both crossgen1 and crossgen2 when compiling System.Private.CoreLib.dll.
    }
}
```
- Any code which uses an intrinsic from the `System.Runtime.Intrinsics.Arm` or `System.Runtime.Intrinsics.X86` namespace will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
- Any code which uses `Vector<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_SIMD_NGEN_DISALLOWED`)
- Any code which uses `Vector64<T>`, `Vector128<T>` or `Vector256<T>` will not be compiled AOT. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
- Non-platform intrinsics which require more hardware support than the minimum supported hardware capability will not take advantage of that capability. In particular the code generated for Vector2/3/4 is sub-optimal. MethodImplOptions.AggressiveOptimization may be used to disable compilation of this sub-par code.
The rules here provide the following characteristics.
- Use of platform specific hardware intrinsics causes runtime jit and startup time concerns.
- Use of `Vector<T>` causes runtime jit and startup time concerns.
- AOT generated code which could take advantage of more advanced hardware support experiences a performance penalty until rejitted. (If a customer chooses to disable tiered compilation, then customer code may always run slowly).
- Any use of a platform intrinsic in the codebase SHOULD be wrapped with a call to the associated IsSupported property. This wrapping may be done within the same function that uses the hardware intrinsic, but this is not required as long as the programmer can control all entrypoints to a function that uses the hardware intrinsic.
- If an application developer is highly concerned about startup performance, developers should avoid use of all platform specific hardware intrinsics on startup paths.
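The relaxed wrapping rule above can be sketched as follows: a single public entrypoint performs the `IsSupported` check, and the intrinsic lives in a private helper that cannot be reached any other way. The `Crc32C` class and its method names are hypothetical, chosen only for this example. The software path is a bitwise CRC-32C update, which is assumed here to match what the `Sse42.Crc32` instruction computes.

```csharp
using System.Runtime.Intrinsics.X86;

public static class Crc32C
{
    // The only entrypoint: it performs the IsSupported check on behalf of the private helpers.
    public static uint Update(uint crc, byte data)
    {
        return Sse42.IsSupported ? UpdateSse42(crc, data) : UpdateSoftware(crc, data);
    }

    // Private, so every caller has already passed through the check in Update.
    private static uint UpdateSse42(uint crc, byte data)
    {
        return Sse42.Crc32(crc, data);
    }

    // Bitwise CRC-32C (Castagnoli) update using the reflected polynomial.
    private static uint UpdateSoftware(uint crc, byte data)
    {
        const uint ReflectedPolynomial = 0x82F63B78u;

        crc ^= data;
        for (int bit = 0; bit < 8; bit++)
        {
            crc = (crc & 1) != 0 ? (crc >> 1) ^ ReflectedPolynomial : crc >> 1;
        }

        return crc;
    }
}
```

Note that this shape is only described as acceptable for code outside System.Private.CoreLib.dll; the CoreLib code review rules require the check in the same function as the intrinsic.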
There are 2 sets of instruction sets known to the compiler.
- The baseline instruction set which defaults to (Sse, Sse2), but may be adjusted via compiler option.
- The optimistic instruction set which defaults to (Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, and Lzcnt).
Code will be compiled using the optimistic instruction set to drive compilation, but any use of an instruction set beyond the baseline instruction set will be recorded, as will any attempt to use an instruction set beyond the optimistic set if that attempted use has a semantic effect. If the baseline instruction set includes `Avx2` then the size and characteristics of `Vector<T>` are known. Any other decisions about ABI may also be encoded. For instance, it is likely that the ABI of `Vector256<T>` will vary based on the presence/absence of `Avx` support.
- Any code which uses `Vector<T>` will not be compiled AOT unless the size of `Vector<T>` is known.
- Any code which passes a `Vector256<T>` as a parameter on a Linux or Mac machine will not be compiled AOT unless the support for the `Avx` instruction set is known.
- Non-platform intrinsics which require more hardware support than the optimistic supported hardware capability will not take advantage of that capability. MethodImplOptions.AggressiveOptimization may be used to disable compilation of this sub-par code.
- Code which takes advantage of instructions sets in the optimistic set will not be used on a machine which only supports the baseline instruction set.
- Code which attempts to use instruction sets outside of the optimistic set will generate code that will not be used on machines with support for the instruction set.
- Code which uses platform intrinsics within the optimistic instruction set will generate good code.
- Code which relies on platform intrinsics not within the baseline or optimistic set will cause runtime jit and startup time concerns if used on hardware which does support the instruction set.
- `Vector<T>` code has runtime jit and startup time concerns unless the baseline is raised to include `Avx2`.
- Any use of a platform intrinsic in the codebase SHOULD be wrapped with a call to the associated IsSupported property. This wrapping may be done within the same function that uses the hardware intrinsic, but this is not required as long as the programmer can control all entrypoints to a function that uses the hardware intrinsic.
- If an application developer is highly concerned about startup performance, developers should avoid use of intrinsics beyond Sse42, or should use Crossgen with an updated baseline instruction set.
Since System.Private.CoreLib.dll is known to be code reviewed with the code review rules as written above for crossgen1 with System.Private.CoreLib.dll, it is possible to relax rule "Code which attempts to use instruction sets outside of the optimistic set will generate code that will not be used on machines with support for the instruction set." What this will do is allow the generation of non-optimal code for these situations, but through the magic of code review, the generated logic will still work correctly.
The JIT receives flags which instruct it on what instruction sets are valid to use, and has access to a new jit interface api `notifyInstructionSetUsage(isa, bool supportBehaviorRequired)`.
The notifyInstructionSetUsage api is used to notify the AOT compiler infrastructure that the code may only execute if the runtime environment of the code is exactly the same as the boolean parameter indicates it should be. For instance, if `notifyInstructionSetUsage(Avx, false)` is used, then the code generated must not be used if the `Avx` instruction set is useable. Similarly `notifyInstructionSetUsage(Avx, true)` will indicate that the code may only be used if the `Avx` instruction set is available.
While the above api exists, it is not expected that general purpose code within the JIT will use it. In general jitted code is expected to use a number of different apis to understand the available hardware instruction support.
Api | Description of use | Exact behavior |
---|---|---|
`compExactlyDependsOn(isa)` | Use when making a decision to use or not use an instruction set when the decision will affect the semantics of the generated code. Should never be used in an assert. | Return whether or not an instruction set is supported. Calls notifyInstructionSetUsage with the result of that computation. |
`compOpportunisticallyDependsOn(isa)` | Use when making an opportunistic decision to use or not use an instruction set. Use when the instruction set usage is a "nice to have optimization opportunity", but do not use when a false result may change the semantics of the program. Should never be used in an assert. | Return whether or not an instruction set is supported. Calls notifyInstructionSetUsage if the instruction set is supported. |
`compIsaSupportedDebugOnly(isa)` | Use to assert whether or not an instruction set is supported. | Return whether or not an instruction set is supported. Does not report anything. Only available in debug builds. |
`getSIMDSupportLevel()` | Use when determining what codegen to generate for code that operates on `Vector<T>`, `Vector2`, `Vector3` or `Vector4`. | Queries the instruction sets supported using `compOpportunisticallyDependsOn`, and finds a set of instructions available to use for working with the platform agnostic vector types. |
`getSIMDVectorType()` | Use to get the TYP of the `Vector<T>` type. | Determine the TYP of the `Vector<T>` type. If the TYP may vary on the architecture depending on some set of rules, this function will make sufficient use of the `notifyInstructionSetUsage` api to ensure that the TYP is consistent between compile time and runtime. |
`getSIMDVectorRegisterByteLength()` | Use to get the size of a `Vector<T>` value. | Determine the size of the `Vector<T>` type. If the size may vary on the architecture depending on some set of rules, this function will make sufficient use of the `notifyInstructionSetUsage` api to ensure that the size is consistent between compile time and runtime. |
`maxSIMDStructBytes()` | Get the maximum number of bytes that might be used in a SIMD type during this compilation. | Query the set of instruction sets supported, and determine the largest simd type supported. Use `compOpportunisticallyDependsOn` to perform the queries so that the maximum size needed is the only one recorded. |
`largestEnregisterableStructSize()` | Get the maximum number of bytes that might be represented by a single register in this compilation. Use only as an optimization to avoid calling `impNormStructType` or `getBaseTypeAndSizeOfSIMDType`. | Query the set of instruction sets supported, and determine the largest simd type supported in this compilation, report that size. |
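The difference in reporting between `compExactlyDependsOn` and `compOpportunisticallyDependsOn` can be modeled with a small sketch. This is a toy model written in C# for illustration only, not the JIT's actual C++ implementation; the `IsaUsageModel` class, its string-based instruction set names, and the `Notifications` list are all assumptions made for the example.

```csharp
using System.Collections.Generic;

// Toy model (hypothetical): mimics which queries result in a notifyInstructionSetUsage record.
public sealed class IsaUsageModel
{
    private readonly HashSet<string> _supported;

    // Each entry models one call to notifyInstructionSetUsage(isa, supportBehaviorRequired).
    public List<(string Isa, bool SupportBehaviorRequired)> Notifications { get; }
        = new List<(string Isa, bool SupportBehaviorRequired)>();

    public IsaUsageModel(params string[] supported)
    {
        _supported = new HashSet<string>(supported);
    }

    // Semantic decision: the result is always recorded, whether true or false.
    public bool CompExactlyDependsOn(string isa)
    {
        bool supported = _supported.Contains(isa);
        Notifications.Add((isa, supported));
        return supported;
    }

    // Optimization-only decision: recorded only when the instruction set is supported.
    public bool CompOpportunisticallyDependsOn(string isa)
    {
        bool supported = _supported.Contains(isa);
        if (supported)
        {
            Notifications.Add((isa, true));
        }
        return supported;
    }
}
```

In this model an opportunistic query that returns false leaves no record, which mirrors the table above: the generated code remains valid on machines that do support the instruction set, it simply misses an optimization.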