Static method that evaluates to a constant not being inlined by .NET Core 3 and 5 JIT #38003

Open
@ziriax

When porting Klein (a highly optimized C++ library for doing 3D projective geometric algebra) to Klein# (a C# .NET Core 3+ library using SIMD intrinsics), I noticed that a very silly function that should evaluate to a constant causes the JIT to generate inefficient code.

Manually replacing the function calls with a constant generates about the same SIMD code as the LLVM-compiled C++ code.

The culprit is the following method:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte _MM_SHUFFLE(int a, int b, int c, int d)
{
    // Pack four 2-bit lane indices into one shuffle control byte.
    return (byte)((a << 6) | (b << 4) | (c << 2) | d);
}

Note that the C++ code cheats because it uses a macro to compute the shuffle control byte, so it doesn't suffer from this...
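To make the packing concrete, here is a minimal sketch that evaluates the same expression for a few inputs. The class name `ShuffleControlExample` and the method name `Pack` are made up for illustration, and the argument tuples are simply the ones that decode from the immediates 79h, 9Dh and 0E6h in the second listing further down; they are not necessarily the arguments used in Klein#.

```csharp
using System;

public static class ShuffleControlExample
{
    // Same packing as _MM_SHUFFLE above: each argument is a 2-bit lane index.
    public static byte Pack(int a, int b, int c, int d)
        => (byte)((a << 6) | (b << 4) | (c << 2) | d);

    public static void PrintExamples()
    {
        // Binary 01_11_10_01
        Console.WriteLine($"{Pack(1, 3, 2, 1):X2}"); // 79
        // Binary 10_01_11_01
        Console.WriteLine($"{Pack(2, 1, 3, 1):X2}"); // 9D
        // Binary 11_10_01_10
        Console.WriteLine($"{Pack(3, 2, 1, 2):X2}"); // E6
    }
}
```

Since all four arguments are literals at every call site, the result is fully determined before the program runs, which is why a single immediate operand is the expected outcome.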

When I pass the result of this method to the Shuffle SIMD method, I get very weird assembly code. Every call to Shuffle goes through the following subroutine, which seems to be some kind of jump table with a separate vshufps for each possible control value:

00007FFF9048C72B  lea         r9,[7FFF9048D220h]  
00007FFF9048C732  mov         r9d,dword ptr [r9+rax*4]  
00007FFF9048C736  lea         rdx,[7FFF9048C723h]  
00007FFF9048C73D  add         r9,rdx  
00007FFF9048C740  jmp         r9  
00007FFF9048C743  vshufps     xmm0,xmm0,xmmword ptr [r8],0  
00007FFF9048C749  jmp         00007FFF9048D213  
00007FFF9048C74E  vshufps     xmm0,xmm0,xmmword ptr [r8],1  
00007FFF9048C754  jmp         00007FFF9048D213  
00007FFF9048C759  vshufps     xmm0,xmm0,xmmword ptr [r8],2  
00007FFF9048C75F  jmp         00007FFF9048D213  
00007FFF9048C764  vshufps     xmm0,xmm0,xmmword ptr [r8],3  
00007FFF9048C76A  jmp         00007FFF9048D213  
...
00007FFF9048D213  vmovupd     xmmword ptr [rcx],xmm0  
00007FFF9048D217  mov         rax,rcx  
00007FFF9048D21A  ret  

When I pass the control byte to Shuffle as a literal constant instead, I get a single vshufps SIMD instruction for each Shuffle. In fact, the JIT then compiles the full geometric product to almost the same code as the LLVM C++ compiler produces:

00007FFF9CDC9520  vzeroupper  
00007FFF9CDC9523  vmovupd     xmm0,xmmword ptr [rdx]  
00007FFF9CDC9527  vmovaps     xmm1,xmm0  
00007FFF9CDC952B  vshufps     xmm1,xmm1,xmm1,0  
00007FFF9CDC9530  vmovupd     xmm2,xmmword ptr [r8]  
00007FFF9CDC9535  vmovaps     xmm3,xmm2  
00007FFF9CDC9539  vmulps      xmm1,xmm1,xmm3  
00007FFF9CDC953D  vmovaps     xmm3,xmm0  
00007FFF9CDC9541  vshufps     xmm3,xmm3,xmm3,79h  
00007FFF9CDC9546  vmovaps     xmm4,xmm2  
00007FFF9CDC954A  vshufps     xmm4,xmm4,xmm4,9Dh  
00007FFF9CDC954F  vmulps      xmm3,xmm3,xmm4  
00007FFF9CDC9553  vsubps      xmm1,xmm1,xmm3  
00007FFF9CDC9557  vmovaps     xmm3,xmm0  
00007FFF9CDC955B  vshufps     xmm3,xmm3,xmm3,0E6h  
00007FFF9CDC9560  vmovaps     xmm4,xmm2  
00007FFF9CDC9564  vshufps     xmm4,xmm4,xmm4,2  
00007FFF9CDC9569  vmulps      xmm3,xmm3,xmm4  
00007FFF9CDC956D  vshufps     xmm0,xmm0,xmm0,9Fh  
00007FFF9CDC9572  vshufps     xmm2,xmm2,xmm2,7Bh  
00007FFF9CDC9577  vmulps      xmm0,xmm0,xmm2  
00007FFF9CDC957B  vaddps      xmm0,xmm3,xmm0  
00007FFF9CDC957F  vxorps      xmm2,xmm2,xmm2  
00007FFF9CDC9583  vmovss      xmm3,dword ptr [7FFF9CDC95C0h]  
00007FFF9CDC958B  vmovss      xmm2,xmm2,xmm3  
00007FFF9CDC958F  vxorps      xmm0,xmm0,xmm2  
00007FFF9CDC9593  vaddps      xmm0,xmm1,xmm0  
00007FFF9CDC9597  vmovupd     xmmword ptr [rcx],xmm0  
00007FFF9CDC959B  mov         rax,rcx  
00007FFF9CDC959E  ret  

Is a workaround possible? Since C# has neither macros nor constexpr, I don't think I have any options. I tried changing the method so that bytes are passed and no conversion from int to byte happens, but that didn't help...
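One avenue that may be worth trying, assuming the lane indices are known when the code is written: a C# `const` initializer is a constant expression, so the C# compiler evaluates it and embeds the literal at every use site. From the JIT's point of view that should be the same as writing the literal by hand, which is the case that produced a single vshufps above. The sketch below is untested against the JIT versions in question, and the names `ShuffleConstants`, `Control_1_3_2_1`, `Control_3_2_1_2` and `ShuffleSelf` are hypothetical, not part of Klein#.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleConstants
{
    // Hypothetical names. Each initializer is a constant expression, so the
    // C# compiler folds it to a single byte literal (no method call remains).
    public const byte Control_1_3_2_1 = (1 << 6) | (3 << 4) | (2 << 2) | 1; // 0x79
    public const byte Control_3_2_1_2 = (3 << 6) | (2 << 4) | (1 << 2) | 2; // 0xE6

    // The control argument passed to Sse.Shuffle is a compile-time constant.
    public static Vector128<float> ShuffleSelf(Vector128<float> v)
    {
        if (!Sse.IsSupported)
            throw new PlatformNotSupportedException("SSE is required for this sketch.");
        return Sse.Shuffle(v, v, Control_1_3_2_1);
    }
}
```

The trade-off is that every control byte needs its own named constant (or an inline constant expression at the call site) instead of a reusable helper method, so this only covers shuffles whose indices are fixed in the source.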

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization
