Description
When porting Klein (a highly optimized C++ library for 3D projective geometric algebra) to Klein# (a C# .NET Core 3+ library using SIMD intrinsics), I noticed that a trivial helper function, which should evaluate to a constant, causes the JIT to generate inefficient code.
Manually replacing the function calls with a constant generates about the same SIMD code as the LLVM-compiled C++ code.
The culprit is the following method:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte _MM_SHUFFLE(int a, int b, int c, int d)
{
return (byte)(a << 6 | b << 4 | c << 2 | d);
}
Note that the C++ code cheats because it uses a macro to compute the shuffle control byte, so it doesn't suffer from this...
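To make the arithmetic concrete, here is a small standalone sketch of the values this helper produces. The class name ShuffleControlDemo and the method name MmShuffle are hypothetical and used only for illustration; the body is the same expression as in _MM_SHUFFLE above.

```csharp
using System;

public static class ShuffleControlDemo
{
    // Same packing as _MM_SHUFFLE above: each argument is a 2-bit lane
    // selector, with 'a' in the top two bits and 'd' in the bottom two.
    public static byte MmShuffle(int a, int b, int c, int d)
        => (byte)(a << 6 | b << 4 | c << 2 | d);

    public static void Main()
    {
        // 3<<6 = 192, 2<<4 = 32, 1<<2 = 4, 0 => 228
        Console.WriteLine($"0x{MmShuffle(3, 2, 1, 0):X2}"); // 0xE4
        // 0<<6 = 0, 1<<4 = 16, 2<<2 = 8, 3 => 27
        Console.WriteLine($"0x{MmShuffle(0, 1, 2, 3):X2}"); // 0x1B
    }
}
```

Every call with literal arguments reduces to a single byte like these, which is why I expected the JIT to fold it into an immediate.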
When I pass the result of _MM_SHUFFLE to the Shuffle SIMD method, I get very odd assembly code. Every call to Shuffle goes through the following subroutine, which seems to be some kind of generated jump table with one vshufps per possible control value:
00007FFF9048C72B lea r9,[7FFF9048D220h]
00007FFF9048C732 mov r9d,dword ptr [r9+rax*4]
00007FFF9048C736 lea rdx,[7FFF9048C723h]
00007FFF9048C73D add r9,rdx
00007FFF9048C740 jmp r9
00007FFF9048C743 vshufps xmm0,xmm0,xmmword ptr [r8],0
00007FFF9048C749 jmp 00007FFF9048D213
00007FFF9048C74E vshufps xmm0,xmm0,xmmword ptr [r8],1
00007FFF9048C754 jmp 00007FFF9048D213
00007FFF9048C759 vshufps xmm0,xmm0,xmmword ptr [r8],2
00007FFF9048C75F jmp 00007FFF9048D213
00007FFF9048C764 vshufps xmm0,xmm0,xmmword ptr [r8],3
00007FFF9048C76A jmp 00007FFF9048D213
...
00007FFF9048D213 vmovupd xmmword ptr [rcx],xmm0
00007FFF9048D217 mov rax,rcx
00007FFF9048D21A ret
When I write the control parameter of the Shuffle method as a literal constant, I get a single vshufps instruction for each Shuffle. In fact, the JIT then compiles the full geometric product to almost the same code as the LLVM C++ compiler does:
00007FFF9CDC9520 vzeroupper
00007FFF9CDC9523 vmovupd xmm0,xmmword ptr [rdx]
00007FFF9CDC9527 vmovaps xmm1,xmm0
00007FFF9CDC952B vshufps xmm1,xmm1,xmm1,0
00007FFF9CDC9530 vmovupd xmm2,xmmword ptr [r8]
00007FFF9CDC9535 vmovaps xmm3,xmm2
00007FFF9CDC9539 vmulps xmm1,xmm1,xmm3
00007FFF9CDC953D vmovaps xmm3,xmm0
00007FFF9CDC9541 vshufps xmm3,xmm3,xmm3,79h
00007FFF9CDC9546 vmovaps xmm4,xmm2
00007FFF9CDC954A vshufps xmm4,xmm4,xmm4,9Dh
00007FFF9CDC954F vmulps xmm3,xmm3,xmm4
00007FFF9CDC9553 vsubps xmm1,xmm1,xmm3
00007FFF9CDC9557 vmovaps xmm3,xmm0
00007FFF9CDC955B vshufps xmm3,xmm3,xmm3,0E6h
00007FFF9CDC9560 vmovaps xmm4,xmm2
00007FFF9CDC9564 vshufps xmm4,xmm4,xmm4,2
00007FFF9CDC9569 vmulps xmm3,xmm3,xmm4
00007FFF9CDC956D vshufps xmm0,xmm0,xmm0,9Fh
00007FFF9CDC9572 vshufps xmm2,xmm2,xmm2,7Bh
00007FFF9CDC9577 vmulps xmm0,xmm0,xmm2
00007FFF9CDC957B vaddps xmm0,xmm3,xmm0
00007FFF9CDC957F vxorps xmm2,xmm2,xmm2
00007FFF9CDC9583 vmovss xmm3,dword ptr [7FFF9CDC95C0h]
00007FFF9CDC958B vmovss xmm2,xmm2,xmm3
00007FFF9CDC958F vxorps xmm0,xmm0,xmm2
00007FFF9CDC9593 vaddps xmm0,xmm1,xmm0
00007FFF9CDC9597 vmovupd xmmword ptr [rcx],xmm0
00007FFF9CDC959B mov rax,rcx
00007FFF9CDC959E ret
Is a workaround possible? Since C# has neither macros nor constexpr, I don't think I have many options. I tried changing the method so that byte arguments are passed and no conversion from int to byte happens, but that didn't help.
category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium