Code generation of HW intrinsics / loading struct into vector register #31692

@luithefirst

Hello!

I’m trying to make use of HW intrinsics to improve the performance of a vector data type:
struct MyVector4 { public float X, Y, Z, W; }
My goal is to implement typical properties/methods such as Length, LengthSquared, DotProduct with superior performance to a naive implementation.
I started with the Length property and wanted to make use of Sse41.DotProduct. I have experimented with different implementations, however, I did not manage to get the code I was looking for.
For benchmarking and evaluation I'm using BenchmarkDotNet. The test routine calls the property in a loop like this:

[Benchmark]
public float Vec4Length_Sse_V1()
{
    var local = arr;
    var sum = 0.0f;
    for (int i = 0; i < local.Length; i++)
        sum += local[i].Length_Sse_V1;
    return sum;
}
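For comparison, the Vec4Length_Reference row in the results below presumably measures a naive scalar implementation along these lines (a sketch; the actual code in the repository may differ):

```csharp
using System;

// Sketch of the scalar baseline (assumption: the repository's actual
// reference implementation may differ slightly). Four multiplies,
// three adds, and one scalar sqrt -- this is the code the SSE
// versions are trying to beat.
struct MyVector4
{
    public float X, Y, Z, W;

    public float Length => MathF.Sqrt(X * X + Y * Y + Z * Z + W * W);
}
```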

Here are my three implementations using Sse, with details of the generated code in the loop body. I've marked the curious regions with ***.

  1. Using fixed:
float Length_Sse_V1 {
    get {
        unsafe {
            fixed (MyVector4* pthis = &this)
            {
                var mmx = Sse.LoadVector128((float*)pthis);
                mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
                var l2 = mmx.GetElement(0);
                return MathF.Sqrt(l2);
            }
        }
    }
}
8fe8975b  movsxd  r8,ecx
8fe8975e  shl     r8,4
8fe89762  lea     r8,[rax+r8+10h]
8fe89767  xor     r9d,r9d		***
8fe8976a  mov     qword ptr [rsp],r9	***
8fe8976e  mov     qword ptr [rsp],r8	***
8fe89772  vmovups xmm1,xmmword ptr [r8]
8fe89777  vdpps   xmm1,xmm1,xmm1,0F1h
8fe8977d  vsqrtss xmm1,xmm1,xmm1
8fe89781  mov     qword ptr [rsp],r9	***
8fe89785  vaddss  xmm0,xmm0,xmm1
8fe89789  inc     ecx
8fe8978b  cmp     ecx,edx
8fe8978d  jl      00007fff`8fe8975b
  2. Using a helper function:
static unsafe float Length_Sse_V2_Helper(MyVector4 vec)
{
    var ptr = (float*)&vec;
    var mmx = Sse.LoadVector128(ptr);
    mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
    var l2 = mmx.GetElement(0);
    return MathF.Sqrt(l2);
}
float Length_Sse_V2
{
    get { return Length_Sse_V2_Helper(this); }
}
8fe89758  movsxd  r8,ecx
8fe8975b  shl     r8,4
8fe8975f  lea     r8,[rax+r8+10h]
8fe89764  vmovdqu xmm1,xmmword ptr [r8]		***
8fe89769  vmovdqu xmmword ptr [rsp+8],xmm1	***
8fe8976f  lea     r8,[rsp+8]			***
8fe89774  vmovups xmm1,xmmword ptr [r8]
8fe89779  vdpps   xmm1,xmm1,xmm1,0F1h
8fe8977f  vsqrtss xmm1,xmm1,xmm1
8fe89783  vaddss  xmm0,xmm0,xmm1
8fe89787  inc     ecx
8fe89789  cmp     ecx,edx
8fe8978b  jl      00007fff`8fe89758
  3. Helper inlined:
float Length_Sse_V3 {
    get { 
        unsafe {
            var vec = this;
            var ptr = (float*)&vec;
            var mmx = Sse.LoadVector128(ptr);
            mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
            var l2 = mmx.GetElement(0);
            return MathF.Sqrt(l2);
        }
    }
}
8fe69764  movsxd  r8,ecx
8fe69767  shl     r8,4
8fe6976b  lea     r8,[rax+r8+10h]
8fe69770  lea     r9,[rsp+8]			***
8fe69775  vxorps  xmm1,xmm1,xmm1		***
8fe69779  vmovdqu xmmword ptr [r9],xmm1 	***
8fe6977e  vmovdqu xmm1,xmmword ptr [r8]
8fe69783  vmovdqu xmmword ptr [rsp+8],xmm1	***
8fe69789  lea     r8,[rsp+8]			***
8fe6978e  vmovups xmm1,xmmword ptr [r8]     	***
8fe69793  vdpps   xmm1,xmm1,xmm1,0F1h
8fe69799  vsqrtss xmm1,xmm1,xmm1
8fe6979d  vaddss  xmm0,xmm0,xmm1
8fe697a1  inc     ecx
8fe697a3  cmp     ecx,edx
8fe697a5  jl      00007fff`8fe69764

The benchmark results are the following:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Job-OBQONZ : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Runtime=.NET Core 3.0  
| Method               |     Mean |     Error |    StdDev |
|--------------------- |---------:|----------:|----------:|
| Vec4Length_Reference | 1.451 ms | 0.0046 ms | 0.0041 ms |
| Vec4Length_Sse_V1    | 1.191 ms | 0.0052 ms | 0.0049 ms |
| Vec4Length_Sse_V2    | 1.283 ms | 0.0035 ms | 0.0029 ms |
| Vec4Length_Sse_V3    | 1.455 ms | 0.0145 ms | 0.0121 ms |
| Vec4Length_Sse_Array | 1.107 ms | 0.0231 ms | 0.0308 ms |

Note: On an i7-4790K, V2 performs slightly better than V1.

There is already a small performance increase with V1 and V2, but the optimal code I'm targeting is what I get by pinning the entire array and using the offset pointer directly in LoadVector128:

8fe49787 4c63c1          movsxd  r8,ecx
8fe4978a 49c1e004        shl     r8,4
8fe4978e c4a178100c00    vmovups xmm1,xmmword ptr [rax+r8]
8fe49794 c4e37140c9f1    vdpps   xmm1,xmm1,xmm1,0F1h
8fe4979a c5f251c9        vsqrtss xmm1,xmm1,xmm1
8fe4979e c5fa58c1        vaddss  xmm0,xmm0,xmm1
8fe497a2 ffc1            inc     ecx
8fe497a4 3bca            cmp     ecx,edx
8fe497a6 7cdf            jl      00007fff`8fe49787
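For context, the array-pinning variant that produces this code could look roughly like the following (a sketch reconstructed from the description above; the repository's actual Vec4Length_Sse_Array benchmark may differ in its details):

```csharp
[Benchmark]
public unsafe float Vec4Length_Sse_Array()
{
    var local = arr;
    var sum = 0.0f;
    // Pin the whole array once; inside the loop each element's address is
    // just base + i*16, which the JIT folds into the vmovups addressing
    // mode with no per-element copy to the stack.
    fixed (MyVector4* pArr = local)
    {
        for (int i = 0; i < local.Length; i++)
        {
            var mmx = Sse.LoadVector128((float*)(pArr + i));
            mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
            sum += MathF.Sqrt(mmx.GetElement(0));
        }
    }
    return sum;
}
```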

The difficulty when implementing this as a property seems to be loading the data from the struct into the vector registers. There are always some additional instructions whose purpose I do not understand; I suppose they are leftovers from the compiler optimizing the abstraction away. I'm no expert in this field and would be grateful if you could comment on my implementations and point out why the generated code is the way it is. Maybe it is also an interesting test case, and it is possible to find improvements.
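One workaround that might be worth benchmarking (an assumption on my part, not taken from the repository, and not verified against this JIT version) is to reinterpret the struct through Unsafe.As instead of taking its address, which sidesteps the fixed/address-taken path that appears to force the stack spill:

```csharp
// Requires: using System.Runtime.CompilerServices; (Unsafe)
//           using System.Runtime.Intrinsics;
//           using System.Runtime.Intrinsics.X86;
float Length_Sse_Unsafe
{
    get
    {
        // Reinterpret the 16-byte struct as a Vector128<float> without
        // taking a pointer. Whether this actually removes the extra
        // moves depends on the JIT, so it needs measuring.
        var mmx = Unsafe.As<MyVector4, Vector128<float>>(ref Unsafe.AsRef(in this));
        mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
        return MathF.Sqrt(mmx.GetElement(0));
    }
}
```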

You can find the entire code in this repository: https://github.com/luithefirst/IntrinsicsCodeGen
Thanks!

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

Labels: area-CodeGen-coreclr, optimization
