Description
Hello!
I’m trying to use hardware intrinsics to improve the performance of a vector data type:
struct MyVector4 { public float X, Y, Z, W; }
My goal is to implement typical properties and methods such as Length, LengthSquared, and DotProduct so that they outperform a naive scalar implementation.
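For context, the kind of naive baseline meant here can be sketched as below. This is only an illustration under my own assumptions: the actual Vec4Length_Reference code is in the repository, and the property bodies shown here are not copied from it.

```csharp
using System;

struct MyVector4
{
    public float X, Y, Z, W;

    // Scalar baseline (assumed shape): four multiplies and three adds.
    public float LengthSquared => X * X + Y * Y + Z * Z + W * W;

    // One scalar square root on top of the squared length.
    public float Length => MathF.Sqrt(LengthSquared);
}
```

The intrinsics versions below try to replace the multiply/add chain with a single dot-product instruction.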
I started with the Length property and wanted to use Sse41.DotProduct. I have experimented with different implementations, but none of them produced the code I was looking for.
For benchmarking and evaluation I'm using BenchmarkDotNet. The test routine calls the property in a loop like this:
[Benchmark]
public float Vec4Length_Sse_V1()
{
    var local = arr;
    var sum = 0.0f;
    for (int i = 0; i < local.Length; i++)
        sum += local[i].Length_Sse_V1;
    return sum;
}
Here are my three SSE implementations, each followed by the code generated for the loop body. I've marked the instructions I find puzzling with ***.
- Using fixed:
float Length_Sse_V1 {
    get {
        unsafe {
            fixed (MyVector4* pthis = &this)
            {
                var mmx = Sse.LoadVector128((float*)pthis);
                mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
                var l2 = mmx.GetElement(0);
                return MathF.Sqrt(l2);
            }
        }
    }
}
8fe8975b movsxd r8,ecx
8fe8975e shl r8,4
8fe89762 lea r8,[rax+r8+10h]
8fe89767 xor r9d,r9d ***
8fe8976a mov qword ptr [rsp],r9 ***
8fe8976e mov qword ptr [rsp],r8 ***
8fe89772 vmovups xmm1,xmmword ptr [r8]
8fe89777 vdpps xmm1,xmm1,xmm1,0F1h
8fe8977d vsqrtss xmm1,xmm1,xmm1
8fe89781 mov qword ptr [rsp],r9 ***
8fe89785 vaddss xmm0,xmm0,xmm1
8fe89789 inc ecx
8fe8978b cmp ecx,edx
8fe8978d jl 00007fff`8fe8975b
- Using a helper function:
static unsafe float Length_Sse_V2_Helper(MyVector4 vec)
{
    var ptr = (float*)&vec;
    var mmx = Sse.LoadVector128(ptr);
    mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
    var l2 = mmx.GetElement(0);
    return MathF.Sqrt(l2);
}

float Length_Sse_V2
{
    get { return Length_Sse_V2_Helper(this); }
}
8fe89758 movsxd r8,ecx
8fe8975b shl r8,4
8fe8975f lea r8,[rax+r8+10h]
8fe89764 vmovdqu xmm1,xmmword ptr [r8] ***
8fe89769 vmovdqu xmmword ptr [rsp+8],xmm1 ***
8fe8976f lea r8,[rsp+8] ***
8fe89774 vmovups xmm1,xmmword ptr [r8]
8fe89779 vdpps xmm1,xmm1,xmm1,0F1h
8fe8977f vsqrtss xmm1,xmm1,xmm1
8fe89783 vaddss xmm0,xmm0,xmm1
8fe89787 inc ecx
8fe89789 cmp ecx,edx
8fe8978b jl 00007fff`8fe89758
- Helper inlined by hand:
float Length_Sse_V3 {
    get {
        unsafe {
            var vec = this;
            var ptr = (float*)&vec;
            var mmx = Sse.LoadVector128(ptr);
            mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
            var l2 = mmx.GetElement(0);
            return MathF.Sqrt(l2);
        }
    }
}
8fe69764 movsxd r8,ecx
8fe69767 shl r8,4
8fe6976b lea r8,[rax+r8+10h]
8fe69770 lea r9,[rsp+8] ***
8fe69775 vxorps xmm1,xmm1,xmm1 ***
8fe69779 vmovdqu xmmword ptr [r9],xmm1 ***
8fe6977e vmovdqu xmm1,xmmword ptr [r8]
8fe69783 vmovdqu xmmword ptr [rsp+8],xmm1 ***
8fe69789 lea r8,[rsp+8] ***
8fe6978e vmovups xmm1,xmmword ptr [r8] ***
8fe69793 vdpps xmm1,xmm1,xmm1,0F1h
8fe69799 vsqrtss xmm1,xmm1,xmm1
8fe6979d vaddss xmm0,xmm0,xmm1
8fe697a1 inc ecx
8fe697a3 cmp ecx,edx
8fe697a5 jl 00007fff`8fe69764
The benchmark results are the following:
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.100
[Host] : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
Job-OBQONZ : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
Runtime=.NET Core 3.0
| Method | Mean | Error | StdDev |
|---|---|---|---|
| Vec4Length_Reference | 1.451 ms | 0.0046 ms | 0.0041 ms |
| Vec4Length_Sse_V1 | 1.191 ms | 0.0052 ms | 0.0049 ms |
| Vec4Length_Sse_V2 | 1.283 ms | 0.0035 ms | 0.0029 ms |
| Vec4Length_Sse_V3 | 1.455 ms | 0.0145 ms | 0.0121 ms |
| Vec4Length_Sse_Array | 1.107 ms | 0.0231 ms | 0.0308 ms |
Note: On an i7-4790K, V2 performs slightly better than V1.
V1 and V2 already give a small performance gain, but the code I'm aiming for is what I get by pinning the entire array and passing the offset pointer directly to LoadVector128:
8fe49787 4c63c1 movsxd r8,ecx
8fe4978a 49c1e004 shl r8,4
8fe4978e c4a178100c00 vmovups xmm1,xmmword ptr [rax+r8]
8fe49794 c4e37140c9f1 vdpps xmm1,xmm1,xmm1,0F1h
8fe4979a c5f251c9 vsqrtss xmm1,xmm1,xmm1
8fe4979e c5fa58c1 vaddss xmm0,xmm0,xmm1
8fe497a2 ffc1 inc ecx
8fe497a4 3bca cmp ecx,edx
8fe497a6 7cdf jl 00007fff`8fe49787
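For readers who don't want to open the repository, a loop of roughly the following shape is the kind of code that pins the array once and loads each element straight from the pinned pointer. This is a minimal sketch, not the repository code: the class and method names are made up for illustration, and it needs AllowUnsafeBlocks and a CPU with SSE4.1.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

struct MyVector4 { public float X, Y, Z, W; }

// Hypothetical container type, named here only for the sketch.
static class PinnedArraySketch
{
    public static unsafe float SumLengths(MyVector4[] arr)
    {
        if (!Sse41.IsSupported)
            throw new PlatformNotSupportedException("SSE4.1 is required.");

        var sum = 0.0f;
        // Pin the whole array once, outside the loop.
        fixed (MyVector4* p = arr)
        {
            for (int i = 0; i < arr.Length; i++)
            {
                // Load the 16 bytes of element i directly from the array.
                var v = Sse.LoadVector128((float*)(p + i));
                // 0xF1: multiply all four lanes, write the sum to lane 0.
                v = Sse41.DotProduct(v, v, 0xF1);
                sum += MathF.Sqrt(v.GetElement(0));
            }
        }
        return sum;
    }
}
```

Because the pointer is formed from the array base plus an index, there is no per-element copy of the struct to the stack, which is what the property-based versions seem unable to avoid.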
The difficulty when implementing this as a property seems to be loading the data from the struct into the vector registers. There are always some additional instructions whose purpose I do not understand; I suppose they are left over from the compiler optimizing the abstraction away. I'm no expert in this field and would be grateful if you could comment on my implementation and explain why the generated code looks the way it does. It may also be an interesting test case that leads to codegen improvements.
You can find the entire code in this repository: https://github.com/luithefirst/IntrinsicsCodeGen
Thanks!