
Commit 7eae2dd

lemire and Ubuntu authored
fix: optimize the ARM function for systems with weak SIMD performance (#50)
* fix: optimize the ARM function for systems with weak SIMD performance
* updating results on README
* [no-ci] Updating v1 numbers
* [no-ci] Updating qualcomm numbers
* Update UTF8.cs

---------

Co-authored-by: Ubuntu <[email protected]>
1 parent 6f92b06 commit 7eae2dd

File tree

2 files changed: +74 −45 lines changed

README.md (+32 −32)
@@ -145,38 +145,38 @@ faster than the standard library.
 | Latin-Lipsum | 87 | 38 | 2.3 x |
 | Russian-Lipsum | 7.4 | 2.7 | 2.7 x |
 
-On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over four times
+On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
 faster than the standard library.
 
 | data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
 |:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 12 | 8.7 | 1.4 x |
-| Arabic-Lipsum | 3.4 | 2.0 | 1.7 x |
-| Chinese-Lipsum | 3.4 | 2.6 | 1.3 x |
-| Emoji-Lipsum | 3.4 | 0.8 | 4.3 x |
-| Hebrew-Lipsum | 3.4 | 2.0 | 1.7 x |
-| Hindi-Lipsum | 3.4 | 1.6 | 2.1 x |
-| Japanese-Lipsum | 3.4 | 2.4 | 1.4 x |
-| Korean-Lipsum | 3.4 | 1.3 | 2.6 x |
+| Twitter.json | 14 | 8.7 | 1.4 x |
+| Arabic-Lipsum | 4.2 | 2.0 | 2.1 x |
+| Chinese-Lipsum | 4.2 | 2.6 | 1.6 x |
+| Emoji-Lipsum | 4.2 | 0.8 | 5.3 x |
+| Hebrew-Lipsum | 4.2 | 2.0 | 2.1 x |
+| Hindi-Lipsum | 4.2 | 1.6 | 2.6 x |
+| Japanese-Lipsum | 4.2 | 2.4 | 1.8 x |
+| Korean-Lipsum | 4.2 | 1.3 | 3.2 x |
 | Latin-Lipsum | 42 | 17 | 2.5 x |
-| Russian-Lipsum | 3.3 | 0.95 | 3.5 x |
+| Russian-Lipsum | 4.2 | 0.95 | 4.4 x |
 
 
 On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
 boost as the Neoverse V1.
 
 | data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
 |:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 15 | 10 | 1.5 x |
-| Arabic-Lipsum | 4.0 | 2.3 | 1.7 x |
-| Chinese-Lipsum | 4.0 | 2.9 | 1.4 x |
-| Emoji-Lipsum | 4.0 | 0.9 | 4.4 x |
-| Hebrew-Lipsum | 4.0 | 2.3 | 1.7 x |
-| Hindi-Lipsum | 4.0 | 1.9 | 2.1 x |
-| Japanese-Lipsum | 4.0 | 2.7 | 1.5 x |
-| Korean-Lipsum | 4.0 | 1.5 | 2.7 x |
+| Twitter.json | 17 | 10 | 1.7 x |
+| Arabic-Lipsum | 5.0 | 2.3 | 2.2 x |
+| Chinese-Lipsum | 5.0 | 2.9 | 1.7 x |
+| Emoji-Lipsum | 5.0 | 0.9 | 5.5 x |
+| Hebrew-Lipsum | 5.0 | 2.3 | 2.2 x |
+| Hindi-Lipsum | 5.0 | 1.9 | 2.6 x |
+| Japanese-Lipsum | 5.0 | 2.7 | 1.9 x |
+| Korean-Lipsum | 5.0 | 1.5 | 3.3 x |
 | Latin-Lipsum | 50 | 20 | 2.5 x |
-| Russian-Lipsum | 4.0 | 1.2 | 3.3 x |
+| Russian-Lipsum | 5.0 | 1.2 | 5.2 x |
 
 
 On a Neoverse N1 (Graviton 2), our validation function is 1.3 to over four times
@@ -195,23 +195,23 @@ faster than the standard library.
 | Latin-Lipsum | 42 | 17 | 2.5 x |
 | Russian-Lipsum | 3.3 | 0.95 | 3.5 x |
 
-On a Neoverse N1 (Graviton 2), our validation function is up to three times
+On a Neoverse N1 (Graviton 2), our validation function is up to over three times
 faster than the standard library.
 
+
 | data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
 |:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 7.0 | 5.7 | 1.2 x |
-| Arabic-Lipsum | 2.2 | 0.9 | 2.4 x |
-| Chinese-Lipsum | 2.1 | 1.8 | 1.1 x |
-| Emoji-Lipsum | 1.8 | 0.7 | 2.6 x |
-| Hebrew-Lipsum | 2.0 | 0.9 | 2.2 x |
-| Hindi-Lipsum | 2.0 | 1.0 | 2.0 x |
-| Japanese-Lipsum | 2.1 | 1.7 | 1.2 x |
-| Korean-Lipsum | 2.2 | 1.0 | 2.2 x |
-| Latin-Lipsum | 24 | 13 | 1.8 x |
-| Russian-Lipsum | 2.1 | 0.7 | 3.0 x |
-
-One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best and it is not necessary to apply SimdUnicode is all cases. We expect good performance on recent ARM-based Qualcomm processors.
+| Twitter.json | 7.8 | 5.7 | 1.4 x |
+| Arabic-Lipsum | 2.5 | 0.9 | 2.8 x |
+| Chinese-Lipsum | 2.5 | 1.8 | 1.4 x |
+| Emoji-Lipsum | 2.5 | 0.7 | 3.6 x |
+| Hebrew-Lipsum | 2.5 | 0.9 | 2.7 x |
+| Hindi-Lipsum | 2.3 | 1.0 | 2.3 x |
+| Japanese-Lipsum | 2.4 | 1.7 | 1.4 x |
+| Korean-Lipsum | 2.5 | 1.0 | 2.5 x |
+| Latin-Lipsum | 23 | 13 | 1.8 x |
+| Russian-Lipsum | 2.3 | 0.7 | 3.3 x |
+
 
 ## Building the library
 
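The tables above compare the library's validation routine against the .NET standard library on several data sets. As a point of reference, here is a minimal usage sketch, not part of this commit: the `GetPointerToFirstInvalidByteArm64` signature is taken from src/UTF8.cs below, while the program scaffolding, the sample input, and the assumption that a fully valid buffer yields a pointer one past the end are illustrative.

```csharp
// Minimal sketch, not from the repository: calling the ARM64 validation
// routine changed in this commit. Requires an ARM64 CPU (AdvSimd) and
// AllowUnsafeBlocks; the convention that a valid buffer returns a pointer
// one past the end is assumed here.
using System;
using System.Text;

class ValidateSketch
{
    static unsafe void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("こんにちは, world!"); // illustrative input
        fixed (byte* p = data)
        {
            byte* firstInvalid = SimdUnicode.UTF8.GetPointerToFirstInvalidByteArm64(
                p, data.Length,
                out int utf16Adjustment, out int scalarAdjustment);
            bool isValid = firstInvalid == p + data.Length;
            Console.WriteLine($"valid: {isValid}");
        }
    }
}
```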

src/UTF8.cs (+42 −13)
@@ -1277,7 +1277,6 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
             }
             return GetPointerToFirstInvalidByteScalar(pInputBuffer + processedLength, inputLength - processedLength, out utf16CodeUnitCountAdjustment, out scalarCountAdjustment);
         }
-
         public unsafe static byte* GetPointerToFirstInvalidByteArm64(byte* pInputBuffer, int inputLength, out int utf16CodeUnitCountAdjustment, out int scalarCountAdjustment)
         {
             int processedLength = 0;
@@ -1360,18 +1359,31 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
             // The block goes from processedLength to processedLength/16*16.
             int contbytes = 0; // number of continuation bytes in the block
             int n4 = 0; // number of 4-byte sequences that start in this block
+            /////
+            // Design:
+            // Instead of updating n4 and contbytes continuously, we accumulate
+            // the values in n4v and contv, while using overflowCounter to make
+            // sure we do not overflow. This allows you to reach good performance
+            // on systems where summing across vectors is slow.
+            ////
+            Vector128<sbyte> n4v = Vector128<sbyte>.Zero;
+            Vector128<sbyte> contv = Vector128<sbyte>.Zero;
+            int overflowCounter = 0;
             for (; processedLength + 16 <= inputLength; processedLength += 16)
             {
 
                 Vector128<byte> currentBlock = AdvSimd.LoadVector128(pInputBuffer + processedLength);
                 if ((currentBlock & v80) == Vector128<byte>.Zero)
-                // We could also use (AdvSimd.Arm64.MaxAcross(currentBlock).ToScalar() <= 127) but it is slower on some
-                // hardware.
                 {
                     // We have an ASCII block, no need to process it, but
                     // we need to check if the previous block was incomplete.
                     if (prevIncomplete != Vector128<byte>.Zero)
                     {
+                        contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+                        if (n4v != Vector128<sbyte>.Zero)
+                        {
+                            n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+                        }
                         int off = processedLength >= 3 ? processedLength - 3 : processedLength;
                         byte* invalidBytePointer = SimdUnicode.UTF8.SimpleRewindAndValidateWithErrors(16 - 3, pInputBuffer + processedLength - 3, inputLength - processedLength + 3);
                         // So the code is correct up to invalidBytePointer
@@ -1432,11 +1444,13 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
                     Vector128<byte> must23 = AdvSimd.Or(isThirdByte, isFourthByte);
                     Vector128<byte> must23As80 = AdvSimd.And(must23, v80);
                     Vector128<byte> error = AdvSimd.Xor(must23As80, sc);
-                    // AdvSimd.Arm64.MaxAcross(error) works, but it might be slower
-                    // than AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(error)) on some
-                    // hardware:
                     if (error != Vector128<byte>.Zero)
                     {
+                        contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+                        if (n4v != Vector128<sbyte>.Zero)
+                        {
+                            n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+                        }
                         byte* invalidBytePointer;
                         if (processedLength == 0)
                         {
@@ -1459,17 +1473,32 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
                            return invalidBytePointer;
                        }
                    prevIncomplete = AdvSimd.SubtractSaturate(currentBlock, maxValue);
-                    contbytes += -AdvSimd.Arm64.AddAcross(AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont)).ToScalar();
-                    Vector128<byte> largerthan0f = AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne);
-                    if (largerthan0f != Vector128<byte>.Zero)
+                    contv += AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont);
+                    n4v += AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne).AsSByte();
+                    overflowCounter++;
+                    // We have a risk of overflow if overflowCounter reaches 255,
+                    // in which case, we empty contv and n4v, and update contbytes and
+                    // n4.
+                    if (overflowCounter == 0xff)
                     {
-                        byte n4add = (byte)AdvSimd.Arm64.AddAcross(largerthan0f).ToScalar();
-                        int negn4add = (int)(byte)-n4add;
-                        n4 += negn4add;
+                        overflowCounter = 0;
+                        contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+                        contv = Vector128<sbyte>.Zero;
+                        if (n4v != Vector128<sbyte>.Zero)
+                        {
+                            n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+                            n4v = Vector128<sbyte>.Zero;
+                        }
                     }
                 }
             }
-            bool hasIncompete = (prevIncomplete != Vector128<byte>.Zero);
+            contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+            if (n4v != Vector128<sbyte>.Zero)
+            {
+                n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+            }
+
+            bool hasIncompete = (prevIncomplete != Vector128<byte>.Zero);
             if (processedLength < inputLength || hasIncompete)
             {
                 byte* invalidBytePointer;
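To make the design comment in the diff concrete: the loop no longer performs a horizontal reduction (`AddAcross`) on every 16-byte block; matches are accumulated per lane in `Vector128<sbyte>` registers and folded into scalars with `AddAcrossWidening` only occasionally, which helps on cores where cross-lane sums are slow. The sketch below, which is not from the repository, applies the same deferred-reduction idea to a standalone continuation-byte counter; the drain interval of 127, the method name, and the scalar tail loop are illustrative choices.

```csharp
// Sketch only (illustrative names and constants): count UTF-8 continuation
// bytes (0b10xxxxxx) with a per-lane vector accumulator that is drained
// periodically, mirroring the contv/overflowCounter idea above. ARM64 only.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class DeferredReductionSketch
{
    public static unsafe int CountContinuationBytes(byte* p, int length)
    {
        // As signed bytes, continuation bytes 0x80..0xBF are -128..-65,
        // so "lane <= -65" flags exactly the continuation bytes.
        Vector128<sbyte> threshold = Vector128.Create((sbyte)-65);
        Vector128<sbyte> acc = Vector128<sbyte>.Zero; // each matching lane contributes -1 per block
        int total = 0;
        int blocksSinceDrain = 0;
        int i = 0;
        for (; i + 16 <= length; i += 16)
        {
            Vector128<sbyte> block = AdvSimd.LoadVector128(p + i).AsSByte();
            // Comparison lanes are all-ones (-1) on match, 0 otherwise.
            acc += AdvSimd.CompareLessThanOrEqual(block, threshold);
            // A lane moves by at most -1 per block, so draining at least every
            // 127 blocks keeps every sbyte lane in range.
            if (++blocksSinceDrain == 127)
            {
                total += -AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar();
                acc = Vector128<sbyte>.Zero;
                blocksSinceDrain = 0;
            }
        }
        total += -AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar(); // final drain
        for (; i < length; i++) // scalar tail
        {
            if ((p[i] & 0xC0) == 0x80) { total++; }
        }
        return total;
    }
}
```

The point of the deferral is that the expensive cross-lane `AddAcrossWidening` runs once per drain instead of once per block, which is where weak-SIMD designs such as the Neoverse N1 lose time.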

0 commit comments