Commit ac28751

Update README with ARM64 SIMD sizes 33-64 benchmarks
Document that byte/sbyte extend to 64 elements (4 regs, single TBL group) and short/ushort/char extend to 64 elements (8 regs, multi-stage TBL/TBX) on ARM64. Note that int/uint/float cap at 32 due to TBL overhead. Add benchmark table showing 19-36x speedups for byte/short at sizes 33-64 on Ampere Neoverse-N2 ARM64.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 9da06ae commit ac28751

1 file changed

Lines changed: 45 additions & 19 deletions

README.md

Original file line number | Diff line number | Diff line change
@@ -266,10 +266,11 @@ sort helpers; for custom types the JIT can devirtualize `CompareTo` on value
266266
types, keeping the call nearly as cheap.
267267

268268
For `byte` and `sbyte`, the generator additionally emits SIMD vectorization
269-
when available — AVX2 on x86 and AdvSimd (NEON) on ARM64. All 27-28 elements
270-
fit in a single vector register, allowing each of the 13 network steps to
271-
execute as a vectorized shuffle + min/max + blend operation instead of
272-
individual scalar compare-and-swap branches.
269+
when available — AVX2 on x86 and AdvSimd (NEON) on ARM64. For sizes up to 32,
270+
all elements fit in a single vector register (or two on ARM64), allowing each
271+
network step to execute as a vectorized shuffle + min/max + blend operation.
272+
On ARM64, SIMD extends up to 64 elements using up to four `Vector128<byte>`
273+
registers with single-group TBL4 lookups.
273274

274275
For `int` and `uint`, AVX2 SIMD is emitted on x86 with four `Vector256<int>`
275276
registers (8 elements each). Cross-vector shuffles use `PermuteVar8x32` with
@@ -279,21 +280,25 @@ On CPUs with AVX-512F, an AVX-512F path uses two `Vector512<int>` registers
279280

280281
For `short`, `ushort`, and `char` (16-bit types), AVX-512 SIMD is emitted on x86
281282
when available, packing all elements into a single `Vector512<ushort>`. On
282-
ARM64, four `Vector128<byte>` vectors are used for the same 16-bit types.
283+
ARM64, `Vector128<byte>` registers are used for the same 16-bit types — up to
284+
four registers for sizes ≤32 and up to eight registers for sizes 33-64 using
285+
multi-stage TBL/TBX chains.
283286
When all elements of a shuffled vector come from a single source register,
284287
`Vector128.Shuffle` (TBL1) is used; otherwise `VectorTableLookup` (TBL4)
285-
provides cross-vector shuffles. This TBL1 optimization is critical for ARM64
286-
processors like Ampere Altra/Neoverse where TBL4 has significantly higher
287-
latency than TBL1. On platforms without SIMD support, this falls back to the
288-
scalar unrolled sort.
289-
290-
For `int`, `uint`, and `float` (32-bit types), ARM64 AdvSimd SIMD is emitted when
291-
available. The 27-28 elements require seven `Vector128` registers — exceeding
292-
TBL4's 4-register table limit. When all elements of a shuffled vector come from
293-
a single source register, `Vector128.Shuffle` (TBL1) is used directly; otherwise
294-
a two-stage TBL/TBX lookup splits elements into Table A (0-15) and Table B
295-
(16-27) with `VectorTableLookupExtension` (TBX) chaining. An early-exit check
296-
detects already-sorted input and skips the SIMD path entirely.
288+
provides cross-vector shuffles, with `VectorTableLookupExtension` (TBX)
289+
chaining additional groups when registers exceed the 4-register TBL limit.
290+
This TBL1 optimization is critical for ARM64 processors like Ampere
291+
Altra/Neoverse where TBL4 has significantly higher latency than TBL1.
292+
On platforms without SIMD support, this falls back to the scalar unrolled sort.
293+
294+
For `int`, `uint`, and `float` (32-bit types), ARM64 AdvSimd SIMD is emitted for
295+
sizes up to 32, using up to eight `Vector128` registers. For sizes 27-28, seven
296+
registers are used with two-stage TBL/TBX cross-vector shuffles. When all
297+
elements of a shuffled vector come from a single source register,
298+
`Vector128.Shuffle` (TBL1) is used directly; otherwise a multi-stage TBL/TBX
299+
lookup chains register groups. An early-exit check detects already-sorted input
300+
and skips the SIMD path entirely. For sizes beyond 32, multi-stage TBL overhead
301+
exceeds the SIMD benefit for 4-byte types, so the scalar unrolled path is used.
297302

298303
For `float`, AVX2 SIMD uses four `Vector256<float>` registers
299304
(8 elements each). Cross-vector shuffles use `PermuteVar8x32` with
@@ -508,6 +513,27 @@ Apple Silicon. The TBL1 optimization for intra-register shuffles is critical her
508513
> than ArraySort (97 ns). The optimization reduced it to 68 ns, a **2x
509514
> improvement** that made GeneratedSort 1.6x faster than ArraySort.
510515
516+
#### Sizes 33-64 (ARM64 SIMD for byte/short)
517+
518+
For sizes 33-64, ARM64 SIMD is extended for `byte`/`sbyte` (up to 4 registers,
519+
single TBL4 group) and `short`/`ushort`/`char` (up to 8 registers, multi-stage
520+
TBL/TBX). For `int`/`uint`/`float`, the scalar unrolled path is used since
521+
multi-stage TBL overhead exceeds SIMD benefit at these sizes:
522+
523+
| Type | Size | ArraySort | GeneratedSort | Speedup |
524+
|---|---|---|---|---|
525+
| byte | 34 | 2,831 ns | 85 ns | **33x** |
526+
| sbyte | 38 | 3,340 ns | 94 ns | **36x** |
527+
| short | 40 | 3,439 ns | 164 ns | **21x** |
528+
| ushort | 42 | 3,671 ns | 198 ns | **19x** |
529+
| char | 60 | 292 ns | 216 ns | **1.4x** |
530+
| float | 36 | 3,269 ns | 220 ns | **15x** |
531+
532+
> **Note:** `float` at size 36 uses the scalar unrolled path on ARM64 (not SIMD)
533+
> and is still 15x faster than `Array.Sort`. For types where .NET already has
534+
> SIMD-optimized sort (`int`, `uint`, `char`), the scalar network provides 1.3-1.4x
535+
> speedups at these sizes.
536+
511537
### int detailed results (AVX2 SIMD)
512538

513539
| Size | Kind | GeneratedSort | Ratio vs ArraySort |
@@ -539,7 +565,7 @@ Apple Silicon. The TBL1 optimization for intra-register shuffles is critical her
539565
### Sizes 33-64 (x86, scalar unrolled)
540566

541567
Networks for sizes 33-64 use best-known networks from [Dobbelaere's SorterHunter](https://github.com/bertdobbelaere/SorterHunter).
542-
These are scalar unrolled (no SIMD), but still significantly faster than `Array.Sort` / `Span.Sort` for most types:
568+
On x86, these are scalar unrolled (no SIMD). On ARM64, SIMD is used for `byte`/`sbyte` (up to 64 elements) and `short`/`ushort`/`char` (up to 64 elements) — see ARM64 section above. For all other types, the scalar path still provides significant speedups:
543569

544570
| Type | Size | SpanSort | GeneratedSort | Speedup |
545571
|---|---|---|---|---|
@@ -578,7 +604,7 @@ dotnet run --project SortingNetworks.Benchmarks -c Release -- --filter *
578604
emits optimized sorting network code (scalar + SIMD)
579605
- **SortingNetworks.Tests** -- xUnit correctness tests covering sizes 2-64
580606
across all 13 primitive types plus custom types, with stress tests using
581-
100 random seeds (420 tests)
607+
100 random seeds (419 tests)
582608
- **SortingNetworks.Benchmarks** -- BenchmarkDotNet benchmarks comparing
583609
generated sort vs `Array.Sort` for sizes 23-64 across all primitive types
584610
and custom record structs
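
The register counts quoted in the README changes above follow from the 128-bit (16-byte) width of a `Vector128`. As a quick check, writing $R(n, w)$ (notation introduced here, not in the README) for the number of registers needed to hold $n$ elements of $w$ bytes each:

$$R(n, w) = \left\lceil \frac{n \cdot w}{16} \right\rceil$$

$$R(64, 1) = 4, \qquad R(32, 2) = 4, \qquad R(64, 2) = 8, \qquad R(28, 4) = 7, \qquad R(32, 4) = 8$$

A single TBL instruction accepts at most four table registers (64 bytes), so any case with $R \le 4$ fits in one lookup group, which matches the single-group TBL4 description for `byte`/`sbyte` up to 64 elements. Cases with $R > 4$ (`short` at 33-64, `int` at 27-32) need a second group chained with TBX, as the diff describes.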

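
The "shuffle + min/max + blend" step described in the README text can be illustrated with the portable `Vector128` helpers available in .NET 7 and later. This is a minimal sketch of one compare-exchange layer on a single 16-byte register, not the code the generator actually emits; the class name, method name, and lane pairing are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class CompareExchangeSketch
{
    // Hypothetical helper (not from the repository): one network layer that
    // compare-exchanges the adjacent lane pairs (0,1), (2,3), ..., (14,15).
    public static Vector128<byte> Step(Vector128<byte> v)
    {
        // Partner index for every lane: each lane reads the other half of its pair.
        Vector128<byte> partnerIndex = Vector128.Create(
            (byte)1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14);

        // All-ones in lanes that keep the minimum (the lower lane of each pair).
        Vector128<byte> keepMin = Vector128.Create(
            (byte)0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF, 0,
            0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF, 0);

        Vector128<byte> partner = Vector128.Shuffle(v, partnerIndex); // shuffle
        Vector128<byte> lo = Vector128.Min(v, partner);               // min
        Vector128<byte> hi = Vector128.Max(v, partner);               // max
        return Vector128.ConditionalSelect(keepMin, lo, hi);          // blend
    }

    public static void Main()
    {
        Vector128<byte> input = Vector128.Create(
            (byte)9, 1, 4, 7, 200, 3, 5, 5, 0, 255, 8, 2, 6, 6, 10, 11);
        // After the step, every adjacent pair is in ascending order.
        Console.WriteLine(Step(input));
    }
}
```

Because the shuffle here only reads lanes from the same register, it corresponds to the single-source case the README maps to `Vector128.Shuffle` (TBL1). A layer whose partners live in other registers would instead need the multi-register table lookups discussed in the diff.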