@@ -266,10 +266,11 @@ sort helpers; for custom types the JIT can devirtualize `CompareTo` on value
266266types, keeping the call nearly as cheap.
267267
268268For ` byte ` and ` sbyte ` , the generator additionally emits SIMD vectorization
269- when available — AVX2 on x86 and AdvSimd (NEON) on ARM64. All 27-28 elements
270- fit in a single vector register, allowing each of the 13 network steps to
271- execute as a vectorized shuffle + min/max + blend operation instead of
272- individual scalar compare-and-swap branches.
269+ when available — AVX2 on x86 and AdvSimd (NEON) on ARM64. For sizes up to 32,
270+ all elements fit in a single vector register (or two on ARM64), allowing each
271+ network step to execute as a vectorized shuffle + min/max + blend operation.
272+ On ARM64, SIMD extends up to 64 elements using up to four ` Vector128<byte> `
273+ registers with single-group TBL4 lookups.
273274
274275For ` int ` and ` uint ` , AVX2 SIMD is emitted on x86 with four ` Vector256<int> `
275276registers (8 elements each). Cross-vector shuffles use ` PermuteVar8x32 ` with
@@ -279,21 +280,25 @@ On CPUs with AVX-512F, an AVX-512F path uses two `Vector512<int>` registers
279280
280281For ` short ` , ` ushort ` , and ` char ` (16-bit types), AVX-512 SIMD is emitted on x86
281282when available, packing all elements into a single ` Vector512<ushort> ` . On
282- ARM64, four ` Vector128<byte> ` vectors are used for the same 16-bit types.
283+ ARM64, ` Vector128<byte> ` registers are used for the same 16-bit types — up to
284+ four registers for sizes ≤32 and up to eight registers for sizes 33-64 using
285+ multi-stage TBL/TBX chains.
283286When all elements of a shuffled vector come from a single source register,
284287` Vector128.Shuffle ` (TBL1) is used; otherwise ` VectorTableLookup ` (TBL4)
285- provides cross-vector shuffles. This TBL1 optimization is critical for ARM64
286- processors like Ampere Altra/Neoverse where TBL4 has significantly higher
287- latency than TBL1. On platforms without SIMD support, this falls back to the
288- scalar unrolled sort.
289-
290- For ` int ` , ` uint ` , and ` float ` (32-bit types), ARM64 AdvSimd SIMD is emitted when
291- available. The 27-28 elements require seven ` Vector128 ` registers — exceeding
292- TBL4's 4-register table limit. When all elements of a shuffled vector come from
293- a single source register, ` Vector128.Shuffle ` (TBL1) is used directly; otherwise
294- a two-stage TBL/TBX lookup splits elements into Table A (0-15) and Table B
295- (16-27) with ` VectorTableLookupExtension ` (TBX) chaining. An early-exit check
296- detects already-sorted input and skips the SIMD path entirely.
288+ provides cross-vector shuffles, with ` VectorTableLookupExtension ` (TBX)
289+ chaining additional groups when registers exceed the 4-register TBL limit.
290+ This TBL1 optimization is critical for ARM64 processors like Ampere
291+ Altra/Neoverse where TBL4 has significantly higher latency than TBL1.
292+ On platforms without SIMD support, this falls back to the scalar unrolled sort.
293+
294+ For ` int ` , ` uint ` , and ` float ` (32-bit types), ARM64 AdvSimd SIMD is emitted for
295+ sizes up to 32, using up to eight ` Vector128 ` registers. For sizes 27-28, seven
296+ registers are used with two-stage TBL/TBX cross-vector shuffles. When all
297+ elements of a shuffled vector come from a single source register,
298+ ` Vector128.Shuffle ` (TBL1) is used directly; otherwise a multi-stage TBL/TBX
299+ lookup chains register groups. An early-exit check detects already-sorted input
300+ and skips the SIMD path entirely. For sizes beyond 32, multi-stage TBL overhead
301+ exceeds the SIMD benefit for 4-byte types, so the scalar unrolled path is used.
297302
298303For ` float ` , AVX2 SIMD uses four ` Vector256<float> ` registers
299304(8 elements each). Cross-vector shuffles use ` PermuteVar8x32 ` with
@@ -508,6 +513,27 @@ Apple Silicon. The TBL1 optimization for intra-register shuffles is critical her
508513> than ArraySort (97 ns). The optimization reduced it to 68 ns, a **2x
509514> improvement** that made GeneratedSort 1.6x faster than ArraySort.
510515
516+ #### Sizes 33-64 (ARM64 SIMD for byte/short)
517+
518+ For sizes 33-64, ARM64 SIMD is extended for ` byte ` /` sbyte ` (up to 4 registers,
519+ single TBL4 group) and ` short ` /` ushort ` /` char ` (up to 8 registers, multi-stage
520+ TBL/TBX). For ` int ` /` uint ` /` float ` , the scalar unrolled path is used since
521+ multi-stage TBL overhead exceeds SIMD benefit at these sizes:
522+
523+ | Type | Size | ArraySort | GeneratedSort | Speedup |
524+ | --- | --- | --- | --- | --- |
525+ | byte | 34 | 2,831 ns | 85 ns | **33x** |
526+ | sbyte | 38 | 3,340 ns | 94 ns | **36x** |
527+ | short | 40 | 3,439 ns | 164 ns | **21x** |
528+ | ushort | 42 | 3,671 ns | 198 ns | **19x** |
529+ | char | 60 | 292 ns | 216 ns | **1.4x** |
530+ | float | 36 | 3,269 ns | 220 ns | **15x** |
531+
532+ > **Note:** `float` at size 36 uses the scalar unrolled path on ARM64 (not SIMD)
533+ > and is still 15x faster than `Array.Sort`. For types where .NET already has
534+ > SIMD-optimized sort (`int`, `uint`, `char`), the scalar network provides 1.3-1.4x
535+ > speedups at these sizes.
536+
511537### int detailed results (AVX2 SIMD)
512538
513539| Size | Kind | GeneratedSort | Ratio vs ArraySort |
@@ -539,7 +565,7 @@ Apple Silicon. The TBL1 optimization for intra-register shuffles is critical her
539565### Sizes 33-64 (x86, scalar unrolled)
540566
541567Networks for sizes 33-64 use best-known networks from [Dobbelaere's SorterHunter](https://github.com/bertdobbelaere/SorterHunter).
542- These are scalar unrolled (no SIMD), but still significantly faster than ` Array.Sort ` / ` Span.Sort ` for most types:
568+ On x86, these are scalar unrolled (no SIMD). On ARM64, SIMD is used for `byte`/`sbyte` and `short`/`ushort`/`char` at sizes up to 64 (see the ARM64 section above). For all other types, the scalar path still provides significant speedups:
543569
544570| Type | Size | SpanSort | GeneratedSort | Speedup |
545571| ---| ---| ---| ---| ---|
@@ -578,7 +604,7 @@ dotnet run --project SortingNetworks.Benchmarks -c Release -- --filter *
578604 emits optimized sorting network code (scalar + SIMD)
579605- **SortingNetworks.Tests** -- xUnit correctness tests covering sizes 2-64
580606 across all 13 primitive types plus custom types, with stress tests using
581- 100 random seeds (420 tests)
607+ 100 random seeds (419 tests)
582608- **SortingNetworks.Benchmarks** -- BenchmarkDotNet benchmarks comparing
583609 generated sort vs ` Array.Sort ` for sizes 23-64 across all primitive types
584610 and custom record structs
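
The ARM64 register counts quoted in this change (four `Vector128<byte>` registers for 64 bytes, eight for 64 16-bit elements, seven for 27-28 32-bit elements) follow from dividing the payload by the 16-byte vector width. The sketch below only restates that arithmetic and the TBL1 / single-group TBL / chained TBL+TBX choice described in the text. `RegisterPlan`, `LookupShape`, and both method names are hypothetical and are not part of the generator.

```csharp
using System;

// Hypothetical illustration only: these names do not exist in the generator.
public enum LookupShape
{
    SingleRegister,     // all lanes come from one source register (TBL1)
    SingleTableGroup,   // at most four source registers (one TBL group)
    ChainedTableGroups  // more than four registers (TBL followed by TBX)
}

public static class RegisterPlan
{
    // Width of one ARM64 AdvSimd Vector128 register, in bytes.
    private const int VectorBytes = 16;

    // A single TBL instruction can index at most four table registers.
    private const int MaxTableRegisters = 4;

    // Number of Vector128 registers needed to hold `count` elements
    // of `elementBytes` bytes each (ceiling division).
    public static int RegistersNeeded(int count, int elementBytes)
    {
        if (count <= 0) throw new ArgumentOutOfRangeException(nameof(count));
        if (elementBytes <= 0) throw new ArgumentOutOfRangeException(nameof(elementBytes));
        return (count * elementBytes + VectorBytes - 1) / VectorBytes;
    }

    // Lookup shape for one shuffled destination vector, following the
    // rule stated in the README text.
    public static LookupShape ChooseLookup(bool allFromOneRegister, int totalRegisters)
    {
        if (totalRegisters <= 0) throw new ArgumentOutOfRangeException(nameof(totalRegisters));
        if (allFromOneRegister) return LookupShape.SingleRegister;
        return totalRegisters <= MaxTableRegisters
            ? LookupShape.SingleTableGroup
            : LookupShape.ChainedTableGroups;
    }
}
```

For example, `RegisterPlan.RegistersNeeded(64, 2)` gives 8, which exceeds the four-register table limit, so a cross-register shuffle for 64 `short` values falls into `ChainedTableGroups`, matching the multi-stage TBL/TBX description above.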