SIMD: AVX-512 VBMI as primary path for all byte/sbyte sizes #107

Merged
jonathanpeppers merged 5 commits into main from jonathanpeppers/avx512-vbmi-byte-sbyte
May 3, 2026

@jonathanpeppers jonathanpeppers commented May 3, 2026

Summary

Make AVX-512 VBMI the primary SIMD path for all byte/sbyte sizes (8-64), with AVX2 as a fallback for sizes 8-32 on hardware without VBMI support.

Fixes #29

Motivation

Previously, byte/sbyte sizes ≤32 only had an AVX2 path using complex cross-lane shuffling (Permute2x128 + dual vpshufb + Or). Sizes 33-64 already used the much simpler VBMI PermuteVar64x8. This change makes VBMI the preferred path for all sizes, following the same primary/fallback pattern already used for:

  • short/ushort/char: AVX-512 BW primary → AVX2 fallback
  • double: AVX-512F primary → AVX2 fallback
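
To make the comparison concrete, here is a minimal sketch of the AVX2 technique named above (Permute2x128 + dual vpshufb + Or) applied to an arbitrary 32-byte permutation. It is not the generator's emitted code: `Avx2PermuteSketch` and `Permute32` are hypothetical names, the index tables are built at run time for readability, and a scalar loop stands in when AVX2 is unavailable so the sketch runs on any machine. It assumes .NET 7 or later.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Avx2PermuteSketch
{
    // Hypothetical helper: returns result[i] = data[indices[i]] for exactly 32 bytes.
    public static byte[] Permute32(byte[] data, byte[] indices)
    {
        if (data.Length != 32 || indices.Length != 32)
            throw new ArgumentException("expected 32 data bytes and 32 indices");

        var result = new byte[32];
        if (!Avx2.IsSupported)
        {
            // Scalar stand-in so the sketch runs without AVX2.
            for (int i = 0; i < 32; i++) result[i] = data[indices[i]];
            return result;
        }

        // vpshufb only shuffles within each 128-bit lane, so the permutation is split:
        // 'same' handles sources in the output's own lane, 'cross' handles the other lane.
        // An index byte with the high bit set (0x80) makes vpshufb write zero.
        var same = new byte[32];
        var cross = new byte[32];
        for (int i = 0; i < 32; i++)
        {
            if (indices[i] >= 32) throw new ArgumentOutOfRangeException(nameof(indices));
            bool sameLane = (indices[i] / 16) == (i / 16);
            byte inLane = (byte)(indices[i] % 16);
            same[i] = sameLane ? inLane : (byte)0x80;
            cross[i] = sameLane ? (byte)0x80 : inLane;
        }

        Vector256<byte> v = Vector256.Create<byte>(data);
        // Control 0x01 swaps the two 128-bit halves of v.
        Vector256<byte> swapped = Avx2.Permute2x128(v, v, 0x01);
        Vector256<byte> fromSame = Avx2.Shuffle(v, Vector256.Create<byte>(same));
        Vector256<byte> fromCross = Avx2.Shuffle(swapped, Vector256.Create<byte>(cross));
        // Each output byte is non-zero in at most one of the two shuffles, so Or merges them.
        Avx2.Or(fromSame, fromCross).CopyTo(result);
        return result;
    }
}
```

With VBMI the same permutation is a single `Avx512Vbmi.PermuteVar64x8` call, which is the simplification this PR relies on.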

Changes

All changes are in SimdX86Emitter.cs:

| Area | Change |
| --- | --- |
| GetGuardCondition | Always returns Avx512Vbmi.IsSupported for 1-byte types (was conditional on size > 32) |
| CanEmitAvx2Fallback | Added byte/sbyte support for sizes 8-32 |
| EmitAvx2Fallback | Routes byte types to new EmitByteAvx2 method |
| Emit() | Routes all byte sizes to EmitByteAvx512Vbmi |
| EmitByte → EmitByteAvx2 | Renamed; generates SortSimdAvx2_ fallback methods |
| EmitByteAvx512Vbmi | Extended to handle sizes 8-32 (Vector128/256 → Vector512 zero-extension) |

Generated dispatch (example)

if (Avx512Vbmi.IsSupported) {
    if (n == 8)  { SortSimd8_byte(span); return; }   // VBMI Vector512
    if (n == 16) { SortSimd16_byte(span); return; }  // VBMI Vector512
    if (n == 48) { SortSimd48_byte(span); return; }  // VBMI Vector512
}
else if (Avx2.IsSupported) {
    if (n == 8)  { SortSimdAvx2_8_byte(span); return; }   // AVX2 Vector256
    if (n == 16) { SortSimdAvx2_16_byte(span); return; }  // AVX2 Vector256
}
if (AdvSimd.Arm64.IsSupported) { ... }
// scalar fallback

Benchmark Results (AMD EPYC 9V74, AVX-512 VBMI)

byte

| Size | ArraySort | GeneratedSort | Speedup |
| --- | --- | --- | --- |
| 23 | 1,028 ns | 55 ns | 19x |
| 27 | 1,250 ns | 53 ns | 24x |
| 28 | 1,415 ns | 54 ns | 26x |
| 32 | 1,516 ns | 54 ns | 28x |
| 34 | 1,759 ns | 64 ns | 27x |

sbyte

| Size | ArraySort | GeneratedSort | Speedup |
| --- | --- | --- | --- |
| 27 | 1,355 ns | 57 ns | 24x |
| 28 | 1,495 ns | 58 ns | 26x |
| 32 | 1,598 ns | 58 ns | 28x |
| 38 | 2,160 ns | 68 ns | 32x |

All elements fit in a single Vector512<byte> with PermuteVar64x8 shuffles. Zero allocations. On CPUs without VBMI, sizes 8-32 fall back to AVX2.
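
As an illustration of the single-register approach, the sketch below sorts 8 bytes inside one `Vector512<byte>`, using `PermuteVar64x8` to fetch each lane's comparison partner and `Min`/`Max` to perform the compare-exchange. This is a sketch under stated assumptions, not the generator's output: it uses an odd-even transposition network (8 rounds) because it is easy to verify, whereas the generated code presumably uses a shallower network; `VbmiSketch` and `Sort8` are hypothetical names; and `Span.Sort` stands in on hardware without VBMI. It assumes .NET 8 or later on a little-endian target.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class VbmiSketch
{
    // Even rounds compare lanes (0,1),(2,3),(4,5),(6,7); odd rounds compare (1,2),(3,4),(5,6).
    private static readonly Vector512<byte> EvenPartners = Partners(0);
    private static readonly Vector512<byte> OddPartners = Partners(1);
    private static readonly Vector512<byte> EvenKeepMin = KeepMin(0);
    private static readonly Vector512<byte> OddKeepMin = KeepMin(1);

    // Index vector: lane i holds the lane it is compared with; unpaired lanes map to themselves.
    private static Vector512<byte> Partners(int start)
    {
        var idx = new byte[64];
        for (int i = 0; i < 64; i++) idx[i] = (byte)i;
        for (int i = start; i + 1 < 8; i += 2)
        {
            idx[i] = (byte)(i + 1);
            idx[i + 1] = (byte)i;
        }
        return Vector512.Create<byte>(idx);
    }

    // Mask: 0xFF in the lower lane of each pair (that lane keeps the minimum), 0 elsewhere.
    private static Vector512<byte> KeepMin(int start)
    {
        var mask = new byte[64];
        for (int i = start; i + 1 < 8; i += 2) mask[i] = 0xFF;
        return Vector512.Create<byte>(mask);
    }

    // One network layer: every paired lane ends up with min (lower lane) or max (upper lane).
    private static Vector512<byte> CompareExchange(
        Vector512<byte> v, Vector512<byte> partners, Vector512<byte> keepMin)
    {
        Vector512<byte> other = Avx512Vbmi.PermuteVar64x8(v, partners);
        return Vector512.ConditionalSelect(keepMin, Vector512.Min(v, other), Vector512.Max(v, other));
    }

    // Hypothetical entry point: sorts exactly 8 bytes in place, ascending.
    public static void Sort8(Span<byte> span)
    {
        if (span.Length != 8) throw new ArgumentException("expected 8 elements", nameof(span));
        if (!Avx512Vbmi.IsSupported)
        {
            span.Sort(); // stand-in so the sketch runs without VBMI
            return;
        }

        // Load 8 bytes into lanes 0..7; ToVector256/ToVector512 zero the upper lanes.
        ulong bits = BinaryPrimitives.ReadUInt64LittleEndian(span);
        Vector512<byte> v = Vector128.CreateScalar(bits).AsByte().ToVector256().ToVector512();

        // Odd-even transposition sort: n rounds are enough for n elements.
        for (int round = 0; round < 8; round++)
        {
            v = (round & 1) == 0
                ? CompareExchange(v, EvenPartners, EvenKeepMin)
                : CompareExchange(v, OddPartners, OddKeepMin);
        }

        // Lanes 8..63 were never paired, so only the low 8 bytes need to be written back.
        BinaryPrimitives.WriteUInt64LittleEndian(span, v.AsUInt64().ToScalar());
    }
}
```

Because the padding lanes always map to themselves in the index vector, the zero padding never takes part in a comparison and cannot leak into the sorted output.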

Testing

All 455 tests pass on all four CI platforms (Ubuntu x64, Ubuntu ARM, Windows, macOS).

Make AVX-512 VBMI the primary SIMD path for all byte/sbyte sizes (8-64),
with AVX2 as a fallback for sizes 8-32 on hardware without VBMI support.

Previously, byte/sbyte sizes ≤32 only had an AVX2 path using complex
cross-lane shuffling (Permute2x128 + dual vpshufb + Or). Sizes 33-64
already used the simpler VBMI PermuteVar64x8. This change makes VBMI
the preferred path for all sizes, following the same primary/fallback
pattern used for short (AVX-512 BW → AVX2) and double (AVX-512F → AVX2).

Changes in SimdX86Emitter.cs:
- GetGuardCondition: always returns Avx512Vbmi for 1-byte types
- CanEmitAvx2Fallback: add byte/sbyte support for sizes 8-32
- EmitAvx2Fallback: route byte types to new EmitByteAvx2 method
- Emit: route all byte sizes to EmitByteAvx512Vbmi
- Rename EmitByte → EmitByteAvx2 (AVX2 fallback, SortSimdAvx2_ naming)
- Extend EmitByteAvx512Vbmi to handle sizes 8-32 (Vector128/256 → Vector512)

Fixes #29

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 3, 2026 01:01

Keep our updated case 1 comment (VBMI primary for all sizes) and
take main's updated case 2 (64 elements with PermuteVar32x16x2).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR updates the x86 SIMD code generator so byte/sbyte sorting networks prefer AVX-512 VBMI across the full supported size range, while still generating AVX2 fallbacks for smaller sizes on machines without VBMI. It fits into the generator’s existing pattern of “newer AVX-512 primary path, older AVX2 fallback” used for other element widths.

Changes:

  • Switched byte/sbyte primary x86 guard/dispatch from size-dependent AVX2-or-VBMI logic to VBMI for all supported byte widths.
  • Added AVX2 fallback emission for byte/sbyte sizes 8-32, including distinct SortSimdAvx2_* method generation.
  • Extended the VBMI byte emitter to load/store sizes 8-32 by zero-extending smaller vectors into Vector512<byte>.
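
The zero-extending load and narrowing store described in the last bullet can be sketched for the 16-element case as follows. This is an assumption about the general shape, not a copy of the emitted code: `ZeroExtendSketch`, `Load16`, and `Store16` are hypothetical names, and the portable `Vector128`/`Vector256`/`Vector512` helpers are used, which also work (unaccelerated) without AVX-512. It assumes .NET 8 or later.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ZeroExtendSketch
{
    // Loads 16 bytes into lanes 0..15 of a Vector512<byte>; lanes 16..63 are zero.
    public static Vector512<byte> Load16(ReadOnlySpan<byte> src) =>
        Vector128.Create<byte>(src).ToVector256().ToVector512();

    // Writes lanes 0..15 back to dst and discards the padding lanes.
    public static void Store16(Vector512<byte> v, Span<byte> dst) =>
        v.GetLower().GetLower().CopyTo(dst);
}
```

A 32-element variant would start from `Vector256.Create<byte>(src)` and drop one `ToVector256`/`GetLower` step.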

Comment thread SortingNetworks.Generators/SimdX86Emitter.cs
Comment thread SortingNetworks.Generators/SimdX86Emitter.cs
jonathanpeppers and others added 3 commits May 2, 2026 20:10
- Add SimdCode_8Bit_HasAvx2Fallback generator test for byte/sbyte AVX2
  fallback (sizes 8, 16, 28, 32) verifying both VBMI and AVX2 dispatch
- Add (32, byte) to SimdCode_Compiles InlineData
- Add SortingNetwork(32) for byte and sbyte in GeneratedSorters.cs
- Add Sort_32Elements_Byte and Sort_32Elements_SByte stress tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Keep size 32 sbyte from branch, add sizes 48/64 sbyte from main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Update SIMD example to show AVX-512 VBMI PermuteVar64x8 (was AVX2)
- Update Design section: VBMI primary, AVX2 fallback for byte/sbyte
- Update AVX-512 benchmarks with VBMI results across sizes 23-38

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jonathanpeppers jonathanpeppers merged commit 403d245 into main May 3, 2026
6 checks passed
@jonathanpeppers jonathanpeppers deleted the jonathanpeppers/avx512-vbmi-byte-sbyte branch May 3, 2026 02:36
Development

Successfully merging this pull request may close these issues.

SIMD: AVX-512 VBMI for byte/sbyte (single Vector512, simplify AVX2 path)