GH-47769: [C++] SVE dynamic dispatch#49756

Open
AntoinePrv wants to merge 19 commits into apache:main from AntoinePrv:sve-dispatch

Conversation

@AntoinePrv
Contributor

@AntoinePrv AntoinePrv commented Apr 15, 2026

Rationale for this change

Just like we dynamically dispatch to AVX2 on x86 CPUs, we want to dynamically dispatch to more advanced SIMD extensions on ARM64 chips.

What changes are included in this PR?

  • A new macro to enable selecting the SVE version at runtime
  • Detection of the ARM64 CPU features available at runtime
  • Adding SVE to the dynamic dispatch for the bit unpacking algorithms

Are these changes tested?

Are there any user-facing changes?

No.

@AntoinePrv AntoinePrv changed the title Sve dynamic dispatch GH-47769: [C++] Sve dynamic dispatch Apr 15, 2026
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@github-actions

⚠️ GitHub issue #47769 has been automatically assigned in GitHub to PR creator.

@AntoinePrv AntoinePrv force-pushed the sve-dispatch branch 4 times, most recently from 2925550 to ff8566b on April 21, 2026 12:47
@AntoinePrv AntoinePrv marked this pull request as ready for review April 21, 2026 14:01
@pitrou pitrou changed the title GH-47769: [C++] Sve dynamic dispatch GH-47769: [C++] SVE dynamic dispatch Apr 21, 2026
Member

@pitrou pitrou left a comment


Thanks for doing this! Here are a number of comments, questions, and suggestions.

Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
Comment thread cpp/cmake_modules/DefineOptions.cmake
Comment thread cpp/cmake_modules/DefineOptions.cmake
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake
Comment thread cpp/src/arrow/util/bpacking.cc
@@ -17,8 +17,10 @@

#if defined(ARROW_HAVE_NEON)
# define UNPACK_PLATFORM unpack_neon
Member


Can we just include bpacking_simd_internal.h and reuse the UNPACK_ARCH128 macro?

Contributor Author


Possibly, though I thought it best that the macro is #undef'd at the end of the header (which makes it useless here).
We could make the name more explicit (ARROW_BPACKING_UNPACK_ARCH128) and stop undefining it.

Comment on lines 27 to +31
#if defined(ARROW_HAVE_NEON)
# define UNPACK_ARCH128 unpack_neon
#elif defined(ARROW_HAVE_SSE4_2)
# define UNPACK_ARCH128 unpack_sse4_2
#endif
Member


Relying on ARROW_HAVE_NEON etc. is why we need the "128 alt" case, right?

Perhaps we can also depend on which target the file is being compiled for.
For example we could have:

macro(append_runtime_sve128_src SRCS SRC)
  if(ARROW_HAVE_RUNTIME_SVE128)
    list(APPEND ${SRCS} ${SRC})
    set_source_files_properties(${SRC}
                                PROPERTIES COMPILE_OPTIONS "${ARROW_SVE128_FLAGS}"
                                           COMPILE_DEFINITIONS
                                           "ARROW_COMPILING_FOR_SVE128")
  endif()
endmacro()

and then:

#if defined(ARROW_COMPILING_FOR_SVE128)
#  define UNPACK_ARCH128 unpack_sve128
#elif defined(ARROW_HAVE_NEON)
#  define UNPACK_ARCH128 unpack_neon
#elif defined(ARROW_HAVE_SSE4_2)
#  define UNPACK_ARCH128 unpack_sse4_2
#endif

Contributor Author


The issue is that we need the file compiled twice on ARM (Neon + SVE128).
I don't think that is possible directly in CMake. The workaround would be to copy the file into the build tree and then compile each copy with different flags.
Would a CMake-only solution be satisfactory?

Member


The issue is that we need the file compiled twice on ARM (Neon + SVE128).
I don't think that is possible directly in CMake.

The easy workaround is to have the same .h file included in two different stub .cc files.

For example have bpacking_simd128_internal.h included by both bpacking_neon.cc and bpacking_sve128.cc.

Comment thread cpp/src/arrow/util/bpacking_test.cc
Comment thread cpp/src/arrow/util/cpu_info.h
Comment thread cpp/src/arrow/util/dispatch_internal.h
@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Apr 21, 2026
@AntoinePrv
Contributor Author

@pitrou I definitely agree about the duplication across the different files; it's pretty tedious.
I think fixing it would make this PR too large, but we should definitely think of something, including providing some CMake utilities in xsimd.

@pitrou pitrou added the CI: Extra: C++ label Apr 22, 2026
@pitrou
Member

pitrou commented Apr 22, 2026

Something isn't quite right on ARM64 Ubuntu and ARM64 macOS. -march=armv8-a+sve is added to the default compiler flags even though we have ARROW_SIMD_LEVEL=NEON.

Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake
return dispatch.func(in, out, opts);
#endif
auto constexpr kImplementations = UnpackDynamicFunction<Uint>::implementations();
if constexpr (kImplementations.size() == 1) {
Member


Is this condition actually useful? I guess it's a shortcut, but it's not obvious that it applies to common cases (x86 or ARM with default SIMD options).

At worst, this could be added generically to DynamicDispatch instead. But I doubt it's worth it.

Contributor Author


It is worth it to avoid additional #ifdefs: for instance, on macOS there is only Neon and no SVE, so there is no need for dynamic dispatch.
Previously we'd exclude the Neon version from the dynamic dispatch, test #ifdef ARROW_HAVE_NEON, and go straight to the Neon implementation.

At worst, this could be added generically to DynamicDispatch instead. But I doubt it's worth it.

This is actually done in GH-49840, so it could go either way here (we'd need to adapt whichever PR is not merged first).

Member


This is actually done in GH-49840, so it could go either way here

That PR might prove difficult to adapt for all the lousy compilers we have to support, so I'd rather focus on this one first :)

Comment thread cpp/src/arrow/util/bpacking_benchmark.cc Outdated
@pitrou pitrou added the CI: Extra: C++ label Apr 23, 2026
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
->ArgsProduct(kBitWidthsNumValues64);
#endif

#if defined(ARROW_HAVE_RUNTIME_SVE128)
Member


I wonder if there's an easy way to reduce the duplication we're doing for each runtime SIMD level?

For example if we could write something like:

BENCHMARK_SIMD_UNPACK(Bool, bool, SVE128, Sve128, sve128);

and it would expand to:

BENCHMARK_CAPTURE(BM_UnpackBool, Sve128Unaligned, false, &bpacking::unpack_sve128<bool>,
                  !CpuInfo::GetInstance()->IsSupported(CpuInfo::SVE128),
                  "Sve128 not available")
    ->ArgsProduct(kBitWidthsNumValues<bool>);

Contributor Author


You mean with a macro?

Member


Yes!

Comment thread cpp/src/arrow/util/bpacking_test.cc Outdated
@pitrou
Member

pitrou commented Apr 23, 2026

@AntoinePrv Is it possible to run some ARM benchmarks and paste the results somewhere once you're satisfied with the PR?

AntoinePrv and others added 2 commits April 23, 2026 15:55
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot removed the CI: Extra: C++ label Apr 23, 2026
@AntoinePrv
Contributor Author

@pitrou here are some benchmarks over the [0, 500] range of integer counts to unpack that Arrow operates on.

As before, we can sometimes be heavily penalized at small sizes.
Surprisingly (annoyingly?), for some bit widths (e.g. 3, 5, 6, 7 on SVE256) the world is upside down: scalar does best, then Neon, then SVE (this holds in the [0, 500] range but not for larger input buffers).

@pitrou
Member

pitrou commented Apr 23, 2026

Surprisingly (annoyingly?), for some bit widths (e.g. 3, 5, 6, 7 on SVE256) the world is upside down: scalar does best, then Neon, then SVE (this holds in the [0, 500] range but not for larger input buffers).

Perhaps because of larger vectors and a slow epilogue?

@pitrou
Member

pitrou commented Apr 23, 2026

But it's impressive that SVE128 is always significantly better than NEON. That's rather good news, given that most SVE implementations have 128-bit vectors. @cyb70289

@pitrou
Member

pitrou commented Apr 23, 2026

@ursabot please benchmark

@rok
Member

rok commented Apr 23, 2026

Benchmark runs are scheduled for commit cbf526f. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Apr 23, 2026

Silly me, I started the continuous benchmarking suite but our ARM platform there (arm64-t4g-2xlarge) uses Graviton 2 CPUs which don't support SVE.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit cbf526f.

There were 14 benchmark results indicating a performance regression:

The full Conbench report has more details.

@cyb70289
Contributor

But it's impressive that SVE128 is always significantly better than NEON. That's rather good news, given that most SVE implementations have 128-bit vectors. @cyb70289

Interesting. It should not happen if both are using equivalent SIMD operations.
I tested one case, BM_UnpackBool/{Neon,Sve128}Unaligned/1/32, on a Neoverse N2 server: SVE shows double the performance of Neon. But from the profile, it looks like the Neon code does not inline frequently called functions such as load_val_as, which introduces high overhead.

Benchmark

BM_UnpackBool/NeonUnaligned/1/32              11.1 ns         11.0 ns     63329847 items_per_second=2.89667G/s
BM_UnpackBool/Sve128Unaligned/1/32            7.02 ns         7.02 ns     99743767 items_per_second=4.55851G/s

Neon hotspot shows load_val_as is not inlined

+   45.20%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_width<1, arrow::internal::bpacking::KernelNeon, bool>(unsigned char const
+   31.03%  arrow-bpacking-  libarrow.so.2500.0.0      [.] xsimd::batch<unsigned char, xsimd::neon64> arrow::internal::bpacking::load_val_as<unsigned int, xsimd::neon64>(u
+    7.08%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::MediumKernel<arrow::internal::bpacking::KernelTraits<bool, 1, xsimd::neon64>, ar
+    6.38%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::MediumKernel<arrow::internal::bpacking::KernelTraits<bool, 1, xsimd::neon64>, ar
+    2.63%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_neon<bool>(unsigned char const*, bool*, arrow::internal::UnpackOptions co
+    1.95%  arrow-bpacking-  arrow-bpacking-benchmark  [.] arrow::internal::(anonymous namespace)::BM_UnpackBool(benchmark::State&, bool, void (*)(unsigned char const*, bo
+    1.82%  arrow-bpacking-  libarrow.so.2500.0.0      [.] xsimd::batch<unsigned char, xsimd::neon64> arrow::internal::bpacking::load_val_as<unsigned int, xsimd::neon64>(u
+    1.37%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_width<1, arrow::internal::bpacking::KernelNeon, bool>(unsigned char const

No such issue in sve128 code path

+   89.89%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_width<1, arrow::int
+    4.18%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_sve128<bool>(unsign
+    3.11%  arrow-bpacking-  arrow-bpacking-benchmark  [.] arrow::internal::(anonymous namespace)::BM_UnpackBool(benc
+    1.47%  arrow-bpacking-  libarrow.so.2500.0.0      [.] void arrow::internal::bpacking::unpack_width<1, arrow::int

@AntoinePrv
Contributor Author

That is interesting; I was also investigating a std::memcpy that was not inlined in the epilogue.
