Description
I noticed the upcoming 4.0 release has removed the u32_backend and u64_backend features in favor of checking the target pointer size, which seems very sensible.* We know the u32_backend can be drastically faster than the u64_backend in some configurations:
On my Xperia 10 with SailfishOS (armv7hl user space on aarch64 kernel, yes I know, rustflags = "-C target-feature=+v7,+neon"):
- u32_backend + std: [644.25 us 646.59 us 649.28 us]
- u64_backend + std: [6.4522 ms 6.4685 ms 6.4880 ms]
…but not all:
On a Cortex-M3 (Texas Instruments CC2538):
- u32_backend without alloc: 470507 cycles
- u32_backend with alloc (wut): 878507 cycles
- u64_backend without alloc: 494078 cycles
- u64_backend with alloc: 470502 cycles (the fastest)
(signalapp/libsignal#453 (comment))
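Aside, since that default inference comes up again at the end: as I understand it, checking the target pointer size boils down to something like a cfg on `target_pointer_width` (directly or via a build script). Here's a minimal sketch of that style of selection; the module names and limb layouts are only illustrative, not the crate's actual code:

```rust
// Sketch: pick a serial field representation at compile time based on the
// target's pointer width. (Illustrative only; not curve25519-dalek's layout.)

#[cfg(target_pointer_width = "64")]
mod field {
    // 64-bit style backend: a few wide limbs stored in u64s.
    pub type Limb = u64;
    pub const NUM_LIMBS: usize = 5;
}

#[cfg(target_pointer_width = "32")]
mod field {
    // 32-bit style backend: more, narrower limbs stored in u32s.
    pub type Limb = u32;
    pub const NUM_LIMBS: usize = 10;
}

fn main() {
    println!(
        "limbs: {} x {} bytes",
        field::NUM_LIMBS,
        core::mem::size_of::<field::Limb>()
    );
}
```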
Today I finally got around to running some 32-bit microbenchmarks (with help from a coworker) on a new Android phone, a Pixel 6a. I think the results are still interesting even on a 64-bit SoC because, well, running armv7 code on an aarch64 device can't magically use 64-bit registers. We consistently found that the u32_backend was 5-15% slower than the u64_backend when compiling for that particular armv7 target.
EDIT: Later, I tested on a 32-bit Moto X and got approximately the same results.
All of this amounts to differences of at most a tenth of a millisecond on any particular operation (we tested key agreement and signing + verifying), so it's not like it's going to make or break the crate. But it did make me think twice about "u32_backend is for 32-bit CPUs, u64_backend is for 64-bit CPUs". Is it worth restoring feature flags to override the default inference?
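If anyone wants to poke at this themselves, the signing + verifying half is just a small criterion harness along these lines (a minimal sketch; it assumes ed25519-dalek 1.x's Keypair/Signer/Verifier API and a criterion bench target with `harness = false`, so adjust for whatever versions you pin):

```rust
// benches/sign_verify.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ed25519_dalek::{Keypair, Signature, Signer, Verifier};
use rand::rngs::OsRng;

fn sign_verify(c: &mut Criterion) {
    let mut csprng = OsRng;
    let keypair = Keypair::generate(&mut csprng);
    let message = b"criterion microbenchmark message";

    // Time signing alone.
    c.bench_function("sign", |b| b.iter(|| keypair.sign(black_box(message))));

    // Time verification of a fixed signature.
    let signature: Signature = keypair.sign(message);
    c.bench_function("verify", |b| {
        b.iter(|| keypair.verify(black_box(message), &signature).unwrap())
    });
}

criterion_group!(benches, sign_verify);
criterion_main!(benches);
```

Cross-compile it for the armv7 target in question and compare the two backends on-device.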
* Especially given that the previous config made it easy for a downstream crate to depend on u64_backend implicitly as a default feature, thus removing the chance for the final program to choose the u32_backend instead.
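To spell that footnote out, the pre-4.0 trap looked roughly like this (illustrative manifests; `some-library` and `final-binary` are made-up names):

```toml
# --- some-library/Cargo.toml: picks up the defaults, which include u64_backend ---
[dependencies]
curve25519-dalek = "3"

# --- final-binary/Cargo.toml: tries to opt into the u32_backend instead ---
[dependencies]
curve25519-dalek = { version = "3", default-features = false, features = ["u32_backend"] }

# Cargo features are additive across the whole dependency graph, so u64_backend
# stays enabled via some-library; the override only works if every crate in the
# graph also sets default-features = false.
```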