Summary
I’ve been evaluating the Rust blake3 crate for potential use. Data, code, and benchmarks are available here.
Results
- x86: blake3 is consistently faster than blake2.
- ppcle / s390x / aarch64: blake3 is generally slower than blake2.
- Rayon parallelism sometimes improves results.
- In some cases, performance is still worse than blake2 even with Rayon (example).
Questions
- Is this expected behavior on non-x86 architectures (e.g., SIMD gaps, missing intrinsics)?
- Or is my sample code / benchmarking harness flawed?
- Are there recommended tuning options or build flags for ppcle, s390x, and aarch64?
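
To make the harness question easier to check, here is a minimal sketch of the kind of timing loop I would expect to give stable numbers. It is not my actual benchmark code. It warms up first, wraps inputs and outputs in `black_box` so the optimizer cannot drop the work, and reports the best of several runs. The names `best_of`, `mib_per_sec`, and `fnv1a64` are hypothetical, and `fnv1a64` is only a stand-in hash so the sketch builds without dependencies. In a real comparison the closure would call the hash under test, for example `|d| blake3::hash(d)`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in hash (64-bit FNV-1a), here only so the sketch has no dependencies.
/// ASSUMPTION: replace with the hash under test, e.g. `|d| blake3::hash(d)`.
fn fnv1a64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Calls `hash` on `input` for `warmup` untimed runs, then `runs` timed runs,
/// and returns the fastest timed run.
fn best_of<T>(
    input: &[u8],
    warmup: usize,
    runs: usize,
    mut hash: impl FnMut(&[u8]) -> T,
) -> Duration {
    for _ in 0..warmup {
        black_box(hash(black_box(input)));
    }
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the compiler from optimizing the call away.
        black_box(hash(black_box(input)));
        best = best.min(start.elapsed());
    }
    best
}

/// Converts a byte count and elapsed time into MiB per second.
fn mib_per_sec(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

fn main() {
    // Several input sizes, since small and large inputs can behave differently.
    for size in [1 << 10, 1 << 16, 1 << 22] {
        let input = vec![0xA5u8; size];
        let best = best_of(&input, 3, 10, fnv1a64);
        println!("{:>8} bytes: {:>10.1} MiB/s", size, mib_per_sec(size, best));
    }
}
```

This should be run with `cargo run --release`, since debug builds would distort any comparison. If my real harness differs from this in a way that matters (no warmup, no `black_box`, only one input size), that may explain some of the numbers.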