Add WASM SIMD C implementation #459

mimi89999 · 2025-03-29T20:20:50Z

~~I tested it in NodeJS and it seems 2x slower 😞~~

Update: It's faster!

mimi89999 · 2025-03-29T20:43:12Z

The bad performance was caused by a small bug. Now the WASM SIMD implementation is barely faster than the portable one.

monoid · 2025-03-29T21:11:48Z

What is your result with -O3?

mimi89999 · 2025-03-29T21:14:02Z

> What is your result with -O3?

Exactly the same

mimi89999 · 2025-03-29T22:17:03Z

Here is my benchmark. It's very primitive:

#include "blake3.h"

#include <stdio.h>
#include <string.h>

int main() {
  char hello[] = "6OMPP7PLnNz5EzdPaBR7QCcqddoaFBKhFSixrPfZDiVvtuAg7haIm66xafc9nRxlDAIlgIg7VGQw77La6dA3g2qDZyHH9OnoKgTSwfCwIujTXCnN6NSG2RbAyLf8M1YMNGvZsrhrvODEaUxwpvKgKRpVXdzt8ber6aYr9PX95De4zBjHBuGaPh2YdmnYyPhf5NmeHnf42UUn8R2NI7tYM4PKgucZXonqNb3e2J0Uad9TYiJ1dVIO8qsa4ZqGOEeJfKuzwRmY74rNyPWq6rHIhC6BwJk02buI3S2JxEfL0ZLnjo0gMqsFhETfj3Mrm83iwFz7oIEoMs0tGAO4BOwvNQ1vygjDHoAqRb7XDi7wvB96jlVcbo93wCzQA8xwhxjlgxxgzbXUhzq1BeFQu5ajG3QiUs4MlBrT3hoUFcHexfQg7xa39iGYd3krhdNWkahKKR3wB4O8ut71hFHXHM5JEsAGcF59gqI9qKWvTNhANr2t11n7l06CoMqDvGMmcXri";

  uint8_t output[BLAKE3_OUT_LEN];

  for (unsigned int i = 0; i < 100000000; i++) {
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    blake3_hasher_update(&hasher, hello, strlen(hello));

    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
  }

  for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
    printf("%02x", output[i]);
  }
  printf("\n");

  return 0;
}

michel@debian:~/git/BLAKE3/c$ emcc -O2 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ time nodejs test.js 
b0e1430ecdd18b09c5834a6ddbbc741dcf01df6ed0a1552cb8a7c0eeb31de404

real	2m27,537s
user	2m27,510s
sys	0m0,037s
michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ time nodejs test.js 
b0e1430ecdd18b09c5834a6ddbbc741dcf01df6ed0a1552cb8a7c0eeb31de404

real	1m56,135s
user	1m56,128s
sys	0m0,017s

monoid · 2025-03-29T22:18:50Z

Try moving strlen out of the for loop.
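
A minimal sketch of that suggestion. It assumes a stand-in `toy_hash` (a hypothetical 32-bit FNV-1a helper, not part of BLAKE3) in place of the hasher calls, so the snippet compiles on its own. The length is computed once before the loop instead of on every iteration:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the BLAKE3 calls: 32-bit FNV-1a. */
uint32_t toy_hash(const void *data, size_t len) {
  const unsigned char *p = data;
  uint32_t h = 2166136261u; /* FNV offset basis */
  for (size_t i = 0; i < len; i++) {
    h ^= p[i];
    h *= 16777619u; /* FNV prime */
  }
  return h;
}

/* Hash `msg` `rounds` times and return the last digest (0 if rounds == 0).
   strlen runs once, outside the loop. */
uint32_t hash_rounds(const char *msg, unsigned int rounds) {
  const size_t len = strlen(msg); /* hoisted out of the loop */
  uint32_t out = 0;
  for (unsigned int i = 0; i < rounds; i++) {
    out = toy_hash(msg, len);
  }
  return out;
}
```

An optimizing compiler may already hoist the call when it can prove the buffer is not modified, so the measurable effect depends on the build.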

mimi89999 · 2025-03-29T22:26:43Z

@monoid That should not affect the difference between the two implementations. Could you check how it compares with the Rust WASM and Rust WASM SIMD implementations?

monoid · 2025-03-30T04:44:03Z

Same problem with the Rust implementation. You need longer input to see a difference. Start with something around 8-16 KiB.
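
To illustrate why input length matters: BLAKE3 splits its input into 1024-byte chunks, and SIMD backends can hash several chunks in parallel. The sketch below counts the chunks for a given length and checks whether there are enough to give each lane its own chunk. The helper names and the choice of 4 lanes (128-bit vectors of 32-bit words) are illustrative assumptions, not part of the BLAKE3 API.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_LEN 1024u /* BLAKE3 chunk size in bytes */

/* Number of chunks an input of `len` bytes occupies. An empty input
   still counts as one chunk. Illustrative helper, not a BLAKE3 function. */
size_t chunk_count(size_t len) {
  if (len == 0) {
    return 1;
  }
  return (len + CHUNK_LEN - 1) / CHUNK_LEN; /* round up */
}

/* True if the input has at least one chunk per lane. */
bool fills_lanes(size_t len, size_t lanes) {
  return chunk_count(len) >= lanes;
}
```

Under these assumptions an 8 KiB input yields 8 chunks and fills 4 lanes, whereas any input shorter than 1 KiB is a single chunk and leaves the other lanes idle.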

mimi89999 · 2025-03-30T11:41:03Z

I see that the performance is close to native on x86 when the native BLAKE3 build is limited to SSE2.

monoid · 2025-03-30T11:47:12Z

Sounds great! Meanwhile, I've managed to start a quickcheck to compare your implementation with a native one.
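
The idea behind such a comparison can be sketched in C as a differential test: feed random inputs to a reference routine and an optimized one, and count disagreements. Everything below is hypothetical. The two byte-sum functions merely stand in for the portable and SIMD code paths, and this is not the quickcheck setup mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_INPUT 4096

/* Reference version: adds bytes one at a time. */
uint32_t sum_reference(const uint8_t *p, size_t len) {
  uint32_t acc = 0;
  for (size_t i = 0; i < len; i++) {
    acc += p[i];
  }
  return acc;
}

/* "Optimized" version: four independent accumulators plus a scalar tail. */
uint32_t sum_unrolled(const uint8_t *p, size_t len) {
  uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
  size_t i = 0;
  for (; i + 4 <= len; i += 4) {
    a0 += p[i];
    a1 += p[i + 1];
    a2 += p[i + 2];
    a3 += p[i + 3];
  }
  for (; i < len; i++) {
    a0 += p[i];
  }
  return a0 + a1 + a2 + a3;
}

/* Runs `trials` random inputs of 0..MAX_INPUT bytes through both versions
   and returns how many times they disagreed. */
size_t count_mismatches(unsigned int seed, size_t trials) {
  static uint8_t buf[MAX_INPUT];
  size_t mismatches = 0;
  srand(seed);
  for (size_t t = 0; t < trials; t++) {
    size_t len = (size_t)rand() % (MAX_INPUT + 1);
    for (size_t i = 0; i < len; i++) {
      buf[i] = (uint8_t)(rand() & 0xff);
    }
    if (sum_reference(buf, len) != sum_unrolled(buf, len)) {
      mismatches++;
    }
  }
  return mismatches;
}
```

Random lengths matter here because they exercise the tail handling, which is where unrolled and vectorized code most often goes wrong.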

mimi89999 · 2025-04-01T05:20:31Z

Here are some benchmarks:

michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 1.950000
Data size: 1048576
Rounds: 1000
Time per round: 0.001950
michel@debian:~/git/BLAKE3/c$ emcc -O2 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 2.253000
Data size: 1048576
Rounds: 1000
Time per round: 0.002253
michel@debian:~/git/BLAKE3/c$ vim test.c
michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 19.468000
Data size: 1048576
Rounds: 10000
Time per round: 0.001947
michel@debian:~/git/BLAKE3/c$ emcc -O2 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 22.336000
Data size: 1048576
Rounds: 10000
Time per round: 0.002234
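
The updated `test.c` is not shown in this thread. The following is only a guess at the shape of a harness that would print output in this format. It assumes a stand-in `toy_hash` (hypothetical FNV-1a, not BLAKE3), `clock()` for timing, and a static 1 MiB buffer. None of these are confirmed by the source.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define DATA_SIZE 1048576u /* 1 MiB, matching "Data size" above */

typedef struct {
  double seconds;      /* CPU time for all rounds */
  size_t data_size;    /* bytes hashed per round */
  unsigned int rounds;
  uint32_t digest;     /* digest of the last round */
} bench_result;

/* Hypothetical stand-in for the BLAKE3 calls: 32-bit FNV-1a. */
uint32_t toy_hash(const uint8_t *p, size_t len) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; i++) {
    h ^= p[i];
    h *= 16777619u;
  }
  return h;
}

/* Hashes a deterministic 1 MiB buffer `rounds` times and times the loop. */
bench_result run_bench(unsigned int rounds) {
  static uint8_t data[DATA_SIZE]; /* static keeps 1 MiB off the stack */
  for (size_t i = 0; i < DATA_SIZE; i++) {
    data[i] = (uint8_t)(i * 31u + 7u);
  }
  bench_result r = {0.0, DATA_SIZE, rounds, 0};
  clock_t start = clock();
  for (unsigned int i = 0; i < rounds; i++) {
    data[0] = (uint8_t)i; /* vary the input so the call is not hoisted */
    r.digest = toy_hash(data, DATA_SIZE);
  }
  r.seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
  return r;
}

/* Prints the result in the same layout as the output above. */
void print_result(const bench_result *r) {
  printf("Time: %f\n", r->seconds);
  printf("Data size: %zu\n", r->data_size);
  printf("Rounds: %u\n", r->rounds);
  printf("Time per round: %f\n", r->rounds ? r->seconds / r->rounds : 0.0);
}
```

Calling `run_bench(1000)` and passing the result to `print_result` would produce the four lines. Note that `clock()` reports CPU time rather than wall-clock time.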

mimi89999 · 2025-04-01T06:12:12Z

@oconnor663 After some benchmarking, I can't replicate the huge performance improvements expected from SIMD instructions on my machine. Nevertheless, it's still roughly a 15% improvement (about 1.95 ms versus 2.25 ms per 1 MiB round), so I think it's worth adding. Maybe on other machines the difference will be much bigger. 🤔
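
For reference, the speedup and throughput can be derived from the reported timings with a little arithmetic. The helper names below are illustrative only.

```c
#include <stddef.h>

/* Throughput in MiB/s given bytes hashed per round and seconds per round. */
double throughput_mib_s(size_t bytes_per_round, double seconds_per_round) {
  return ((double)bytes_per_round / (1024.0 * 1024.0)) / seconds_per_round;
}

/* How many times faster the fast run is than the slow run. */
double speedup(double slow_seconds, double fast_seconds) {
  return slow_seconds / fast_seconds;
}
```

With the figures above, `speedup(2.253, 1.950)` and `speedup(22.336, 19.468)` both come out at roughly 1.15, and the 1000-round run corresponds to about 513 MiB/s with SIMD versus about 444 MiB/s without.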

mimi89999 mentioned this pull request Mar 29, 2025

add an implementation using Wasm SIMD #187

Closed

Add WASM SIMD C implementation

ec6a10e

mimi89999 force-pushed the wasm_simd_c branch from 353ef60 to ec6a10e Compare March 30, 2025 11:38

mimi89999 changed the title ~~WIP: Add WASM SIMD C implementation~~ Add WASM SIMD C implementation Mar 30, 2025

mimi89999 mentioned this pull request Mar 31, 2025

Consider switching to a WASM SIMD implementation Daninet/hash-wasm#66

Open
