Add WASM SIMD C implementation #459

Open
wants to merge 1 commit into
base: master

Conversation

mimi89999

@mimi89999 mimi89999 commented Mar 29, 2025

I tested it in NodeJS and it seems 2x slower 😞

Update: It's faster!

@mimi89999
Author

The bad performance was caused by a small bug. Now the WASM SIMD implementation is barely faster than the portable one.

@monoid
Contributor

monoid commented Mar 29, 2025

What is your result with -O3?

@mimi89999
Author

> What is your result with -O3?

Exactly the same

@mimi89999
Author

Here is my benchmark. It's very primitive:

#include "blake3.h"

#include <stdio.h>
#include <string.h>

int main() {
  char hello[] = "6OMPP7PLnNz5EzdPaBR7QCcqddoaFBKhFSixrPfZDiVvtuAg7haIm66xafc9nRxlDAIlgIg7VGQw77La6dA3g2qDZyHH9OnoKgTSwfCwIujTXCnN6NSG2RbAyLf8M1YMNGvZsrhrvODEaUxwpvKgKRpVXdzt8ber6aYr9PX95De4zBjHBuGaPh2YdmnYyPhf5NmeHnf42UUn8R2NI7tYM4PKgucZXonqNb3e2J0Uad9TYiJ1dVIO8qsa4ZqGOEeJfKuzwRmY74rNyPWq6rHIhC6BwJk02buI3S2JxEfL0ZLnjo0gMqsFhETfj3Mrm83iwFz7oIEoMs0tGAO4BOwvNQ1vygjDHoAqRb7XDi7wvB96jlVcbo93wCzQA8xwhxjlgxxgzbXUhzq1BeFQu5ajG3QiUs4MlBrT3hoUFcHexfQg7xa39iGYd3krhdNWkahKKR3wB4O8ut71hFHXHM5JEsAGcF59gqI9qKWvTNhANr2t11n7l06CoMqDvGMmcXri";

  uint8_t output[BLAKE3_OUT_LEN];

  for (unsigned int i = 0; i < 100000000; i++) {
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    blake3_hasher_update(&hasher, hello, strlen(hello));

    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
  }

  for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
    printf("%02x", output[i]);
  }
  printf("\n");

  return 0;
}
michel@debian:~/git/BLAKE3/c$ emcc -O2 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ time nodejs test.js 
b0e1430ecdd18b09c5834a6ddbbc741dcf01df6ed0a1552cb8a7c0eeb31de404

real	2m27,537s
user	2m27,510s
sys	0m0,037s
michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ time nodejs test.js 
b0e1430ecdd18b09c5834a6ddbbc741dcf01df6ed0a1552cb8a7c0eeb31de404

real	1m56,135s
user	1m56,128s
sys	0m0,017s

@monoid
Contributor

monoid commented Mar 29, 2025

Try moving strlen out of the for loop.
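
A minimal sketch of this suggestion, not the PR author's code: the length is measured once before the loop and reused on every iteration. `toy_hash` and `hash_repeatedly` are hypothetical names, and `toy_hash` is an FNV-1a stand-in for the `blake3_hasher_*` calls so that the snippet compiles without `blake3.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the BLAKE3 init/update/finalize sequence
   (FNV-1a, NOT BLAKE3), used only so this sketch is self-contained. */
static uint32_t toy_hash(const void *data, size_t len) {
  const unsigned char *p = data;
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; i++) {
    h ^= p[i];
    h *= 16777619u;
  }
  return h;
}

/* Hash the same message `rounds` times; strlen runs once, not per round. */
static uint32_t hash_repeatedly(const char *msg, unsigned int rounds) {
  assert(msg != NULL);
  const size_t len = strlen(msg); /* hoisted out of the loop */
  uint32_t out = 0;
  for (unsigned int i = 0; i < rounds; i++) {
    out = toy_hash(msg, len);
  }
  return out;
}
```

For a `char[]` array initialised from a string literal, as in the benchmark above, `sizeof hello - 1` yields the same length at compile time.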

@mimi89999
Author

@monoid That should not affect the difference between the two implementations. Could you check how it compares with the Rust WASM and Rust WASM SIMD implementations?

@monoid
Contributor

monoid commented Mar 30, 2025

The Rust implementation has the same problem. You need longer input to see a difference. Start with something around 8-16 KB.
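
One way to try this is to time the hash over a range of input sizes. BLAKE3 splits input into 1 KiB chunks and its SIMD code paths hash several chunks in parallel, which is presumably why a sub-kilobyte string shows little difference. The sketch below is an illustration only: `hash_fn`, `toy_hash`, `seconds_per_round` and `report_sizes` are hypothetical names, and `toy_hash` is an FNV-1a stand-in, not BLAKE3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Signature of the function under test (hypothetical, for this sketch). */
typedef uint32_t (*hash_fn)(const unsigned char *data, size_t len);

/* Stand-in hash (FNV-1a), NOT BLAKE3; replace with a wrapper around the
   blake3_hasher_* calls when building against the real library. */
static uint32_t toy_hash(const unsigned char *data, size_t len) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; i++) {
    h ^= data[i];
    h *= 16777619u;
  }
  return h;
}

/* Hash a len-byte buffer `rounds` times. Returns CPU seconds per round,
   or -1.0 if the buffer cannot be allocated. */
static double seconds_per_round(hash_fn hash, size_t len, unsigned int rounds,
                                uint32_t *digest_out) {
  assert(hash != NULL && rounds > 0);
  unsigned char *buf = malloc(len > 0 ? len : 1);
  if (buf == NULL) {
    return -1.0;
  }
  for (size_t i = 0; i < len; i++) {
    buf[i] = (unsigned char)(i * 31u + 7u); /* deterministic filler */
  }
  uint32_t digest = 0;
  clock_t start = clock();
  for (unsigned int r = 0; r < rounds; r++) {
    if (len > 0) {
      buf[0] = (unsigned char)r; /* vary the input between rounds */
    }
    digest = hash(buf, len);
  }
  clock_t end = clock();
  free(buf);
  if (digest_out != NULL) {
    *digest_out = digest;
  }
  return (double)(end - start) / CLOCKS_PER_SEC / rounds;
}

/* Print throughput for inputs from 1 KiB to 64 KiB, doubling each step. */
static void report_sizes(hash_fn hash, unsigned int rounds) {
  for (size_t len = 1024; len <= 64 * 1024; len *= 2) {
    double s = seconds_per_round(hash, len, rounds, NULL);
    if (s > 0.0) {
      printf("%6zu bytes: %.1f MiB/s\n", len,
             (double)len / (1024.0 * 1024.0) / s);
    } else {
      printf("%6zu bytes: no usable timing, try more rounds\n", len);
    }
  }
}
```

Calling `report_sizes(toy_hash, 100000)` from `main` prints one line per size. Building the same program with and without `-msimd128` would then show at which input size the two builds start to diverge.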

@mimi89999 mimi89999 changed the title from "WIP: Add WASM SIMD C implementation" to "Add WASM SIMD C implementation" Mar 30, 2025
@mimi89999
Author

I see that on x86 the performance is close to that of native BLAKE3 built with SSE2 only.

@monoid
Contributor

monoid commented Mar 30, 2025

Sounds great! Meanwhile, I've managed to start a quickcheck to compare your implementation with a native one.

@mimi89999
Author

Here are some benchmarks:

michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 1.950000
Data size: 1048576
Rounds: 1000
Time per round: 0.001950
michel@debian:~/git/BLAKE3/c$ emcc -O2 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 2.253000
Data size: 1048576
Rounds: 1000
Time per round: 0.002253
michel@debian:~/git/BLAKE3/c$ vim test.c
michel@debian:~/git/BLAKE3/c$ emcc -O2 -msimd128 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c blake3_wasm32_simd.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 19.468000
Data size: 1048576
Rounds: 10000
Time per round: 0.001947
michel@debian:~/git/BLAKE3/c$ emcc -O2 -sSTACK_SIZE=1114112 test.c blake3.c blake3_dispatch.c blake3_portable.c -o test.js
michel@debian:~/git/BLAKE3/c$ nodejs test.js 
b5358909f8bed53f55bf9324e290e9a5a585de8b0239d18040e9d3b0c7e8f9cf
Time: 22.336000
Data size: 1048576
Rounds: 10000
Time per round: 0.002234

@mimi89999
Author

@oconnor663 After some benchmarking, I can't reproduce the huge performance improvements from SIMD instructions on my machine. Nevertheless, it's still about a 15% improvement, so I think it's worth adding. Maybe the difference will be much bigger on other machines. 🤔
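
For reference, the percentage can be derived from the 1 MiB timings posted earlier in the thread. The helpers below are a hypothetical sketch (the names `time_saved` and `speedup` are not from the PR) that expresses the same measurement two ways.

```c
#include <assert.h>

/* Fraction of run time saved by the candidate relative to the baseline. */
static double time_saved(double baseline_s, double candidate_s) {
  assert(baseline_s > 0.0);
  return (baseline_s - candidate_s) / baseline_s;
}

/* Throughput ratio: how many times faster the candidate is. */
static double speedup(double baseline_s, double candidate_s) {
  assert(candidate_s > 0.0);
  return baseline_s / candidate_s;
}
```

With the 1000-round figures (2.253 s portable, 1.950 s SIMD), `speedup` gives about 1.155 and `time_saved` about 0.134. The 10000-round figures (22.336 s and 19.468 s) give about 1.147 and 0.128. So the SIMD build has roughly 15% higher throughput, or equivalently takes roughly 13% less time.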

2 participants