Add WASM SIMD C implementation #459
The bad performance was caused by a small bug. Now the WASM SIMD implementation is barely faster than the portable one. |
What is your result with -O3? |
Exactly the same |
Here is my benchmark. It's very primitive:
#include "blake3.h"
#include <stdio.h>
#include <string.h>
int main() {
char hello[] = "6OMPP7PLnNz5EzdPaBR7QCcqddoaFBKhFSixrPfZDiVvtuAg7haIm66xafc9nRxlDAIlgIg7VGQw77La6dA3g2qDZyHH9OnoKgTSwfCwIujTXCnN6NSG2RbAyLf8M1YMNGvZsrhrvODEaUxwpvKgKRpVXdzt8ber6aYr9PX95De4zBjHBuGaPh2YdmnYyPhf5NmeHnf42UUn8R2NI7tYM4PKgucZXonqNb3e2J0Uad9TYiJ1dVIO8qsa4ZqGOEeJfKuzwRmY74rNyPWq6rHIhC6BwJk02buI3S2JxEfL0ZLnjo0gMqsFhETfj3Mrm83iwFz7oIEoMs0tGAO4BOwvNQ1vygjDHoAqRb7XDi7wvB96jlVcbo93wCzQA8xwhxjlgxxgzbXUhzq1BeFQu5ajG3QiUs4MlBrT3hoUFcHexfQg7xa39iGYd3krhdNWkahKKR3wB4O8ut71hFHXHM5JEsAGcF59gqI9qKWvTNhANr2t11n7l06CoMqDvGMmcXri";
    uint8_t output[BLAKE3_OUT_LEN];
    for (unsigned int i = 0; i < 100000000; i++) {
        blake3_hasher hasher;
        blake3_hasher_init(&hasher);
        blake3_hasher_update(&hasher, hello, strlen(hello));
        blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
    }
    for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
        printf("%02x", output[i]);
    }
    printf("\n");
    return 0;
}
|
Try moving strlen out of the for loop. |
@monoid That should not affect the difference between the two implementations. Could you check how it compares with the Rust WASM and Rust WASM SIMD implementations? |
Same problem with the Rust implementation. One needs longer input to see a difference. Start with something around 8-16 KB. |
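To make the two suggestions above concrete (compute the input length once, and hash larger inputs), here is a rough sketch of a size-sweep harness. Everything in it is an assumption for illustration: `hash_fn`, `placeholder_hash`, `fill_pattern`, `throughput_mib_per_s` and `run_sweep` are hypothetical names, and `placeholder_hash` is plain FNV-1a, not BLAKE3. It only exists so the harness compiles without `blake3.h`. In a real comparison it would be swapped for a thin wrapper around `blake3_hasher_init`, `blake3_hasher_update` and `blake3_hasher_finalize`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DIGEST_LEN 32

/* Hypothetical signature of the function under test. */
typedef void (*hash_fn)(const uint8_t *input, size_t len,
                        uint8_t out[DIGEST_LEN]);

/* Placeholder only (NOT BLAKE3): 64-bit FNV-1a, repeated to fill the output. */
static void placeholder_hash(const uint8_t *input, size_t len,
                             uint8_t out[DIGEST_LEN]) {
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= input[i];
        h *= 0x100000001b3ULL;               /* FNV-1a prime */
    }
    for (size_t i = 0; i < DIGEST_LEN; i++) {
        out[i] = (uint8_t)(h >> (8 * (i % 8)));
    }
}

/* Deterministic filler so every run hashes the same bytes. */
static void fill_pattern(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        buf[i] = (uint8_t)(i % 251);
    }
}

/* Hashes `len` bytes `iterations` times. The length is passed in, so it is
 * never recomputed inside the timed loop. Returns MiB/s, or 0.0 if the run
 * was too short for clock() to register. */
static double throughput_mib_per_s(hash_fn hash, const uint8_t *input,
                                   size_t len, unsigned iterations,
                                   uint8_t out[DIGEST_LEN]) {
    assert(hash != NULL);
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++) {
        hash(input, len, out);
    }
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) {
        return 0.0;
    }
    return ((double)len * iterations) / (1024.0 * 1024.0) / seconds;
}

/* Sweeps input sizes from 512 bytes to 16 KiB, hashing roughly
 * `total_bytes` per size so each row does comparable work. */
static void run_sweep(hash_fn hash, size_t total_bytes) {
    static const size_t sizes[] = {512, 1024, 4096, 8192, 16384};
    const size_t max_len = 16384;
    uint8_t out[DIGEST_LEN];
    uint8_t *buf = malloc(max_len);
    if (buf == NULL) {
        return;
    }
    fill_pattern(buf, max_len);
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        unsigned iterations = (unsigned)(total_bytes / sizes[i]);
        double rate = throughput_mib_per_s(hash, buf, sizes[i], iterations, out);
        printf("%6zu bytes: %.1f MiB/s\n", sizes[i], rate);
    }
    free(buf);
}
```

Calling `run_sweep(placeholder_hash, (size_t)64 << 20)` from `main` prints one line per input size. Running the same sweep once per build (portable and SIMD) should show at which input size the two start to diverge. Note that `clock()` measures CPU time and is coarse, so very short runs are reported as 0.0 rather than as a misleading number.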
I see that the performance is near native on x86 compared to BLAKE3 with SSE2 only. |
Sounds great! Meanwhile, I've managed to start a quickcheck to compare your implementation with a native one. |
Here are some benchmarks:
|
@oconnor663 After some benchmarking, I can't replicate the huge performance improvements of using SIMD instructions on my machine. Nevertheless, it's still a 15% improvement. I think that it's worth adding it. Maybe on other machines, the difference will be much bigger. 🤔 |
I tested it in NodeJS and it seems 2x slower 😞
Update: It's faster!