[Draft] Enable CPU multithreading in WASM with Rayon #3063
Motivation
Candle's WASM build currently runs on a single CPU thread, which makes it significantly slower than it could be. This PR provides a working demo of multithreaded WASM support in the phi model example by integrating wasm-bindgen-rayon to leverage the existing Rayon-based parallelism in the CPU backend.
Similar libraries, such as Transformers.js, already support multithreading on CPU, so this work should help bring Candle’s WASM performance closer to parity. See also this discussion on other attempts to run the Moshi 1B STT model in WASM faster than real time.
This is an experimental but functional demo: on my MacBook Pro, running the Phi-1.5 Q4_K model, throughput improved by about 3×, from ~5 tokens/sec to ~16 tokens/sec.
Risks and Limitations
The wasm-bindgen-rayon dependency requires several Rust features that are not yet available on the stable branch, so the toolchain only works on the nightly Rust build.
SharedArrayBuffer is needed for multithreading, and browsers only expose it on cross-origin isolated pages. This necessitates workarounds to load external resources like Tailwind from a CDN that would otherwise be blocked.
Despite these limitations, adding multithreading to a WASM model is feasible with minimal code changes, and the performance gains are substantial, so I think it would be worth adding support officially under some kind of experimental feature flag.
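For reviewers unfamiliar with the SharedArrayBuffer limitation, here is a minimal sketch of the response headers a static file server would need to send so that the page becomes cross-origin isolated. The header names and values are the standard ones browsers check; the function names are hypothetical and are not taken from this PR, and the demo's actual server setup may differ.

```rust
// Sketch only: the header names and values are the standard cross-origin
// isolation headers; the function names are hypothetical, not from this PR.

/// Headers that make a page cross-origin isolated, which browsers require
/// before exposing SharedArrayBuffer to it.
fn cross_origin_isolation_headers() -> [(&'static str, &'static str); 2] {
    [
        ("Cross-Origin-Opener-Policy", "same-origin"),
        ("Cross-Origin-Embedder-Policy", "require-corp"),
    ]
}

/// Builds a minimal HTTP/1.1 response head that carries those headers.
fn response_head(content_type: &str, body_len: usize) -> String {
    let mut head = String::from("HTTP/1.1 200 OK\r\n");
    head.push_str(&format!("Content-Type: {content_type}\r\n"));
    head.push_str(&format!("Content-Length: {body_len}\r\n"));
    for (name, value) in cross_origin_isolation_headers() {
        head.push_str(&format!("{name}: {value}\r\n"));
    }
    // A blank line terminates the header section.
    head.push_str("\r\n");
    head
}

fn main() {
    // Print the response head for an empty HTML document.
    print!("{}", response_head("text/html", 0));
}
```

With `require-corp`, cross-origin resources that do not opt in via CORS or a `Cross-Origin-Resource-Policy` header are blocked, which is likely why assets such as a CDN-hosted Tailwind script need a workaround.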