@lucky-bai
Contributor

Motivation

Candle's WASM build currently runs on a single CPU thread, which makes it significantly slower than it could be. This PR provides a working demo of multithreaded WASM support in the phi model example by integrating wasm-bindgen-rayon to leverage the existing Rayon-based parallelism in the CPU backend.

Similar libraries, such as Transformers.js, already support multithreading on CPU, so this work should help bring Candle’s WASM performance closer to parity. See also this discussion on other attempts to run the Moshi 1B STT model in WASM faster than real time.

This is an experimental but functional demo: on my MacBook Pro, running the Phi-1.5 Q4_K model, throughput improved by about 3×, from ~5 tokens/sec to ~16 tokens/sec.

Risks and Limitations

  1. The wasm-bindgen-rayon dependency requires several Rust features that are not yet available on the stable channel, so the build only works with a nightly Rust toolchain.
  2. It also requires the hosting server to send specific COOP/COEP headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy) to enable the SharedArrayBuffer needed for multithreading. Because these headers block cross-origin resources by default, external resources such as Tailwind loaded from a CDN need workarounds.
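
For reference, here is a minimal sketch of the header requirement in point 2. It is not code from this PR: `cross_origin_isolation_headers` and `response_head` are hypothetical helper names, and the sketch assumes a hand-rolled HTTP/1.1 response built with only the standard library. The two header values are the ones browsers commonly require before treating a page as cross-origin isolated and exposing SharedArrayBuffer.

```rust
/// Headers a page is typically served with so the browser treats it as
/// cross-origin isolated, which is a precondition for SharedArrayBuffer.
/// (Hypothetical helper, not part of Candle or wasm-bindgen-rayon.)
pub fn cross_origin_isolation_headers() -> [(&'static str, &'static str); 2] {
    [
        ("Cross-Origin-Opener-Policy", "same-origin"),
        ("Cross-Origin-Embedder-Policy", "require-corp"),
    ]
}

/// Builds the head of an HTTP/1.1 200 response that includes the
/// isolation headers. (Hypothetical helper for illustration only.)
pub fn response_head(content_type: &str, body_len: usize) -> String {
    let mut head = String::from("HTTP/1.1 200 OK\r\n");
    head.push_str(&format!("Content-Type: {content_type}\r\n"));
    head.push_str(&format!("Content-Length: {body_len}\r\n"));
    for (name, value) in cross_origin_isolation_headers() {
        head.push_str(&format!("{name}: {value}\r\n"));
    }
    // A blank line terminates the header section.
    head.push_str("\r\n");
    head
}
```

Calling `response_head("text/html", body.len())` and writing the result before the body would be enough for a local test server. With `require-corp` in effect, cross-origin assets such as a CDN stylesheet load only if they are served with CORS or a suitable `Cross-Origin-Resource-Policy` header, which is where the workarounds come in.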

Despite these limitations, adding multithreading to a WASM model is feasible with minimal code changes, and the performance gains are substantial, so I think it would be worth adding support officially under some kind of experimental feature flag.
