Support Fused MoE & Qwen3 GGUF MoE models #3221
This PR introduces support for the Fused MoE kernel for both unquantized and quantized models.
Qwen3 MoE GGUF models are now fully supported by leveraging the dedicated MoE kernel developed for the Candle ecosystem.
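For context, a fused MoE kernel folds the per-token expert routing and the expert dispatch, which would otherwise be separate kernel launches, into one. The sketch below is a rough illustration only: a naive candle-style version of the routing step such a kernel absorbs, not the kernel or API added in this PR. The function name, shapes, and the renormalization step are assumptions based on typical top-k MoE routing (Qwen3 MoE's config enables `norm_topk_prob`, hence the renormalize step).

```rust
use candle_core::{Result, Tensor};
use candle_nn::ops::softmax_last_dim;

/// Naive top-k expert routing: the step a fused MoE kernel performs
/// in a single launch together with the per-expert matmuls.
/// (Hypothetical sketch, not this PR's API.)
fn route_top_k(hidden: &Tensor, router_w: &Tensor, top_k: usize) -> Result<(Tensor, Tensor)> {
    // (tokens, hidden) x (hidden, n_experts) -> per-expert logits.
    let logits = hidden.matmul(&router_w.t()?)?;
    let probs = softmax_last_dim(&logits)?;
    // Rank experts by probability (descending) and keep the top-k ids.
    let ranked = probs.arg_sort_last_dim(false)?;
    let expert_ids = ranked.narrow(1, 0, top_k)?.contiguous()?;
    // Gather the matching routing weights and renormalize so the selected
    // experts' weights sum to 1 (as Qwen3 MoE's norm_topk_prob implies).
    let weights = probs.gather(&expert_ids, 1)?;
    let weights = weights.broadcast_div(&weights.sum_keepdim(1)?)?;
    Ok((expert_ids, weights))
}
```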
🔧 Usage Examples
Local GGUF File
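A local GGUF file can presumably be passed in directly; the `--model` flag and file path below are illustrative assumptions rather than the example's confirmed CLI:

```bash
# Assumed flags: --model pointing at a local .gguf file.
cargo run --features cuda --example quantized-qwen3-moe --release -- \
  --model ./qwen3-moe-32b-q4_k.gguf \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```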
Load from Hugging Face
```bash
cargo run --features cuda --example quantized-qwen3-moe --release -- --which 32b_q4k --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```

Available presets via the `--which` argument: `16b_q2k`, `16b_q4k`, `16b_q6k`, `16b_q80`, `32b_q2k`, `32b_q4k`, `32b_q6k`, `32b_q80`

Unquantized Model (Fused MoE Kernel)
Run the unquantized Qwen3-32B-A3B model using the fused MoE kernel (⚠ requires ~80GB GPU memory):
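Something along these lines; the example name and flags are hypothetical, as the PR's exact CLI may differ:

```bash
# Hypothetical example name and flags for local safetensors weights.
cargo run --features cuda --example qwen3-moe --release -- \
  --weight-path /path/to/Qwen3-32B-A3B \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```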
Or load remotely:
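Again a sketch, assuming the example resolves a Hugging Face model id when no local weights are given (the `--model-id` flag and repo id are assumptions):

```bash
# Hypothetical: fetches weights from the Hugging Face Hub by model id.
cargo run --features cuda --example qwen3-moe --release -- \
  --model-id Qwen/Qwen3-32B-A3B \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```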
📝 Testing Status
Full inference on the unquantized Qwen3-32B-A3B model has not been completed here due to GPU memory limitations (only the first 20 layers were tested).
However, the added code path has already been verified in other projects, including candle-vllm and vllm.rs, where it runs correctly under multi-rank configurations.
Running full inference on unquantized 32B+ models will likely require a multi-rank / multi-GPU example. 🔜