add quantized qwen3 moe module #3232
Summary
Adds support for quantized Qwen3 MoE (Mixture of Experts) models to Candle. Implements efficient sparse MoE inference with top-k expert routing and weighted aggregation, supporting both 30B-A3B and 235B-A22B variants.
Changes
Core Model Implementation
- New `SparseMoeBlockWeights` struct holding the `routing_weights` and per-expert weights
- `MoeOrMlpWeights` enum that selects a sparse MoE block or a dense MLP per layer based on `decoder_sparse_step`
- Reads MoE metadata (`expert_count`, `expert_used_count`, `expert_feed_forward_length`) from GGUF files
- Uses `QMatMul` and `indexed_moe_forward` for efficient quantized inference

Technical Details
- Uses `arg_sort_last_dim` + `narrow` to extract the top-k expert indices
- `num_experts_per_tok` can be overridden at inference time for a speed/quality tradeoff

Motivation
Qwen3 MoE models offer superior efficiency compared to dense models by activating only 3.3B/22B parameters per token while maintaining 30B/235B total capacity. Quantized GGUF format (Q4_K_M) enables running these models on consumer hardware. This implementation fills a gap in Candle's model zoo for efficient large-scale MoE inference.
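The top-k routing described under Technical Details (sort expert logits, keep the top-k indices, then normalize and aggregate) can be sketched in plain Rust. This is an illustrative sketch only: the real implementation runs these steps on quantized tensors via `arg_sort_last_dim` and `narrow`, and the function name here is hypothetical.

```rust
// Illustrative top-k expert routing for a single token (plain Rust,
// no candle types). Mirrors: arg_sort_last_dim -> narrow -> softmax.
fn top_k_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Sort expert indices by logit, descending (mirrors arg_sort_last_dim).
    let mut idx: Vec<usize> = (0..router_logits.len()).collect();
    idx.sort_by(|&a, &b| router_logits[b].partial_cmp(&router_logits[a]).unwrap());
    // Keep only the top-k indices (mirrors narrow on the sorted indices).
    let top: Vec<usize> = idx.into_iter().take(k).collect();
    // Softmax over the selected logits gives the routing weights.
    let max = top.iter().map(|&i| router_logits[i]).fold(f32::MIN, f32::max);
    let exps: Vec<f32> = top.iter().map(|&i| (router_logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    top.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

fn main() {
    // Four experts; the router picks the two with the highest logits.
    let routes = top_k_route(&[0.1, 2.0, -1.0, 1.5], 2);
    println!("{routes:?}"); // selected (expert, weight) pairs, weights sum to 1
}
```

Overriding `num_experts_per_tok` simply changes `k` in this picture, trading quality (more experts) against speed (fewer expert matmuls).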
Breaking Changes
None - this is a new model addition with no modifications to existing APIs.
The only API addition is an `indexed_moe_forward` method on the `QMatMul` wrapper.
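To make the intent of that method concrete, here is a dense-f32 sketch of what an indexed MoE forward computes: each token is multiplied only by the weight matrices of its selected experts, and the outputs are combined with the routing weights. The real method operates on quantized tensors; the signature and names below are hypothetical.

```rust
// Hypothetical dense sketch of an indexed MoE forward pass.
// expert_weights[e] is a (d_out x d_in) matrix stored row-major.
fn indexed_moe_forward(
    expert_weights: &[Vec<f32>],
    d_in: usize,
    d_out: usize,
    x: &[f32],               // one token's hidden state, length d_in
    routes: &[(usize, f32)], // (expert index, routing weight) pairs
) -> Vec<f32> {
    let mut y = vec![0.0f32; d_out];
    for &(e, w) in routes {
        let m = &expert_weights[e];
        for o in 0..d_out {
            let mut acc = 0.0f32;
            for i in 0..d_in {
                acc += m[o * d_in + i] * x[i]; // matmul with one expert only
            }
            y[o] += w * acc; // weighted aggregation across selected experts
        }
    }
    y
}

fn main() {
    // Two 2x2 "experts": identity and doubling.
    let experts = vec![vec![1.0, 0.0, 0.0, 1.0], vec![2.0, 0.0, 0.0, 2.0]];
    let y = indexed_moe_forward(&experts, 2, 2, &[1.0, 3.0], &[(0, 0.5), (1, 0.5)]);
    println!("{y:?}"); // 0.5*x + 0.5*(2x) = 1.5x
}
```

The point of indexing is that only the top-k experts' matmuls run per token, which is what keeps the active parameter count at 3.3B/22B despite the 30B/235B totals.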
Validation
Functional Testing
30B Model (CPU):
```bash
cargo run --example quantized-qwen3-moe --release -- \
  --prompt "Write a Rust function to calculate the factorial of a given number."
```

235B Model (CUDA):

```bash
cargo run --example quantized-qwen3-moe --release --features cuda -- \
  --which 235b \
  --prompt "Explain the difference between MoE and dense architectures."
```

Expected output: Valid Qwen3-style responses with token/sec metrics printed.