@TheoLee72 commented Dec 8, 2025

Summary

Adds support for quantized Qwen3 MoE (Mixture of Experts) models to Candle. Implements efficient sparse MoE inference with top-k expert routing and weighted aggregation, supporting both 30B-A3B and 235B-A22B variants.

Changes

Core Model Implementation

  • Sparse MoE architecture: Implements 128-expert MoE with top-8 routing per token using SparseMoeBlockWeights
  • Weighted expert aggregation: The router computes softmax probabilities over the experts; each token's output is the sum of its selected experts' outputs scaled by routing_weights
  • Hybrid layers: Supports both MoE and dense MLP layers via MoeOrMlpWeights enum based on decoder_sparse_step
  • GGUF metadata parsing: Extracts the Qwen3-MoE-specific config (expert_count, expert_used_count, expert_feed_forward_length) from GGUF files, as sketched after this list
  • Quantization support: Leverages existing QMatMul and indexed_moe_forward for efficient quantized inference
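As a rough illustration of the metadata step, here is a minimal sketch using Candle's gguf_file API. The qwen3moe.* key names follow the llama.cpp naming convention and are an assumption here, not a quote from this PR's code:

```rust
use candle_core::quantized::gguf_file;
use candle_core::Result;

// Reads the MoE-specific fields out of a parsed GGUF file. The key names
// (`qwen3moe.*`) follow the llama.cpp convention and are assumed, not
// copied from the PR.
fn read_moe_metadata(ct: &gguf_file::Content) -> Result<(usize, usize, usize)> {
    let md_get = |s: &str| match ct.metadata.get(s) {
        None => candle_core::bail!("cannot find {s} in metadata"),
        Some(v) => Ok(v),
    };
    let expert_count = md_get("qwen3moe.expert_count")?.to_u32()? as usize;
    let expert_used_count = md_get("qwen3moe.expert_used_count")?.to_u32()? as usize;
    let expert_ffn_len = md_get("qwen3moe.expert_feed_forward_length")?.to_u32()? as usize;
    Ok((expert_count, expert_used_count, expert_ffn_len))
}
```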

Technical Details

  • Router precision: Converts hidden states to F32 for the router forward pass so the softmax stays numerically stable
  • Expert selection: Uses arg_sort_last_dim + narrow to extract the top-k expert indices (see the sketch after this list)
  • Runtime tuning: num_experts_per_tok can be overridden at inference time for speed/quality tradeoff
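To make the routing concrete, here is a minimal, self-contained sketch of the top-k selection under the assumptions above; the function name and shapes are illustrative, not the PR's code:

```rust
use candle_core::{DType, Result, Tensor, D};
use candle_nn::ops::softmax_last_dim;

/// Pick the top-k experts per token. `router_logits` has shape
/// (n_tokens, n_experts); both returned tensors have shape (n_tokens, k).
fn topk_routing(router_logits: &Tensor, k: usize) -> Result<(Tensor, Tensor)> {
    // Softmax in F32 for numerical stability, as described above.
    let probs = softmax_last_dim(&router_logits.to_dtype(DType::F32)?)?;
    // arg_sort_last_dim(false) sorts each row in descending order of
    // probability; narrow keeps the first k columns, i.e. the top-k experts.
    let sorted = probs.arg_sort_last_dim(false)?;
    let experts = sorted.narrow(D::Minus1, 0, k)?.contiguous()?;
    // Gather the matching probabilities to use as routing weights.
    let weights = probs.gather(&experts, D::Minus1)?;
    Ok((weights, experts))
}
```

The MoE block then runs each selected expert on its tokens and sums the outputs scaled by these weights; the new indexed_moe_forward on QMatMul is presumably what batches that quantized matmul over the selected expert indices.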

Motivation

Qwen3 MoE models offer superior efficiency compared to dense models by activating only 3.3B/22B parameters per token while maintaining 30B/235B total capacity. Quantized GGUF format (Q4_K_M) enables running these models on consumer hardware. This implementation fills a gap in Candle's model zoo for efficient large-scale MoE inference.

Breaking Changes

None. This is a new model addition; the only change to existing code is a new, purely additive indexed_moe_forward method on the QMatMul wrapper.

Validation

Functional Testing

30B Model (CPU):

```bash
cargo run --example quantized-qwen3-moe --release -- \
  --prompt "Write a Rust function to calculate the factorial of a given number."
```

235B Model (CUDA):

```bash
cargo run --example quantized-qwen3-moe --release --features cuda -- \
  --which 235b \
  --prompt "Explain the difference between MoE and dense architectures."
```

Expected output: valid Qwen3-style responses, with tokens/sec throughput printed.



@TheoLee72 TheoLee72 changed the title create quantized qwen3 moe module add quantized qwen3 moe module Dec 8, 2025
@TheoLee72 TheoLee72 marked this pull request as draft December 8, 2025 15:52
@TheoLee72 TheoLee72 marked this pull request as ready for review December 9, 2025 16:05