@TheoLee72 commented Dec 8, 2025

Summary

Adds support for quantized Qwen3 MoE (Mixture of Experts) models to Candle. Implements efficient sparse MoE inference with top-k expert routing and weighted aggregation, supporting both 30B-A3B and 235B-A22B variants.

Changes

Core Model Implementation

  • Sparse MoE architecture: Implements 128-expert MoE with top-8 routing per token using SparseMoeBlockWeights
  • Weighted expert aggregation: The router computes softmax probabilities over the experts; each token's output is the sum of its selected experts' outputs scaled by routing_weights
  • Hybrid layers: Supports both MoE and dense MLP layers via MoeOrMlpWeights enum based on decoder_sparse_step
  • GGUF metadata parsing: Extracts the Qwen3-MoE-specific config (expert_count, expert_used_count, expert_feed_forward_length) from GGUF files, as sketched after this list
  • Quantization support: Leverages existing QMatMul and indexed_moe_forward for efficient quantized inference
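As a rough illustration of the metadata step, here is a minimal sketch using Candle's gguf_file API. The qwen3moe.* key names follow the llama.cpp naming convention and are an assumption here, not a quote from this PR's code:

```rust
use candle_core::quantized::gguf_file;
use candle_core::Result;

// Reads the MoE-specific fields out of a parsed GGUF file. The key names
// (`qwen3moe.*`) follow the llama.cpp convention and are assumed, not
// copied from the PR.
fn read_moe_metadata(ct: &gguf_file::Content) -> Result<(usize, usize, usize)> {
    let md_get = |s: &str| match ct.metadata.get(s) {
        None => candle_core::bail!("cannot find {s} in metadata"),
        Some(v) => Ok(v),
    };
    let expert_count = md_get("qwen3moe.expert_count")?.to_u32()? as usize;
    let expert_used_count = md_get("qwen3moe.expert_used_count")?.to_u32()? as usize;
    let expert_ffn_len = md_get("qwen3moe.expert_feed_forward_length")?.to_u32()? as usize;
    Ok((expert_count, expert_used_count, expert_ffn_len))
}
```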

Technical Details

  • Router precision: Converts hidden states to F32 for the router forward pass so the softmax stays numerically stable
  • Expert selection: Uses arg_sort_last_dim + narrow to extract the top-k expert indices (see the sketch after this list)
  • Runtime tuning: num_experts_per_tok can be overridden at inference time for speed/quality tradeoff
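To make the routing concrete, here is a minimal, self-contained sketch of the top-k selection under the assumptions above; the function name and shapes are illustrative, not the PR's code:

```rust
use candle_core::{DType, Result, Tensor, D};
use candle_nn::ops::softmax_last_dim;

/// Pick the top-k experts per token. `router_logits` has shape
/// (n_tokens, n_experts); both returned tensors have shape (n_tokens, k).
fn topk_routing(router_logits: &Tensor, k: usize) -> Result<(Tensor, Tensor)> {
    // Softmax in F32 for numerical stability, as described above.
    let probs = softmax_last_dim(&router_logits.to_dtype(DType::F32)?)?;
    // arg_sort_last_dim(false) sorts each row in descending order of
    // probability; narrow keeps the first k columns, i.e. the top-k experts.
    let sorted = probs.arg_sort_last_dim(false)?;
    let experts = sorted.narrow(D::Minus1, 0, k)?.contiguous()?;
    // Gather the matching probabilities to use as routing weights.
    let weights = probs.gather(&experts, D::Minus1)?;
    Ok((weights, experts))
}
```

The MoE block then runs each selected expert on its tokens and sums the outputs scaled by these weights; the new indexed_moe_forward on QMatMul is presumably what batches that quantized matmul over the selected expert indices.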

Motivation

Qwen3 MoE models offer superior efficiency compared to dense models by activating only 3.3B/22B parameters per token while maintaining 30B/235B total capacity. Quantized GGUF format (Q4_K_M) enables running these models on consumer hardware. This implementation fills a gap in Candle's model zoo for efficient large-scale MoE inference.

Breaking Changes

None. This is a new model addition; the only change to existing code is a new, purely additive indexed_moe_forward method on the QMatMul wrapper.

Validation

Functional Testing

30B Model (CPU):

```bash
cargo run --example quantized-qwen3-moe --release -- \
  --prompt "Write a Rust function to calculate the factorial of a given number."
```

235B Model (CUDA):

```bash
cargo run --example quantized-qwen3-moe --release --features cuda -- \
  --which 235b \
  --prompt "Explain the difference between MoE and dense architectures."
```

Expected output: valid Qwen3-style responses, with tokens/sec throughput printed.



@TheoLee72 TheoLee72 changed the title create quantized qwen3 moe module add quantized qwen3 moe module Dec 8, 2025
@TheoLee72 TheoLee72 marked this pull request as draft December 8, 2025 15:52
@TheoLee72 TheoLee72 marked this pull request as ready for review December 9, 2025 16:05