
Conversation

@guoqingbao
Contributor

This PR introduces support for a fused MoE kernel for both unquantized and quantized models.
Qwen3 MoE GGUF models are now fully supported, leveraging the dedicated MoE kernel developed for the Candle ecosystem.
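For context on what the kernel fuses: per token, a router selects the top-k experts, each selected expert FFN runs on the token, and the expert outputs are summed with the (renormalized) router weights. The plain-Rust sketch below is only an unfused reference of that math, with illustrative names and shapes; it is not the kernel's API, just the computation the fused kernel collapses into a single launch without materializing per-expert intermediates.

```rust
// Unfused reference of the MoE routing math (illustrative only).

/// Softmax over a slice of router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// For one token: pick the top-k experts by router probability and return
/// (expert_index, renormalized_weight) pairs.
fn top_k_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(router_logits);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let chosen: Vec<usize> = idx.into_iter().take(k).collect();
    let norm: f32 = chosen.iter().map(|&e| probs[e]).sum();
    chosen.into_iter().map(|e| (e, probs[e] / norm)).collect()
}

/// Unfused MoE forward for one token: run each selected expert FFN and
/// accumulate its output weighted by the router probability.
fn moe_forward_one_token<E: Fn(&[f32]) -> Vec<f32>>(
    hidden: &[f32],
    router_logits: &[f32],
    experts: &[E],
    top_k: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden.len()];
    for (expert_id, weight) in top_k_route(router_logits, top_k) {
        let y = experts[expert_id](hidden);
        for (o, v) in out.iter_mut().zip(y) {
            *o += weight * v;
        }
    }
    out
}
```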

🔧 Usage Examples

Local GGUF File

cargo run --features cuda --example quantized-qwen3-moe --release -- --model /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

💡 Performance note: The quantized Qwen3-30B-A3B model achieves ~80 tokens/s on an A100 40GB PCIe (single GPU) in other projects such as candle-vllm and vllm.rs. Performance here may vary depending on runtime and configuration.

Load from Hugging Face

cargo run --features cuda --example quantized-qwen3-moe --release -- --which 32b_q4k --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"

Available presets via --which argument:
16b_q2k, 16b_q4k, 16b_q6k, 16b_q80, 32b_q2k, 32b_q4k, 32b_q6k, 32b_q80

Unquantized Model (Fused MoE Kernel)

Run the unquantized Qwen3-30B-A3B model using the fused MoE kernel (⚠ requires ~80GB GPU memory: roughly 30B parameters in bf16 ≈ 60GB of weights, plus activations and KV cache):

cargo run --example qwen --features cuda --release -- --prompt "Write a poem about butterflies." --model "3-moe-a3b" --weight-path /data/shared/Qwen3-30B-A3B-Instruct-2507

Or load remotely:

cargo run --example qwen --features cuda --release -- --prompt "Write a poem about butterflies." --model "3-moe-a3b"

📝 Testing Status

Full inference on the unquantized Qwen3-30B-A3B model has not been completed here due to GPU memory limitations (only the first 20 layers were tested).

However, the added code path has already been verified in other projects, including candle-vllm and vllm.rs, where it runs correctly under multi-rank configurations.

To run full inference on unquantized models at this scale, a multi-rank / multi-GPU example will likely be required. 🔜

@guoqingbao
Contributor Author

@ivarflakstad Could you help review this PR?

I also made a few additional changes, including exposing the device pointer of QTensor and updating the kernel build to support generating both the PTX and the library file at the same time (via a custom bindgen_cuda crate).
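For illustration only, here is a rough sketch of what exposing the device pointer enables: handing a quantized tensor's raw CUDA buffer straight to a custom kernel instead of dequantizing it first. Every name below (fused_moe_gguf_kernel, the pointer parameters, the wrapper) is a hypothetical placeholder for this sketch, not the API actually added in this PR.

```rust
// Hypothetical plumbing sketch: a quantized tensor's device pointer is passed
// directly to a custom fused-MoE kernel via FFI. Symbol and parameter names
// are assumptions for illustration; linking requires the compiled kernel
// library that the build (PTX + library) would produce.

use std::ffi::c_void;

extern "C" {
    // Assumed symbol exported by the compiled MoE kernel library.
    fn fused_moe_gguf_kernel(
        hidden: *const c_void,        // activations (device memory)
        q_experts: *const c_void,     // quantized expert weights (device memory)
        router_logits: *const c_void, // router outputs (device memory)
        out: *mut c_void,             // output buffer (device memory)
        num_tokens: i32,
        num_experts: i32,
        top_k: i32,
    );
}

/// Hypothetical wrapper: `q_experts_device_ptr` stands in for whatever
/// accessor exposes the quantized tensor's raw device pointer.
unsafe fn launch_fused_moe(
    hidden_ptr: *const c_void,
    q_experts_device_ptr: *const c_void,
    router_ptr: *const c_void,
    out_ptr: *mut c_void,
    num_tokens: i32,
    num_experts: i32,
    top_k: i32,
) {
    fused_moe_gguf_kernel(
        hidden_ptr,
        q_experts_device_ptr,
        router_ptr,
        out_ptr,
        num_tokens,
        num_experts,
        top_k,
    );
}
```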

I added a link to vllm.rs in the README as well (in the section for Candle-based projects) — hope that’s okay!

@ivarflakstad
Member

Thanks for this, looks great!

Only able to give a preliminary review rn. I'll go deeper as soon as I have the time 👍

@guoqingbao
Copy link
Contributor Author

> Thanks for this, looks great!
>
> Only able to give a preliminary review rn. I'll go deeper as soon as I have the time 👍

Thanks @ivarflakstad for the timely review — I’ve fixed the typos.
