
Conversation

@guoqingbao
Contributor

This PR introduces support for a fused MoE kernel for both unquantized and quantized models.
Qwen3 MoE GGUF models are now fully supported, leveraging the dedicated MoE kernel developed for the Candle ecosystem.
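For context on what the kernel fuses: per token, a router selects the top-k experts, each selected expert FFN runs on the token, and the expert outputs are summed with the (renormalized) router weights. The plain-Rust sketch below is only an unfused reference of that math, with illustrative names and shapes; it is not the kernel's API, just the computation the fused kernel collapses into a single launch without materializing per-expert intermediates.

```rust
// Unfused reference of the MoE routing math (illustrative only).

/// Softmax over a slice of router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// For one token: pick the top-k experts by router probability and return
/// (expert_index, renormalized_weight) pairs.
fn top_k_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(router_logits);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let chosen: Vec<usize> = idx.into_iter().take(k).collect();
    let norm: f32 = chosen.iter().map(|&e| probs[e]).sum();
    chosen.into_iter().map(|e| (e, probs[e] / norm)).collect()
}

/// Unfused MoE forward for one token: run each selected expert FFN and
/// accumulate its output weighted by the router probability.
fn moe_forward_one_token<E: Fn(&[f32]) -> Vec<f32>>(
    hidden: &[f32],
    router_logits: &[f32],
    experts: &[E],
    top_k: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden.len()];
    for (expert_id, weight) in top_k_route(router_logits, top_k) {
        let y = experts[expert_id](hidden);
        for (o, v) in out.iter_mut().zip(y) {
            *o += weight * v;
        }
    }
    out
}
```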

🔧 Usage Examples

Local GGUF File

cargo run --features cuda --example quantized-qwen3-moe --release -- --model /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

💡 Performance note: The quantized Qwen3-30B-A3B model achieves ~80 tokens/s on an A100 40GB PCIe (single GPU) in other projects such as candle-vllm and vllm.rs. Performance here may vary depending on runtime and configuration.

Load from Hugging Face

cargo run --features cuda --example quantized-qwen3-moe --release -- --which 32b_q4k --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"

Available presets via --which argument:
16b_q2k, 16b_q4k, 16b_q6k, 16b_q80, 32b_q2k, 32b_q4k, 32b_q6k, 32b_q80

Unquantized Model (Fused MoE Kernel)

Run the unquantized Qwen3-30B-A3B model using the fused MoE kernel (⚠ requires ~80GB GPU memory: roughly 30B parameters in bf16 ≈ 60GB of weights, plus activations and KV cache):

cargo run --example qwen --features cuda --release -- --prompt "Write a poem about butterflies." --model "3-moe-a3b" --weight-path /data/shared/Qwen3-30B-A3B-Instruct-2507

Or load remotely:

cargo run --example qwen --features cuda --release -- --prompt "Write a poem about butterflies." --model "3-moe-a3b"

📝 Testing Status

Full inference on the unquantized Qwen3-30B-A3B model has not been completed here due to GPU memory limitations (only the first 20 layers were tested).

However, the added code path has already been verified in other projects, including candle-vllm and vllm.rs, where it runs correctly under multi-rank configurations.

To run full inference on unquantized models at this scale, a multi-rank / multi-GPU example will likely be required. 🔜

@guoqingbao
Contributor Author

@ivarflakstad Could you help review this PR?

I also made a few additional changes, including exposing the device pointer of QTensor and updating the kernel build to support generating both the PTX and the library file at the same time (via a custom bindgen_cuda crate).
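For illustration only, here is a rough sketch of what exposing the device pointer enables: handing a quantized tensor's raw CUDA buffer straight to a custom kernel instead of dequantizing it first. Every name below (fused_moe_gguf_kernel, the pointer parameters, the wrapper) is a hypothetical placeholder for this sketch, not the API actually added in this PR.

```rust
// Hypothetical plumbing sketch: a quantized tensor's device pointer is passed
// directly to a custom fused-MoE kernel via FFI. Symbol and parameter names
// are assumptions for illustration; linking requires the compiled kernel
// library that the build (PTX + library) would produce.

use std::ffi::c_void;

extern "C" {
    // Assumed symbol exported by the compiled MoE kernel library.
    fn fused_moe_gguf_kernel(
        hidden: *const c_void,        // activations (device memory)
        q_experts: *const c_void,     // quantized expert weights (device memory)
        router_logits: *const c_void, // router outputs (device memory)
        out: *mut c_void,             // output buffer (device memory)
        num_tokens: i32,
        num_experts: i32,
        top_k: i32,
    );
}

/// Hypothetical wrapper: `q_experts_device_ptr` stands in for whatever
/// accessor exposes the quantized tensor's raw device pointer.
unsafe fn launch_fused_moe(
    hidden_ptr: *const c_void,
    q_experts_device_ptr: *const c_void,
    router_ptr: *const c_void,
    out_ptr: *mut c_void,
    num_tokens: i32,
    num_experts: i32,
    top_k: i32,
) {
    fused_moe_gguf_kernel(
        hidden_ptr,
        q_experts_device_ptr,
        router_ptr,
        out_ptr,
        num_tokens,
        num_experts,
        top_k,
    );
}
```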

I added a link to vllm.rs in the README as well (in the section for Candle-based projects) — hope that’s okay!

@ivarflakstad
Member

Thanks for this, looks great!

Only able to give a preliminary review rn. I'll go deeper as soon as I have the time 👍

@guoqingbao
Copy link
Contributor Author

> Thanks for this, looks great!
>
> Only able to give a preliminary review rn. I'll go deeper as soon as I have the time 👍

Thanks @ivarflakstad for the timely review — I’ve fixed the typos.
