[model_free_ptq] split fused moe experts to ensure quantization #2464
liwei109 wants to merge 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the model-free post-training quantization (PTQ) capabilities by addressing a specific challenge with Mixture-of-Experts (MoE) models that have fused expert weights. By introducing a mechanism to split these fused weights into their individual components, the PR ensures that such models can be accurately and effectively quantized, expanding the range of models supported by the quantization framework.
Code Review
This pull request introduces a mechanism to split fused MoE experts, which is a necessary step for quantizing models like Qwen3.5. The approach is sound, but I've found a few critical bugs in the implementation of split_fused_moe_experts. Specifically, the logic for generating new tensor names is incorrect and will lead to invalid keys. Additionally, there's a case where a tensor could be inadvertently dropped. I've also included some suggestions to improve code clarity and logging practices. Please address the identified bugs to ensure the feature works as expected.
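To illustrate the dropped-tensor failure mode flagged above, here is a hedged sketch of a sanity check (the helper name `check_split` and the key patterns are hypothetical illustrations, not code from this PR) that compares element counts before and after splitting and verifies that no fused key survives:

```python
# Hypothetical sanity check for a fused-MoE split; uses NumPy arrays as
# stand-ins for checkpoint tensors. Key names modeled on Qwen3-style
# fused layouts, not taken from this PR.
import numpy as np


def check_split(original: dict, split: dict) -> None:
    """Verify a split state dict: no tensor silently dropped and no
    fused key left behind."""
    # Splitting only renames and reshapes, so the total element count
    # must be preserved exactly.
    total_before = sum(t.size for t in original.values())
    total_after = sum(t.size for t in split.values())
    assert total_before == total_after, "tensor dropped or duplicated"
    for key in split:
        # Every fused key must have been replaced by per-expert keys.
        assert not key.endswith(
            ("experts.gate_up_proj", "experts.down_proj")
        ), f"fused key left behind: {key}"
```

A check like this would have caught both the invalid generated names and the inadvertently dropped tensor before the model ever reached the quantization step.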
HDCharles
left a comment
See my comments and the bot comments. Did you run the test successfully? It looks like the x.weight name bug should have produced an error, no?
The quality checks have failed. Please run
Yes, I have tested this on Qwen3.5 and Qwen3-VL series models, and all of them work well.
@HDCharles This PR is ready for review. Thanks!
brian-dellabetta
left a comment
Thanks @liwei109 for preparing this. I agree that this logic is needed to mimic what is done via the Qwen3.5 MoE context in llm-compressor during oneshot. Unfortunately, I don't see a super clean way to leverage that directly without additional code like you've added here.
Will make a note to discuss internally, cc @kylesayrs
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Li Wei <[email protected]>
SUMMARY:
This PR adds a `split_fused_moe_experts` function for model-free quantization. This ensures that models with fused MoE layers (e.g., Qwen3.5 and Qwen3-VL) containing fused `gate_up_proj` and `down_proj` weights can be effectively quantized.

TEST PLAN:
We add a Qwen3.5 example for testing purposes.
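For illustration, a minimal sketch of what such a split can look like (using NumPy arrays as stand-ins for checkpoint tensors; the fused key patterns and per-expert naming below are assumptions modeled on Qwen3-style MoE layouts, not the actual code in this PR):

```python
# Hypothetical sketch, not the PR implementation: split fused MoE expert
# weights into per-expert tensors so each expert can be quantized
# independently. Fused layouts assumed:
#   experts.gate_up_proj -> [num_experts, hidden, 2 * intermediate]
#   experts.down_proj    -> [num_experts, intermediate, hidden]
import numpy as np


def split_fused_moe_experts(state_dict: dict) -> dict:
    out = {}
    for name, tensor in state_dict.items():
        if name.endswith("experts.gate_up_proj"):
            prefix = name[: -len("gate_up_proj")]  # ends with "experts."
            # Split the fused gate/up halves along the last axis.
            gate, up = np.split(tensor, 2, axis=-1)
            for i in range(tensor.shape[0]):
                out[f"{prefix}{i}.gate_proj.weight"] = gate[i]
                out[f"{prefix}{i}.up_proj.weight"] = up[i]
        elif name.endswith("experts.down_proj"):
            prefix = name[: -len("down_proj")]
            for i in range(tensor.shape[0]):
                out[f"{prefix}{i}.down_proj.weight"] = tensor[i]
        else:
            # Pass non-fused tensors through unchanged so nothing is dropped.
            out[name] = tensor
    return out
```

Once the fused tensors are rewritten as ordinary per-expert `*.weight` entries, the existing model-free PTQ path can quantize them without any MoE-specific handling.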