enable glm4_moe_lite quantization & generation #1321
base: main
Conversation
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
This PR enables quantization and generation support for the GLM-4.7-Flash (glm4_moe_lite) model by implementing a custom MoE module replacement and adding comprehensive test coverage.
Changes:
- Implemented `LinearGlm4MoeLiteMoE` replacement module for calibration and quantization of the GLM-4 MoE architecture
- Added test fixtures and test cases for glm4_moe_lite in both CPU and CUDA test suites
- Registered the new module type in the replacement modules registry
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| auto_round/modelling/glm4_moe_lite.py | New module implementing MoE layer replacement for GLM-4 model quantization |
| auto_round/modelling/replace_modules.py | Registered Glm4MoeLiteMoE in the module replacement registry |
| test/test_cuda/models/test_moe_model.py | Added glm4_moe_lite test fixture and comprehensive test including VLLM integration |
| test/test_cpu/models/test_moe_model.py | Added glm4_moe_lite test fixture and CPU-specific quantization test |
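For orientation, a minimal sketch of how the pieces in the table above could fit together is shown below. This is not the PR's actual code: the class body, attribute names, and the registry shape are assumptions made for illustration.

```python
# Illustrative sketch only: a replacement block that keeps the router and shared
# path from the original HF module and exposes the routed experts so calibration
# hooks can observe each expert's Linear layers individually.
import torch.nn as nn


class LinearGlm4MoeLiteMoE(nn.Module):
    """Hypothetical stand-in for the HF Glm4MoeLiteMoE block during quantization."""

    def __init__(self, original: nn.Module, config, calibrate_all_experts: bool = False):
        super().__init__()
        self.config = config
        self.calibrate_all_experts = calibrate_all_experts
        # Reuse the routing gate and the dense shared-expert path unchanged.
        self.gate = original.gate
        self.shared_experts = original.shared_experts
        # The routed experts: the real module rebuilds these as fresh MLPs and
        # copies weights over (see the diff below); reusing them directly is the
        # simplest stand-in for this sketch.
        self.experts = original.experts


# replace_modules.py then maps the original class name to this replacement,
# e.g. something along the lines of (registry name assumed):
# REPLACEMENT_MODULES["Glm4MoeLiteMoE"] = LinearGlm4MoeLiteMoE
```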
```python
        self.shared_experts = original.shared_experts

    def forward(self, hidden_states):
        residuals = hidden_states
```
Copilot AI (Jan 22, 2026):
Trailing whitespace on this line should be removed to maintain consistent code formatting.
```python
        for expert_idx, expert in enumerate(self.experts):
            mask = expert_mask[expert_idx]
            token_indices, weight_indices = torch.where(mask)
            has_tokens = token_indices.numel() > 0
```
Copilot AI (Jan 22, 2026):
Trailing whitespace on this line should be removed to maintain consistent code formatting.
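Both whitespace comments land inside the expert-routing loop. For context, that loop expanded into a full forward method might read roughly as below; this is a hedged sketch, where the gate's return values, the expert_mask construction, and the calibrate_all_experts behavior are assumptions based on common MoE calibration patterns, not the PR's exact code.

```python
import torch
import torch.nn.functional as F


def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # Flatten (batch, seq, hidden) -> (tokens, hidden) so experts see 2-D inputs.
    orig_shape = hidden_states.shape
    hidden_states = hidden_states.reshape(-1, orig_shape[-1])
    residuals = hidden_states

    # Router output: per-token top-k expert ids and mixing weights (shapes assumed).
    topk_weights, topk_ids = self.gate(hidden_states)
    # expert_mask[e, t, k] is 1 iff token t assigns its k-th routing slot to expert e.
    expert_mask = F.one_hot(topk_ids, num_classes=len(self.experts)).permute(2, 0, 1)

    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(self.experts):
        mask = expert_mask[expert_idx]
        token_indices, weight_indices = torch.where(mask)
        has_tokens = token_indices.numel() > 0

        if self.calibrate_all_experts:
            # Run the expert on every token so calibration hooks always observe
            # activations, then keep only the rows actually routed to this expert.
            expert_out = expert(hidden_states)[token_indices]
        elif has_tokens:
            expert_out = expert(hidden_states[token_indices])
        else:
            continue

        out[token_indices] += expert_out * topk_weights[token_indices, weight_indices].unsqueeze(-1)

    out = out + self.shared_experts(residuals)
    return out.reshape(orig_shape)
```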
```python
            _update_parameter(self[i].up_proj, "weight", up_proj)
            _update_parameter(self[i].down_proj, "weight", down)
```
Copilot AI (Jan 22, 2026):
Missing space after comma in slice notation. Should be [:intermediate_size, :] and [intermediate_size:, :] for consistency with PEP 8.
Suggested change:
```python
            gate_proj = gate_up[:intermediate_size, :]
            up_proj = gate_up[intermediate_size:, :]
```
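The suggestion above concerns the fused gate/up weight being sliced into two halves. As a standalone illustration of that split (the sizes and the fused layout are assumptions, not taken from the PR), it amounts to:

```python
import torch
import torch.nn as nn

hidden_size, intermediate_size = 64, 128

# The original expert stores gate and up projections fused along dim 0: (2*I, H).
fused_gate_up = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)

gate_up = fused_gate_up.weight.data  # shape: (2 * intermediate_size, hidden_size)
with torch.no_grad():
    gate_proj.weight.copy_(gate_up[:intermediate_size, :])  # first half -> gate
    up_proj.weight.copy_(gate_up[intermediate_size:, :])    # second half -> up
```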
```python
@pytest.fixture
def setup_glm4_moe_lite():
    """Fixture to set up the glm4_moe_lite model and tokenizer."""
    model_name = "/dataset/GLM-4.7-Flash/"
```
Copilot AI (Jan 22, 2026):
This hardcoded absolute path differs from the CPU test fixture, which uses get_model_path(). Consider using the same pattern for consistency: model_name = get_model_path("zai-org/GLM-4.7-Flash"), as seen in the CPU test file.
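A rough sketch of the fixture style the comment is pointing at, assuming the same get_model_path helper that the CPU test file uses (its import location is not visible in this diff, so a placeholder is defined here):

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_model_path(name: str) -> str:
    # Placeholder for the helper used in test_cpu/models/test_moe_model.py;
    # the real helper may resolve a local mirror of the Hugging Face model id.
    return name


@pytest.fixture
def setup_glm4_moe_lite():
    """Fixture to set up the glm4_moe_lite model and tokenizer."""
    model_name = get_model_path("zai-org/GLM-4.7-Flash")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
    return model, tokenizer
```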
```python
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            # if "France" in prompt:
```
Copilot AI (Jan 22, 2026):
This commented-out code should be removed as it appears to be debug code that is no longer needed.
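Once the dead code is dropped, the loop could simply print and sanity-check the vLLM outputs. One possible shape (the exact assertion used by the test may differ):

```python
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
    # Minimal sanity check instead of the commented-out prompt filter.
    assert generated_text.strip(), "model produced an empty completion"
```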
```python
        config: "Glm4MoeLiteConfig",
        calibrate_all_experts: bool = False,
    ):
        super().__init__()
```
Hi @WeiweiZhang1, just a heads‑up that https://github.com/intel/auto-round/pull/1307/changes has been merged.
Please refer to the new https://github.com/intel/auto-round/blob/main/auto_round/modelling/qwen3_vl_moe.py as an example.
To adapt to this change, we need to:
- Pass `original` to `ReplacementModuleBase`.
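Concretely, the reviewer's first point would change the constructor to hand the original module to the base class rather than calling a bare super().__init__(). A hedged sketch follows; ReplacementModuleBase and its signature are taken from the comment and the qwen3_vl_moe.py example, not verified here.

```python
class LinearGlm4MoeLiteMoE(ReplacementModuleBase):  # base class from the review comment above
    def __init__(
        self,
        original: "Glm4MoeLiteMoE",
        config: "Glm4MoeLiteConfig",
        calibrate_all_experts: bool = False,
    ):
        # Pass `original` up so the base class can own meta-device handling and
        # decide when _materialize_weights should be invoked.
        super().__init__(original)
        self.config = config
        self.calibrate_all_experts = calibrate_all_experts
```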
```python
        with torch.device(target_device):
            super().__init__([Glm4MoeLiteMLP(config, intermediate_size) for _ in range(self.num_experts)])

        if not unsupported_meta_device(original):
```
- Move this part explicitly into the `_materialize_weights` function.
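Sketched out, moving the weight copy into that hook could look roughly like the following; the method name comes from the review comment, while the config field and the fused gate_up_proj attribute name are assumptions made for illustration.

```python
def _materialize_weights(self, original: "Glm4MoeLiteMoE") -> None:
    """Copy real weights from `original` into the rebuilt experts.

    Keeping this out of __init__ lets meta-device models stay lazy until the
    base class knows the source weights are actually available.
    """
    intermediate_size = self.config.moe_intermediate_size  # field name assumed
    for i, original_expert in enumerate(original.experts):
        gate_up = original_expert.gate_up_proj.weight  # fused [gate; up] weight, name assumed
        down = original_expert.down_proj.weight

        gate_proj = gate_up[:intermediate_size, :]
        up_proj = gate_up[intermediate_size:, :]

        _update_parameter(self[i].gate_proj, "weight", gate_proj)
        _update_parameter(self[i].up_proj, "weight", up_proj)
        _update_parameter(self[i].down_proj, "weight", down)
```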
```python
            gate_proj = gate_up[:intermediate_size, :]
            up_proj = gate_up[intermediate_size:, :]

            _update_parameter(self[i].gate_proj, "weight", gate_proj)
```
- Replace `_update_parameter` with `from auto_round.modelling.utils import _update_parameter`.
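The last point is just a matter of importing the shared helper instead of keeping a local copy; the import path is quoted from the comment above.

```python
from auto_round.modelling.utils import _update_parameter

# ... the call sites inside _materialize_weights stay exactly the same:
# _update_parameter(self[i].gate_proj, "weight", gate_proj)
# _update_parameter(self[i].up_proj, "weight", up_proj)
# _update_parameter(self[i].down_proj, "weight", down)
```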
Description
Type of Change
Related Issues
Fixes #
Relates to #
Changes Made
Testing
Checklist
Additional Context