enable glm4_moe_lite quantization & generation #1321
base: main
Conversation
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
This PR enables quantization and generation support for the GLM-4.7-Flash (glm4_moe_lite) model by implementing a custom MoE module replacement and adding comprehensive test coverage.
Changes:
- Implemented `LinearGlm4MoeLiteMoE` replacement module for calibration and quantization of the GLM-4 MoE architecture
- Added test fixtures and test cases for glm4_moe_lite in both CPU and CUDA test suites
- Registered the new module type in the replacement modules registry
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| auto_round/modelling/glm4_moe_lite.py | New module implementing MoE layer replacement for GLM-4 model quantization |
| auto_round/modelling/replace_modules.py | Registered Glm4MoeLiteMoE in the module replacement registry |
| test/test_cuda/models/test_moe_model.py | Added glm4_moe_lite test fixture and comprehensive test including VLLM integration |
| test/test_cpu/models/test_moe_model.py | Added glm4_moe_lite test fixture and CPU-specific quantization test |
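For orientation, a minimal sketch of how the pieces in the table above could fit together is shown below. This is not the PR's actual code: the class body, attribute names, and the registry shape are assumptions made for illustration.

```python
# Illustrative sketch only: a replacement block that keeps the router and shared
# path from the original HF module and exposes the routed experts so calibration
# hooks can observe each expert's Linear layers individually.
import torch.nn as nn


class LinearGlm4MoeLiteMoE(nn.Module):
    """Hypothetical stand-in for the HF Glm4MoeLiteMoE block during quantization."""

    def __init__(self, original: nn.Module, config, calibrate_all_experts: bool = False):
        super().__init__()
        self.config = config
        self.calibrate_all_experts = calibrate_all_experts
        # Reuse the routing gate and the dense shared-expert path unchanged.
        self.gate = original.gate
        self.shared_experts = original.shared_experts
        # The routed experts: the real module rebuilds these as fresh MLPs and
        # copies weights over (see the diff below); reusing them directly is the
        # simplest stand-in for this sketch.
        self.experts = original.experts


# replace_modules.py then maps the original class name to this replacement,
# e.g. something along the lines of (registry name assumed):
# REPLACEMENT_MODULES["Glm4MoeLiteMoE"] = LinearGlm4MoeLiteMoE
```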
```python
        self.shared_experts = original.shared_experts

    def forward(self, hidden_states):
        residuals = hidden_states
```
Copilot AI (Jan 22, 2026):
Trailing whitespace on this line should be removed to maintain consistent code formatting.
```python
        for expert_idx, expert in enumerate(self.experts):
            mask = expert_mask[expert_idx]
            token_indices, weight_indices = torch.where(mask)
            has_tokens = token_indices.numel() > 0
```
Copilot AI (Jan 22, 2026):
Trailing whitespace on this line should be removed to maintain consistent code formatting.
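Both whitespace comments land inside the expert-routing loop. For context, that loop expanded into a full forward method might read roughly as below; this is a hedged sketch, where the gate's return values, the expert_mask construction, and the calibrate_all_experts behavior are assumptions based on common MoE calibration patterns, not the PR's exact code.

```python
import torch
import torch.nn.functional as F


def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # Flatten (batch, seq, hidden) -> (tokens, hidden) so experts see 2-D inputs.
    orig_shape = hidden_states.shape
    hidden_states = hidden_states.reshape(-1, orig_shape[-1])
    residuals = hidden_states

    # Router output: per-token top-k expert ids and mixing weights (shapes assumed).
    topk_weights, topk_ids = self.gate(hidden_states)
    # expert_mask[e, t, k] is 1 iff token t assigns its k-th routing slot to expert e.
    expert_mask = F.one_hot(topk_ids, num_classes=len(self.experts)).permute(2, 0, 1)

    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(self.experts):
        mask = expert_mask[expert_idx]
        token_indices, weight_indices = torch.where(mask)
        has_tokens = token_indices.numel() > 0

        if self.calibrate_all_experts:
            # Run the expert on every token so calibration hooks always observe
            # activations, then keep only the rows actually routed to this expert.
            expert_out = expert(hidden_states)[token_indices]
        elif has_tokens:
            expert_out = expert(hidden_states[token_indices])
        else:
            continue

        out[token_indices] += expert_out * topk_weights[token_indices, weight_indices].unsqueeze(-1)

    out = out + self.shared_experts(residuals)
    return out.reshape(orig_shape)
```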
```python
            _update_parameter(self[i].up_proj, "weight", up_proj)
            _update_parameter(self[i].down_proj, "weight", down)
```
Copilot AI (Jan 22, 2026):
Missing space after comma in slice notation. Should be [:intermediate_size, :] and [intermediate_size:, :] for consistency with PEP 8.
Suggested change:
```python
            gate_proj = gate_up[:intermediate_size, :]
            up_proj = gate_up[intermediate_size:, :]
```
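The suggestion above concerns the fused gate/up weight being sliced into two halves. As a standalone illustration of that split (the sizes and the fused layout are assumptions, not taken from the PR), it amounts to:

```python
import torch
import torch.nn as nn

hidden_size, intermediate_size = 64, 128

# The original expert stores gate and up projections fused along dim 0: (2*I, H).
fused_gate_up = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)

gate_up = fused_gate_up.weight.data  # shape: (2 * intermediate_size, hidden_size)
with torch.no_grad():
    gate_proj.weight.copy_(gate_up[:intermediate_size, :])  # first half -> gate
    up_proj.weight.copy_(gate_up[intermediate_size:, :])    # second half -> up
```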
```python
@pytest.fixture
def setup_glm4_moe_lite():
    """Fixture to set up the glm4_moe_lite model and tokenizer."""
    model_name = "/dataset/GLM-4.7-Flash/"
```
Copilot AI (Jan 22, 2026):
This hardcoded absolute path differs from the CPU test fixture, which uses get_model_path(). Consider using the same pattern for consistency: model_name = get_model_path("zai-org/GLM-4.7-Flash"), as seen in the CPU test file.
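A rough sketch of the fixture style the comment is pointing at, assuming the same get_model_path helper that the CPU test file uses (its import location is not visible in this diff, so a placeholder is defined here):

```python
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_model_path(name: str) -> str:
    # Placeholder for the helper used in test_cpu/models/test_moe_model.py;
    # the real helper may resolve a local mirror of the Hugging Face model id.
    return name


@pytest.fixture
def setup_glm4_moe_lite():
    """Fixture to set up the glm4_moe_lite model and tokenizer."""
    model_name = get_model_path("zai-org/GLM-4.7-Flash")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
    return model, tokenizer
```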
```python
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            # if "France" in prompt:
```
Copilot AI (Jan 22, 2026):
This commented-out code should be removed as it appears to be debug code that is no longer needed.
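Once the dead code is dropped, the loop could simply print and sanity-check the vLLM outputs. One possible shape (the exact assertion used by the test may differ):

```python
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
    # Minimal sanity check instead of the commented-out prompt filter.
    assert generated_text.strip(), "model produced an empty completion"
```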
```python
        config: "Glm4MoeLiteConfig",
        calibrate_all_experts: bool = False,
    ):
        super().__init__()
```
Hi @WeiweiZhang1, just a heads‑up that https://github.com/intel/auto-round/pull/1307/changes has been merged.
Please refer to the new https://github.com/intel/auto-round/blob/main/auto_round/modelling/qwen3_vl_moe.py as an example.
To adapt to this change, we need to:
- Pass `original` to `ReplacementModuleBase`.
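Concretely, the reviewer's first point would change the constructor to hand the original module to the base class rather than calling a bare super().__init__(). A hedged sketch follows; ReplacementModuleBase and its signature are taken from the comment and the qwen3_vl_moe.py example, not verified here.

```python
class LinearGlm4MoeLiteMoE(ReplacementModuleBase):  # base class from the review comment above
    def __init__(
        self,
        original: "Glm4MoeLiteMoE",
        config: "Glm4MoeLiteConfig",
        calibrate_all_experts: bool = False,
    ):
        # Pass `original` up so the base class can own meta-device handling and
        # decide when _materialize_weights should be invoked.
        super().__init__(original)
        self.config = config
        self.calibrate_all_experts = calibrate_all_experts
```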
```python
        with torch.device(target_device):
            super().__init__([Glm4MoeLiteMLP(config, intermediate_size) for _ in range(self.num_experts)])

        if not unsupported_meta_device(original):
```
- Move this part explicitly into the `_materialize_weights` function.
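Sketched out, moving the weight copy into that hook could look roughly like the following; the method name comes from the review comment, while the config field and the fused gate_up_proj attribute name are assumptions made for illustration.

```python
def _materialize_weights(self, original: "Glm4MoeLiteMoE") -> None:
    """Copy real weights from `original` into the rebuilt experts.

    Keeping this out of __init__ lets meta-device models stay lazy until the
    base class knows the source weights are actually available.
    """
    intermediate_size = self.config.moe_intermediate_size  # field name assumed
    for i, original_expert in enumerate(original.experts):
        gate_up = original_expert.gate_up_proj.weight  # fused [gate; up] weight, name assumed
        down = original_expert.down_proj.weight

        gate_proj = gate_up[:intermediate_size, :]
        up_proj = gate_up[intermediate_size:, :]

        _update_parameter(self[i].gate_proj, "weight", gate_proj)
        _update_parameter(self[i].up_proj, "weight", up_proj)
        _update_parameter(self[i].down_proj, "weight", down)
```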
```python
            gate_proj = gate_up[:intermediate_size, :]
            up_proj = gate_up[intermediate_size:, :]

            _update_parameter(self[i].gate_proj, "weight", gate_proj)
```
- Replace `_update_parameter` with `from auto_round.modelling.utils import _update_parameter`.
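The last point is just a matter of importing the shared helper instead of keeping a local copy; the import path is quoted from the comment above.

```python
from auto_round.modelling.utils import _update_parameter

# ... the call sites inside _materialize_weights stay exactly the same:
# _update_parameter(self[i].gate_proj, "weight", gate_proj)
# _update_parameter(self[i].up_proj, "weight", up_proj)
# _update_parameter(self[i].down_proj, "weight", down)
```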
Description
Type of Change
Related Issues
Fixes #
Relates to #
Changes Made
Testing
Checklist
Additional Context