[Xlite][Feature] Xlite support glm4.7 w8a8 quantization #9415

Open
wwwumr wants to merge 1 commit into vllm-project:main from wwwumr:glm4.7

@wwwumr wwwumr commented May 21, 2026

What this PR does / why we need it?

This PR adds w8a8 quantization support for the GLM-4.7 model to the xlite module. Specifically, it adds weight processing for shared experts and refactors how xlite handles w8a8 weights, which makes it easier to adapt further models.

GLM-4.7-W8A8-floatmtp TPS 910B3(A2) Online Inference Performance Comparison

  • Report Generated Time: 2026-05-22 01:44:06
  • test cases: 8K input & 1K output
  • diff: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | OutputSpeed (token/s) |
|---|---|---|---|---|---|---|---|
| 1 | baseline-aclgraph | 701.32 | 892.20 | 55.25 | 56.19 | 0.02 | 17.91 |
| 1 | xlite-decode-only | 707.03 | 945.13 | 31.17 | 31.73 | 0.03 | 31.45 |
| 1 | diff | 0.81% | 5.93% | -43.58% | -43.53% | 50.00% | 75.60% |
| 16 | baseline-aclgraph | 42040.55 | 72478.46 | 66.36 | 87.56 | 0.14 | 143.31 |
| 16 | xlite-decode-only | 34304.14 | 53305.69 | 48.39 | 60.31 | 0.18 | 188.94 |
| 16 | diff | -18.40% | -26.45% | -27.08% | -31.12% | 28.57% | 31.84% |
| 32 | baseline-aclgraph | 149544.22 | 184509.32 | 66.27 | 82.37 | 0.14 | 144.38 |
| 32 | xlite-decode-only | 116714.20 | 140350.03 | 48.92 | 60.11 | 0.18 | 189.10 |
| 32 | diff | -21.95% | -23.93% | -26.18% | -27.02% | 28.57% | 30.97% |
| 48 | baseline-aclgraph | 255018.51 | 292790.10 | 65.86 | 80.32 | 0.14 | 145.36 |
| 48 | xlite-decode-only | 199491.20 | 226679.92 | 49.20 | 59.04 | 0.18 | 188.50 |
| 48 | diff | -21.77% | -22.58% | -25.30% | -26.49% | 28.57% | 29.68% |
| 64 | baseline-aclgraph | 361231.53 | 408599.45 | 66.33 | 84.19 | 0.14 | 145.39 |
| 64 | xlite-decode-only | 278208.69 | 313057.67 | 49.06 | 61.96 | 0.19 | 190.18 |
| 64 | diff | -22.98% | -23.38% | -26.04% | -26.40% | 35.71% | 30.81% |
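
The report does not state how the diff rows are computed. The figures are consistent with a plain relative change of xlite-decode-only against baseline-aclgraph, applied to the rounded values shown in the table. A minimal sketch under that assumption (`pct_diff` is a hypothetical helper name, not part of the PR):

```python
def pct_diff(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the maxconcurrency = 1 rows of the comparison table.
tpot_avg_diff = pct_diff(55.25, 31.17)       # TPOT Avg (ms), lower is better
output_speed_diff = pct_diff(17.91, 31.45)   # OutputSpeed (token/s), higher is better

print(f"TPOT Avg diff:    {tpot_avg_diff:.2f}%")
print(f"OutputSpeed diff: {output_speed_diff:.2f}%")
```

For latency columns (TTFT, TPOT) a negative diff is an improvement; for QPS and OutputSpeed a positive diff is. Because QPS is reported with only two decimals, its diff values are coarse (for example, 0.14 to 0.18 gives 28.57% at every concurrency level from 16 to 48).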

Does this PR introduce any user-facing change?

Yes, it enables Xlite acceleration for the GLM-4.7 model in w8a8 quantization mode on Ascend NPUs.

How was this patch tested?

@wwwumr wwwumr requested a review from wangxiyuan as a code owner May 21, 2026 07:20
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@wwwumr wwwumr force-pushed the glm4.7 branch 2 times, most recently from a6b71d8 to 67eb1a9 Compare May 21, 2026 07:21
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables w8a8 quantization support for the GLM-4.7 model within the xlite acceleration framework for Ascend NPUs. The changes focus on refactoring the weight initialization logic into a reusable helper method, which simplifies the integration of new model architectures and improves maintainability of the quantization pipelines.
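
The diff itself is not shown in this conversation, so the helper's real signature is unknown. As a rough illustration of the idea of one helper serving both quantized and unquantized layers, here is a sketch in which the function name `init_matmul_weights_sketch` and the attribute names `weight`, `weight_scale`, and `weight_offset` are all assumptions:

```python
from types import SimpleNamespace


def init_matmul_weights_sketch(layer, quantized):
    """Collect the tensors one matmul needs, for either weight format.

    Hypothetical stand-in: the attribute names used here are assumptions,
    not the fields the PR actually reads.
    """
    weights = {"weight": layer.weight}
    if quantized:
        # A w8a8 layer carries quantization parameters next to its int8 weight.
        weights["weight_scale"] = layer.weight_scale
        weights["weight_offset"] = getattr(layer, "weight_offset", None)
    return weights


# Plain lists stand in for tensors so the sketch runs without an NPU.
fp_layer = SimpleNamespace(weight=[[0.1, 0.2]])
q_layer = SimpleNamespace(weight=[[13, 26]], weight_scale=[0.0077])

print(init_matmul_weights_sketch(fp_layer, quantized=False))
print(init_matmul_weights_sketch(q_layer, quantized=True))
```

Routing every projection through one such function means a new model architecture only has to say where its weights live, rather than repeating the quantized and unquantized branches for each layer type.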

Highlights

  • GLM-4.7 W8A8 Quantization Support: Added support for w8a8 quantization of the GLM-4.7 model to the xlite module, enabling acceleration on Ascend NPUs.
  • Weight Processing Refactoring: Introduced a centralized init_matmul_weights method to handle both quantized and unquantized weight initialization, significantly reducing code duplication.
  • Utility Improvements: Added rgetattr to vllm_ascend/xlite/utils.py to support robust recursive attribute access for model parameters.
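
Recursive attribute access of the kind described in the last highlight is commonly written with `functools.reduce`. The version below is a generic sketch of that pattern, not the code added to vllm_ascend/xlite/utils.py; the optional default argument and the attribute path in the example are assumptions:

```python
from functools import reduce
from types import SimpleNamespace


def rgetattr(obj, path, *default):
    """Follow a dotted attribute path such as 'mlp.shared_experts.weight'.

    If any step is missing, return the default when one was given,
    otherwise re-raise the AttributeError.
    """
    try:
        return reduce(getattr, path.split("."), obj)
    except AttributeError:
        if default:
            return default[0]
        raise


# Hypothetical layer layout, used only to exercise the helper.
layer = SimpleNamespace(
    mlp=SimpleNamespace(shared_experts=SimpleNamespace(weight=[1, 2, 3]))
)

print(rgetattr(layer, "mlp.shared_experts.weight"))
print(rgetattr(layer, "mlp.missing.weight", None))
```

A dotted path lets the caller describe a parameter's location as data (one string per model family) instead of hard-coding a chain of attribute accesses.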

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors weight initialization in the xlite module by introducing the init_matmul_weights helper and rgetattr utility, which simplifies quantization parameter handling for Llama and Qwen MoE models. Feedback focuses on optimizing performance by eliminating redundant attribute lookups and function calls within list comprehensions using the walrus operator. Additionally, the reviewer recommends unifying the new attribute traversal utility with existing ones to reduce redundancy and suggests following the repository's specific style guide for the PR title and summary.
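
The review threads are collapsed here, so the exact lines the walrus suggestion targets are not visible. The general pattern looks like the following sketch, where `lookup` and the `store` dictionary are hypothetical stand-ins for a costly attribute lookup:

```python
def lookup(name, store):
    """Stand-in for a costly lookup; counts how often it is called."""
    store["calls"] += 1
    return store["values"].get(name)


store = {"calls": 0, "values": {"a": 1, "b": None, "c": 3}}
names = ["a", "b", "c"]

# Redundant form: the lookup runs once in the filter and again for each kept item.
twice = [lookup(n, store) for n in names if lookup(n, store) is not None]
calls_redundant = store["calls"]

# Walrus form (Python 3.8+): the filter binds the result and the output reuses it.
store["calls"] = 0
once = [v for n in names if (v := lookup(n, store)) is not None]
calls_walrus = store["calls"]

print(twice == once, calls_redundant, calls_walrus)
```

Both comprehensions produce the same list; the second simply avoids repeating the lookup for every element that passes the filter.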

Comment thread vllm_ascend/xlite/xlite.py
Comment thread vllm_ascend/xlite/xlite.py Outdated
Comment thread vllm_ascend/xlite/xlite.py Outdated
Comment thread vllm_ascend/xlite/xlite.py Outdated
Comment thread vllm_ascend/xlite/xlite.py Outdated
Comment thread vllm_ascend/xlite/xlite.py Outdated
Comment thread vllm_ascend/xlite/utils.py Outdated
@wwwumr wwwumr force-pushed the glm4.7 branch 4 times, most recently from 9f997f8 to 67e0cc1 Compare May 21, 2026 08:05
@wwwumr wwwumr changed the title [Feature] Xlite support glm4.7 w8a8 quantization [Xlite][Feature] Xlite support glm4.7 w8a8 quantization May 21, 2026
@wwwumr wwwumr force-pushed the glm4.7 branch 8 times, most recently from e83654f to 1722c77 Compare May 22, 2026 08:48
Signed-off-by: wangxiaoran <wangxiaoran11@huawei.com>