[Xlite][Feature] Xlite support glm4.7 w8a8 quantization by wwwumr · Pull Request #9415 · vllm-project/vllm-ascend

wwwumr · 2026-05-21T07:20:06Z

What this PR does / why we need it?

This PR adds w8a8 quantization support for the GLM-4.7 model to the xlite module. Specifically, it introduces shared-expert weight processing and refactors xlite's w8a8 weight handling, which should make adapting further models easier.

GLM-4.7-W8A8-floatmtp TPS 910B3(A2) Online Inference Performance Comparison

Report Generated Time: 2026-05-22 01:44:06
test cases: 8K input & 1K output
diff: relative change of xlite-decode-only versus baseline-aclgraph

maxconcurrency	item	TTFT Avg (ms)	TTFT P99 (ms)	TPOT Avg (ms)	TPOT P99 (ms)	QPS (req/s)	OutputSpeed (token/s)
1	baseline-aclgraph	701.32	892.20	55.25	56.19	0.02	17.91
1	xlite-decode-only	707.03	945.13	31.17	31.73	0.03	31.45
1	diff	0.81%	5.93%	-43.58%	-43.53%	50.00%	75.60%

16	baseline-aclgraph	42040.55	72478.46	66.36	87.56	0.14	143.31
16	xlite-decode-only	34304.14	53305.69	48.39	60.31	0.18	188.94
16	diff	-18.40%	-26.45%	-27.08%	-31.12%	28.57%	31.84%

32	baseline-aclgraph	149544.22	184509.32	66.27	82.37	0.14	144.38
32	xlite-decode-only	116714.20	140350.03	48.92	60.11	0.18	189.10
32	diff	-21.95%	-23.93%	-26.18%	-27.02%	28.57%	30.97%

48	baseline-aclgraph	255018.51	292790.10	65.86	80.32	0.14	145.36
48	xlite-decode-only	199491.20	226679.92	49.20	59.04	0.18	188.50
48	diff	-21.77%	-22.58%	-25.30%	-26.49%	28.57%	29.68%

64	baseline-aclgraph	361231.53	408599.45	66.33	84.19	0.14	145.39
64	xlite-decode-only	278208.69	313057.67	49.06	61.96	0.19	190.18
64	diff	-22.98%	-23.38%	-26.04%	-26.40%	35.71%	30.81%
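The diff rows appear to be the relative change of xlite-decode-only against baseline-aclgraph. The following is a minimal sketch under that assumption; the function name `pct_diff` is illustrative and not part of the PR. With the concurrency 1 figures from the table it reproduces the reported values, where a negative number means lower latency and a positive number means higher throughput.

```python
def pct_diff(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Concurrency 1 figures copied from the table: (baseline-aclgraph, xlite-decode-only).
rows = {
    "TTFT avg (ms)": (701.32, 707.03),
    "TPOT avg (ms)": (55.25, 31.17),
    "OutputSpeed (token/s)": (17.91, 31.45),
}

# Round to two decimals, matching the precision used in the report.
diffs = {name: round(pct_diff(base, cand), 2) for name, (base, cand) in rows.items()}

for name, value in diffs.items():
    print(f"{name}: {value:+.2f}%")
```
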

Does this PR introduce any user-facing change?

Yes, it enables Xlite acceleration for the GLM-4.7 model in w8a8 quantization mode on Ascend NPUs.

How was this patch tested?

vLLM version: v0.19.1
vLLM main: vllm-project/vllm@b1388b1

github-actions · 2026-05-21T07:20:35Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist · 2026-05-21T07:21:10Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables w8a8 quantization support for the GLM-4.7 model within the xlite acceleration framework for Ascend NPUs. The changes focus on refactoring the weight initialization logic into a reusable helper method, which simplifies the integration of new model architectures and improves maintainability of the quantization pipelines.

Highlights

GLM-4.7 W8A8 Quantization Support: Added support for w8a8 quantization of the GLM-4.7 model to the xlite module, enabling acceleration on Ascend NPUs.
Weight Processing Refactoring: Introduced a centralized init_matmul_weights method to handle both quantized and unquantized weight initialization, significantly reducing code duplication.
Utility Improvements: Added rgetattr to vllm_ascend/xlite/utils.py to support robust recursive attribute access for model parameters.
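Recursive attribute access of the kind described for `rgetattr` can be sketched as follows. This is not the implementation in `vllm_ascend/xlite/utils.py`, which is not shown here; the signature, the `default` parameter, and the attribute path `mlp.shared_experts.weight` are assumptions made for illustration.

```python
from functools import reduce
from types import SimpleNamespace

# Sentinel so that None can be passed as a legitimate default.
_MISSING = object()


def rgetattr(obj, path, default=_MISSING):
    """Follow a dotted attribute path, e.g. 'mlp.shared_experts.weight'."""
    try:
        # Apply getattr once per path segment, left to right.
        return reduce(getattr, path.split("."), obj)
    except AttributeError:
        if default is _MISSING:
            raise
        return default


# Hypothetical nested module structure standing in for a model layer.
layer = SimpleNamespace(
    mlp=SimpleNamespace(shared_experts=SimpleNamespace(weight=[1, 2, 3]))
)

found = rgetattr(layer, "mlp.shared_experts.weight")
missing = rgetattr(layer, "mlp.absent.weight", None)
```

A helper like this lets one code path read parameters that sit at different depths in different model architectures, falling back to a default when a model lacks the attribute.
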

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors weight initialization in the xlite module by introducing the init_matmul_weights helper and rgetattr utility, which simplifies quantization parameter handling for Llama and Qwen MoE models. Feedback focuses on optimizing performance by eliminating redundant attribute lookups and function calls within list comprehensions using the walrus operator. Additionally, the reviewer recommends unifying the new attribute traversal utility with existing ones to reduce redundancy and suggests following the repository's specific style guide for the PR title and summary.
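The walrus-operator suggestion can be illustrated with a small, self-contained sketch. The helper `find_scale` and the attribute `weight_scale` are hypothetical stand-ins, not names taken from the PR's diff; the point is only that binding the lookup result inside the comprehension avoids evaluating it twice per element.

```python
from types import SimpleNamespace


def find_scale(layer):
    """Hypothetical lookup: the layer's scale, or None if it has none."""
    return getattr(layer, "weight_scale", None)


layers = [
    SimpleNamespace(weight_scale=0.5),
    SimpleNamespace(),  # no scale attribute
    SimpleNamespace(weight_scale=0.25),
]

# Redundant form: find_scale runs twice for every layer that passes the filter.
scales_twice = [find_scale(layer) for layer in layers if find_scale(layer) is not None]

# Walrus form (Python 3.8+): the result is bound once and reused.
scales_once = [s for layer in layers if (s := find_scale(layer)) is not None]
```

Both comprehensions produce the same list; the second performs one lookup per layer.
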

Signed-off-by: wangxiaoran <wangxiaoran11@huawei.com>

wwwumr requested a review from wangxiyuan as a code owner May 21, 2026 07:20

wwwumr force-pushed the glm4.7 branch 2 times, most recently from a6b71d8 to 67eb1a9 May 21, 2026 07:21

gemini-code-assist Bot reviewed May 21, 2026

wwwumr force-pushed the glm4.7 branch 4 times, most recently from 9f997f8 to 67e0cc1 May 21, 2026 08:05

wwwumr changed the title ~~[Feature] Xlite support glm4.7 w8a8 quantization~~ [Xlite][Feature] Xlite support glm4.7 w8a8 quantization May 21, 2026

wwwumr force-pushed the glm4.7 branch 8 times, most recently from e83654f to 1722c77 May 22, 2026 08:48

[Xlite][Feature] Xlite support glm4.7 w8a8 quantization

1c851b8

Signed-off-by: wangxiaoran <wangxiaoran11@huawei.com>

wwwumr force-pushed the glm4.7 branch from 1722c77 to 1c851b8 May 22, 2026 08:53

wwwumr wants to merge 1 commit into vllm-project:main from wwwumr:glm4.7
