[Misc] Remove patch files related to rotary quantization, and register new mtp/eagle3 model classes by wangbj127 · Pull Request #10896 · vllm-project/vllm-ascend

wangbj127 · 2026-06-24T09:26:48Z

What this PR does / why we need it?

patch_deepseek_mtp.py and patch_draft_quarot.py are deprecated. Instead, two new model classes (AscendDeepSeekMTP and AscendEagle3LlamaForCausalLM) are registered.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Qwen3-32B-W8A8-QuaRot + Eagle3 acceptance rate

(APIServer pid=3191163) INFO 06-27 09:35:37 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 0.47 tokens/s, Drafted throughput: 0.91 tokens/s, Accepted: 86 tokens, Drafted: 165 tokens, Per-position acceptance rate: 0.764, 0.527, 0.273, Avg Draft acceptance rate: 52.1%
(APIServer pid=3191163) INFO 06-27 09:35:47 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.33, Accepted throughput: 10.40 tokens/s, Drafted throughput: 23.40 tokens/s, Accepted: 104 tokens, Drafted: 234 tokens, Per-position acceptance rate: 0.679, 0.423, 0.231, Avg Draft acceptance rate: 44.4%
(APIServer pid=3191163) INFO 06-27 09:35:57 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.26, Accepted throughput: 9.70 tokens/s, Drafted throughput: 23.10 tokens/s, Accepted: 97 tokens, Drafted: 231 tokens, Per-position acceptance rate: 0.662, 0.429, 0.169, Avg Draft acceptance rate: 42.0%
(APIServer pid=3191163) INFO 06-27 09:36:07 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.13, Accepted throughput: 0.53 tokens/s, Drafted throughput: 0.75 tokens/s, Accepted: 32 tokens, Drafted: 45 tokens, Per-position acceptance rate: 0.867, 0.733, 0.533, Avg Draft acceptance rate: 71.1%

GLM5-W4A8-QuaRot + MTP acceptance rate

(APIServer pid=3204216) INFO 06-27 09:51:05 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.05, Accepted throughput: 4.30 tokens/s, Drafted throughput: 6.30 tokens/s, Accepted: 43 tokens, Drafted: 63 tokens, Per-position acceptance rate: 0.952, 0.762, 0.333, Avg Draft acceptance rate: 68.3%
(APIServer pid=3204216) INFO 06-27 09:51:15 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.86, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.60 tokens/s, Accepted: 41 tokens, Drafted: 66 tokens, Per-position acceptance rate: 0.773, 0.682, 0.409, Avg Draft acceptance rate: 62.1%
(APIServer pid=3204216) INFO 06-27 09:51:25 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.95, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.30 tokens/s, Accepted: 41 tokens, Drafted: 63 tokens, Per-position acceptance rate: 0.762, 0.619, 0.571, Avg Draft acceptance rate: 65.1%
(APIServer pid=3204216) INFO 06-27 09:51:35 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.82, Accepted throughput: 4.00 tokens/s, Drafted throughput: 6.60 tokens/s, Accepted: 40 tokens, Drafted: 66 tokens, Per-position acceptance rate: 0.818, 0.591, 0.409, Avg Draft acceptance rate: 60.6%
(APIServer pid=3204216) INFO 06-27 09:51:45 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.05, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.00 tokens/s, Accepted: 41 tokens, Drafted: 60 tokens, Per-position acceptance rate: 0.950, 0.650, 0.450, Avg Draft acceptance rate: 68.3%
(APIServer pid=3204216) INFO 06-27 09:51:55 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.89, Accepted throughput: 3.60 tokens/s, Drafted throughput: 5.70 tokens/s, Accepted: 36 tokens, Drafted: 57 tokens, Per-position acceptance rate: 0.789, 0.632, 0.474, Avg Draft acceptance rate: 63.2%
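
The fields in these log lines can be cross-checked against each other. The following is a minimal sketch, not part of the PR: `parse_spec_metrics` is a hypothetical helper, and the number of draft tokens per step is assumed to equal the number of per-position rates (three here). It recomputes the average draft acceptance rate and the mean acceptance length for the first Eagle3 line.

```python
import re

# First Eagle3 log line from the run above, trimmed to the fields used here.
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 2.56, "
    "Accepted: 86 tokens, Drafted: 165 tokens, "
    "Per-position acceptance rate: 0.764, 0.527, 0.273, "
    "Avg Draft acceptance rate: 52.1%"
)


def parse_spec_metrics(line: str) -> dict:
    """Hypothetical helper: pull the numeric fields out of one log line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    rates_text = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    rates = [float(r) for r in rates_text.split(", ")]
    return {"accepted": accepted, "drafted": drafted, "rates": rates}


m = parse_spec_metrics(LINE)
num_spec_tokens = len(m["rates"])            # assumed: one rate per draft position
num_drafts = m["drafted"] / num_spec_tokens  # 165 / 3 = 55 draft rounds
acceptance_rate = 100 * m["accepted"] / m["drafted"]
# Each round yields one verified token plus the accepted draft tokens.
mean_acceptance_length = 1 + m["accepted"] / num_drafts

print(f"{acceptance_rate:.1f}% {mean_acceptance_length:.2f}")  # prints "52.1% 2.56"
```

Under the same assumption, the mean acceptance length also equals one plus the sum of the per-position rates (1 + 0.764 + 0.527 + 0.273 ≈ 2.56), which is a quick way to sanity-check a single line by hand.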

vLLM version: v0.23.0
vLLM main: vllm-project/vllm@967c5c3

gemini-code-assist · 2026-06-24T09:27:00Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the rotary quantization support for DeepSeek MTP models by moving away from monkey-patching via a dedicated patch file. Instead, it implements cleaner, class-based overrides within the 'vllm_ascend' model directory. This change improves code modularity and reduces reliance on global state manipulation, aligning with better software engineering practices for model extensions.

Highlights

Refactoring Patch Implementation: Removed the monolithic 'patch_deepseek_mtp.py' file and migrated its logic into dedicated model classes within 'vllm_ascend/models/deepseek_mtp.py' for better maintainability.
Model Registration: Updated 'vllm_ascend/models/__init__.py' to register the new 'AscendDeepSeekMTP' and 'AscendGlmMoeDsaForCausalLM' classes.
Utility Improvements: Added a centralized 'is_rot_weight_used' utility function in 'vllm_ascend/utils.py' to handle configuration checks for rotary quantization consistently.
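
To illustrate the difference between registering a class and monkey-patching, here is a toy sketch of a registry that stores "module:Class" strings and imports them only on lookup. `ToyRegistry` and the architecture name are hypothetical, standard-library classes stand in for model classes, and this is not vLLM's `ModelRegistry` implementation.

```python
import importlib


class ToyRegistry:
    """Toy stand-in for a model registry; not vLLM's ModelRegistry."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def register_model(self, arch: str, target: str) -> None:
        # Store the "module:Class" string; nothing is imported yet.
        self._entries[arch] = target

    def resolve(self, arch: str) -> type:
        # Import lazily, only when the architecture is actually requested.
        module_name, _, class_name = self._entries[arch].partition(":")
        module = importlib.import_module(module_name)
        return getattr(module, class_name)


registry = ToyRegistry()
# Standard-library classes stand in for the upstream and overriding models.
registry.register_model("ExampleArch", "collections:OrderedDict")
registry.register_model("ExampleArch", "collections:Counter")  # later entry wins
resolved = registry.resolve("ExampleArch")
print(resolved.__name__)  # prints "Counter"
```

In this toy version the override is a single dictionary entry, so no existing class is mutated at import time, which is the property the refactoring is after.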

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

github-actions · 2026-06-24T09:28:05Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

gemini-code-assist

Code Review

This pull request refactors the codebase by removing the deprecated monkey-patching file patch_deepseek_mtp.py and instead registering the AscendDeepSeekMTP and AscendGlmMoeDsaForCausalLM models directly in the model registry. Feedback on this PR includes a suggestion to format the PR title and summary according to the repository style guide, as well as several robust weight-loading improvements: filtering out rot. weights when rotary weights are not used to prevent potential AttributeError or ValueError crashes, expanding the skip_prefixes in AscendGlmMoeDsaForCausalLM to handle fully qualified weight names, and ensuring is_rot_weight_used consistently returns a boolean value.

gemini-code-assist · 2026-06-24T09:28:44Z

    ModelRegistry.register_model("DeepseekV4ForCausalLM", "vllm_ascend.models.deepseek_v4:AscendDeepseekV4ForCausalLM")
-
    ModelRegistry.register_model("DeepSeekV4MTPModel", "vllm_ascend.models.deepseek_v4_mtp:DeepSeekV4MTP")
+    ModelRegistry.register_model("DeepSeekMTPModel", "vllm_ascend.models.deepseek_mtp:AscendDeepSeekMTP")


According to the Repository Style Guide, here is the suggested PR Title and PR Summary for this pull request:

Suggested PR Title:

[Ops][Misc] Delete patch files related to rotary quantization and register AscendDeepSeekMTP/AscendGlmMoeDsaForCausalLM models

Suggested PR Summary:

### What this PR does / why we need it?
This PR deletes the deprecated patch files related to rotary quantization (`patch_deepseek_mtp.py`) and registers the new `AscendDeepSeekMTP` and `AscendGlmMoeDsaForCausalLM` models directly in the model registry. This refactoring improves maintainability by replacing runtime monkey-patching with clean subclassing and proper weight loading/mapping.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

References

Format the PR Title and PR Summary according to the Repository Style Guide.

gemini-code-assist · 2026-06-24T09:28:44Z

+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        weights_mapper = WeightsMapper(
+            orig_to_new_prefix={
+                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
+            },
+        )
+        return super().load_weights(weights_mapper.apply(weights))


If self.is_rot_weight_used is False, self.rot is not defined. However, if the checkpoint contains rot.weight, super().load_weights will attempt to load it into self.rot, resulting in an AttributeError or ValueError. We should filter out any rot. weights when self.is_rot_weight_used is False to prevent potential crashes.

Suggested change

-    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-        weights_mapper = WeightsMapper(
-            orig_to_new_prefix={
-                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
-            },
-        )
-        return super().load_weights(weights_mapper.apply(weights))
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        if not self.is_rot_weight_used:
+            weights = ((name, tensor) for name, tensor in weights if not name.startswith("rot."))
+        weights_mapper = WeightsMapper(
+            orig_to_new_prefix={
+                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
+            },
+        )
+        return super().load_weights(weights_mapper.apply(weights))
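
The combined effect of the filter and the prefix remap can be sketched without vLLM. In this illustrative example, `filter_and_remap` is a hypothetical function, strings stand in for tensors, and `num_hidden_layers=61` is an arbitrary placeholder rather than a value taken from the PR.

```python
from collections.abc import Iterable, Iterator


def filter_and_remap(
    weights: Iterable[tuple[str, object]],
    rot_used: bool,
    num_hidden_layers: int,
) -> Iterator[tuple[str, object]]:
    """Hypothetical stand-in for the filter + prefix remap in load_weights."""
    new_prefix = f"model.layers.{num_hidden_layers}.rot."
    for name, tensor in weights:
        if name.startswith("rot."):
            if not rot_used:
                continue  # drop rotation weights the model has no module for
            name = new_prefix + name[len("rot."):]
        yield name, tensor


# Strings stand in for tensors; 61 is a placeholder layer count.
checkpoint = [("rot.weight", "R"), ("model.embed_tokens.weight", "E")]
kept_without_rot = [n for n, _ in filter_and_remap(checkpoint, False, 61)]
kept_with_rot = [n for n, _ in filter_and_remap(checkpoint, True, 61)]

print(kept_without_rot)  # ['model.embed_tokens.weight']
print(kept_with_rot)     # ['model.layers.61.rot.weight', 'model.embed_tokens.weight']
```

Keeping the filter lazy (a generator) matters when the checkpoint is streamed, since it avoids materializing every tensor before loading.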

gemini-code-assist · 2026-06-24T09:28:44Z

+class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        loader = AutoWeightsLoader(self, skip_prefixes=["rot."])
+        return loader.load_weights(weights)


If the checkpoint contains model.layers.{num_hidden_layers}.rot.weight, skip_prefixes=["rot."] will not match it because it starts with "model.layers.". This will cause AutoWeightsLoader to fail with a ValueError due to unexpected keys. We should add f"model.layers.{self.config.num_hidden_layers}.rot." to skip_prefixes to ensure all variations of the rot weight are skipped.

Suggested change

-class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
-    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-        loader = AutoWeightsLoader(self, skip_prefixes=["rot."])
-        return loader.load_weights(weights)
+class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        loader = AutoWeightsLoader(
+            self,
+            skip_prefixes=[
+                "rot.",
+                f"model.layers.{self.config.num_hidden_layers}.rot.",
+            ],
+        )
+        return loader.load_weights(weights)
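
The prefix mismatch described in this comment is easy to reproduce in isolation. The sketch below assumes that a weight is skipped when its full name starts with one of the listed prefixes; `is_skipped` is a hypothetical helper and the layer count is a placeholder, so the real `AutoWeightsLoader` matching may differ in detail.

```python
def is_skipped(name: str, skip_prefixes: list[str]) -> bool:
    """Assumed semantics: skip a weight whose name starts with any prefix."""
    return any(name.startswith(prefix) for prefix in skip_prefixes)


num_hidden_layers = 78  # placeholder value, not taken from the PR
qualified = f"model.layers.{num_hidden_layers}.rot.weight"

original = ["rot."]
expanded = ["rot.", f"model.layers.{num_hidden_layers}.rot."]

print(is_skipped("rot.weight", original))  # True
print(is_skipped(qualified, original))     # False: the name starts with "model.layers."
print(is_skipped(qualified, expanded))     # True
```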

gemini-code-assist · 2026-06-24T09:28:45Z

+def is_rot_weight_used(vllm_config: VllmConfig = None):
+    global _IS_ROT_WEIGHT_USED
+    if vllm_config is None:
+        from vllm.config import get_current_vllm_config_or_none
+
+        vllm_config = get_current_vllm_config_or_none()
+    if _IS_ROT_WEIGHT_USED is None and vllm_config is not None:
+        quant_description = getattr(vllm_config.quant_config, "quant_description", None)
+        _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
+    return _IS_ROT_WEIGHT_USED


If vllm_config is None (e.g., when called without config and get_current_vllm_config_or_none() also returns None), the function returns None instead of a boolean, which can cause falsy/truthy confusion or type errors. We should return False as a safe fallback.

Suggested change

-def is_rot_weight_used(vllm_config: VllmConfig = None):
-    global _IS_ROT_WEIGHT_USED
-    if vllm_config is None:
-        from vllm.config import get_current_vllm_config_or_none
-
-        vllm_config = get_current_vllm_config_or_none()
-    if _IS_ROT_WEIGHT_USED is None and vllm_config is not None:
-        quant_description = getattr(vllm_config.quant_config, "quant_description", None)
-        _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
-    return _IS_ROT_WEIGHT_USED
+def is_rot_weight_used(vllm_config: VllmConfig = None) -> bool:
+    global _IS_ROT_WEIGHT_USED
+    if vllm_config is None:
+        from vllm.config import get_current_vllm_config_or_none
+
+        vllm_config = get_current_vllm_config_or_none()
+    if _IS_ROT_WEIGHT_USED is None:
+        if vllm_config is not None:
+            quant_description = getattr(vllm_config.quant_config, "quant_description", None)
+            _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
+        else:
+            return False
+    return _IS_ROT_WEIGHT_USED

github-actions · 2026-06-24T09:29:43Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2026-06-26T10:38:08Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Wangbj127 <wangbj1207@126.com>

github-actions · 2026-06-27T06:22:57Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wangbj127 requested a review from wangxiyuan as a code owner June 24, 2026 09:26

github-actions Bot added the module:core label Jun 24, 2026

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

github-actions Bot added the merge-conflicts label Jun 24, 2026

wangbj127 changed the title ~~Delete patch files related to rotary quantization~~ [Refactor] Remove patch files related to rotary quantization, and add new mode files Jun 25, 2026

wangbj127 changed the title ~~[Refactor] Remove patch files related to rotary quantization, and add new mode files~~ [Misc] Remove patch files related to rotary quantization, and register mtp/eagle3 models Jun 25, 2026

github-actions Bot removed the merge-conflicts label Jun 25, 2026

wangbj127 force-pushed the main branch 2 times, most recently from 71bc39b to f458198 Compare June 25, 2026 01:51

wangbj127 requested a review from MengqingCao as a code owner June 25, 2026 02:59

wangbj127 force-pushed the main branch from 88bfc22 to 8fd2f11 Compare June 25, 2026 07:17

wangbj127 changed the title ~~[Misc] Remove patch files related to rotary quantization, and register mtp/eagle3 models~~ [Misc] Remove patch files related to rotary quantization, and register new mtp/eagle3 model classes Jun 25, 2026

Angazenn added the ready enable e2e test for PR label Jun 25, 2026

github-actions Bot added the merge-conflicts label Jun 26, 2026

github-actions Bot removed the merge-conflicts label Jun 26, 2026

wangbj127 closed this Jun 27, 2026

wangbj127 force-pushed the main branch from 58cd957 to 12c8da7 Compare June 27, 2026 02:09

wangbj127 reopened this Jun 27, 2026

wangbj127 force-pushed the main branch from 264ebfd to 40dddce Compare June 27, 2026 02:22

Remove patch files related to rotary quantization

853f5b7

Signed-off-by: Wangbj127 <wangbj1207@126.com>

wangbj127 force-pushed the main branch from 40dddce to 853f5b7 Compare June 27, 2026 03:09

github-actions Bot added the merge-conflicts label Jun 27, 2026

wangbj127 wants to merge 1 commit into vllm-project:main from wangbj127:main

2 participants
