[Misc] Remove patch files related to rotary quantization, and register new mtp/eagle3 model classes #10896

Open
wangbj127 wants to merge 1 commit into vllm-project:main from wangbj127:main

@wangbj127 wangbj127 commented Jun 24, 2026

What this PR does / why we need it?

This PR removes the deprecated patch files patch_deepseek_mtp.py and patch_draft_quarot.py. In their place, it registers two new model classes, AscendDeepSeekMTP and AscendEagle3LlamaForCausalLM.
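
For readers unfamiliar with string-based registration, here is a minimal sketch of the idea, assuming a registry that maps an architecture name to a "module:Class" string and imports the module only when the class is needed. `LazyModelRegistry` and `resolve` are hypothetical stand-ins written for illustration; this is not vLLM's `ModelRegistry` implementation, and only the `register_model(name, "module:Class")` call shape is taken from this PR.

```python
import importlib
from typing import Dict


class LazyModelRegistry:
    """Toy registry (hypothetical): stores "module:Class" strings, imports lazily."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def register_model(self, arch: str, target: str) -> None:
        module_name, sep, class_name = target.partition(":")
        if not sep or not module_name or not class_name:
            raise ValueError(f"expected 'module:Class', got {target!r}")
        # Nothing is imported here, so registering is cheap and has no side effects.
        self._entries[arch] = target

    def resolve(self, arch: str) -> type:
        module_name, _, class_name = self._entries[arch].partition(":")
        module = importlib.import_module(module_name)  # import happens on first use
        return getattr(module, class_name)


registry = LazyModelRegistry()
# Same call shape as the registration line in this PR; never imported in this sketch.
registry.register_model(
    "DeepSeekMTPModel", "vllm_ascend.models.deepseek_mtp:AscendDeepSeekMTP"
)
# A standard-library target, so the lazy lookup can actually be exercised.
registry.register_model("DemoOrderedDict", "collections:OrderedDict")
```

Compared with monkey-patching, the replacement class is selected by name at lookup time, so no upstream attribute is overwritten at import.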

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Qwen3-32B-W8A8-QuaRot + Eagle3 acceptance rate

(APIServer pid=3191163) INFO 06-27 09:35:37 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.56, Accepted throughput: 0.47 tokens/s, Drafted throughput: 0.91 tokens/s, Accepted: 86 tokens, Drafted: 165 tokens, Per-position acceptance rate: 0.764, 0.527, 0.273, Avg Draft acceptance rate: 52.1%
(APIServer pid=3191163) INFO 06-27 09:35:47 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.33, Accepted throughput: 10.40 tokens/s, Drafted throughput: 23.40 tokens/s, Accepted: 104 tokens, Drafted: 234 tokens, Per-position acceptance rate: 0.679, 0.423, 0.231, Avg Draft acceptance rate: 44.4%
(APIServer pid=3191163) INFO 06-27 09:35:57 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.26, Accepted throughput: 9.70 tokens/s, Drafted throughput: 23.10 tokens/s, Accepted: 97 tokens, Drafted: 231 tokens, Per-position acceptance rate: 0.662, 0.429, 0.169, Avg Draft acceptance rate: 42.0%
(APIServer pid=3191163) INFO 06-27 09:36:07 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.13, Accepted throughput: 0.53 tokens/s, Drafted throughput: 0.75 tokens/s, Accepted: 32 tokens, Drafted: 45 tokens, Per-position acceptance rate: 0.867, 0.733, 0.533, Avg Draft acceptance rate: 71.1%

  • GLM5-W4A8-QuaRot + MTP acceptance rate

(APIServer pid=3204216) INFO 06-27 09:51:05 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.05, Accepted throughput: 4.30 tokens/s, Drafted throughput: 6.30 tokens/s, Accepted: 43 tokens, Drafted: 63 tokens, Per-position acceptance rate: 0.952, 0.762, 0.333, Avg Draft acceptance rate: 68.3%
(APIServer pid=3204216) INFO 06-27 09:51:15 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.86, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.60 tokens/s, Accepted: 41 tokens, Drafted: 66 tokens, Per-position acceptance rate: 0.773, 0.682, 0.409, Avg Draft acceptance rate: 62.1%
(APIServer pid=3204216) INFO 06-27 09:51:25 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.95, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.30 tokens/s, Accepted: 41 tokens, Drafted: 63 tokens, Per-position acceptance rate: 0.762, 0.619, 0.571, Avg Draft acceptance rate: 65.1%
(APIServer pid=3204216) INFO 06-27 09:51:35 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.82, Accepted throughput: 4.00 tokens/s, Drafted throughput: 6.60 tokens/s, Accepted: 40 tokens, Drafted: 66 tokens, Per-position acceptance rate: 0.818, 0.591, 0.409, Avg Draft acceptance rate: 60.6%
(APIServer pid=3204216) INFO 06-27 09:51:45 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.05, Accepted throughput: 4.10 tokens/s, Drafted throughput: 6.00 tokens/s, Accepted: 41 tokens, Drafted: 60 tokens, Per-position acceptance rate: 0.950, 0.650, 0.450, Avg Draft acceptance rate: 68.3%
(APIServer pid=3204216) INFO 06-27 09:51:55 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.89, Accepted throughput: 3.60 tokens/s, Drafted throughput: 5.70 tokens/s, Accepted: 36 tokens, Drafted: 57 tokens, Per-position acceptance rate: 0.789, 0.632, 0.474, Avg Draft acceptance rate: 63.2%
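
The logged figures can be cross-checked against each other. The sketch below parses one of the lines above and recomputes the summary values, assuming (this is inferred from the numbers, not stated in the PR) that the average draft acceptance rate is accepted/drafted and that the mean acceptance length is 1 plus accepted tokens per draft, where the number of drafts is the drafted tokens divided by the number of speculative positions. `summarize` is a hypothetical helper.

```python
import re

# First Eagle3 log line from the PR description, message part only.
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 2.56, "
    "Accepted throughput: 0.47 tokens/s, Drafted throughput: 0.91 tokens/s, "
    "Accepted: 86 tokens, Drafted: 165 tokens, "
    "Per-position acceptance rate: 0.764, 0.527, 0.273, "
    "Avg Draft acceptance rate: 52.1%"
)


def summarize(line: str) -> dict:
    """Recompute the summary metrics from the raw token counts in one log line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    rates_text = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    per_position = [float(x) for x in rates_text.split(", ")]
    # Assumption: each draft proposes one token per speculative position.
    num_drafts = drafted / len(per_position)
    return {
        "avg_acceptance_pct": round(100 * accepted / drafted, 1),
        "mean_acceptance_length": round(1 + accepted / num_drafts, 2),
        "num_positions": len(per_position),
    }


print(summarize(LINE))
```

Under these assumptions the recomputed values match the logged 52.1% and 2.56 for this line.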

@wangbj127 wangbj127 requested a review from wangxiyuan as a code owner June 24, 2026 09:26
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the rotary quantization support for DeepSeek MTP models by moving away from monkey-patching via a dedicated patch file. Instead, it implements cleaner, class-based overrides within the 'vllm_ascend' model directory. This change improves code modularity and reduces reliance on global state manipulation, aligning with better software engineering practices for model extensions.

Highlights

  • Refactoring Patch Implementation: Removed the monolithic 'patch_deepseek_mtp.py' file and migrated its logic into dedicated model classes within 'vllm_ascend/models/deepseek_mtp.py' for better maintainability.
  • Model Registration: Updated 'vllm_ascend/models/__init__.py' to register the new 'AscendDeepSeekMTP' and 'AscendGlmMoeDsaForCausalLM' classes.
  • Utility Improvements: Added a centralized 'is_rot_weight_used' utility function in 'vllm_ascend/utils.py' to handle configuration checks for rotary quantization consistently.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the codebase by removing the deprecated monkey-patching file patch_deepseek_mtp.py and instead registering the AscendDeepSeekMTP and AscendGlmMoeDsaForCausalLM models directly in the model registry. Feedback on this PR includes a suggestion to format the PR title and summary according to the repository style guide, as well as several robust weight-loading improvements: filtering out rot. weights when rotary weights are not used to prevent potential AttributeError or ValueError crashes, expanding the skip_prefixes in AscendGlmMoeDsaForCausalLM to handle fully qualified weight names, and ensuring is_rot_weight_used consistently returns a boolean value.

ModelRegistry.register_model("DeepseekV4ForCausalLM", "vllm_ascend.models.deepseek_v4:AscendDeepseekV4ForCausalLM")

ModelRegistry.register_model("DeepSeekV4MTPModel", "vllm_ascend.models.deepseek_v4_mtp:DeepSeekV4MTP")
ModelRegistry.register_model("DeepSeekMTPModel", "vllm_ascend.models.deepseek_mtp:AscendDeepSeekMTP")

high

According to the Repository Style Guide, here is the suggested PR Title and PR Summary for this pull request:

Suggested PR Title:

[Ops][Misc] Delete patch files related to rotary quantization and register AscendDeepSeekMTP/AscendGlmMoeDsaForCausalLM models

Suggested PR Summary:

### What this PR does / why we need it?

This PR deletes the deprecated patch files related to rotary quantization (`patch_deepseek_mtp.py`) and registers the new `AscendDeepSeekMTP` and `AscendGlmMoeDsaForCausalLM` models directly in the model registry. This refactoring improves maintainability by replacing runtime monkey-patching with clean subclassing and proper weight loading/mapping.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with existing tests.
References
  1. Format the PR Title and PR Summary according to the Repository Style Guide. (link)

Comment on lines +42 to +48
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        weights_mapper = WeightsMapper(
            orig_to_new_prefix={
                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
            },
        )
        return super().load_weights(weights_mapper.apply(weights))

high

If self.is_rot_weight_used is False, self.rot is not defined. However, if the checkpoint contains rot.weight, super().load_weights will attempt to load it into self.rot, resulting in an AttributeError or ValueError. We should filter out any rot. weights when self.is_rot_weight_used is False to prevent potential crashes.

Suggested change
-    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-        weights_mapper = WeightsMapper(
-            orig_to_new_prefix={
-                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
-            },
-        )
-        return super().load_weights(weights_mapper.apply(weights))
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        if not self.is_rot_weight_used:
+            weights = ((name, tensor) for name, tensor in weights if not name.startswith("rot."))
+        weights_mapper = WeightsMapper(
+            orig_to_new_prefix={
+                "rot.": f"model.layers.{self.config.num_hidden_layers}.rot."
+            },
+        )
+        return super().load_weights(weights_mapper.apply(weights))
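
The filter-then-remap flow in this suggestion can be tried in isolation. The following is a minimal sketch, assuming plain strings stand in for tensors and a hand-written prefix rewrite stands in for `WeightsMapper`; `remap_rot_weights` and the layer count of 61 are hypothetical names and values chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple


def remap_rot_weights(
    weights: Iterable[Tuple[str, str]],
    num_hidden_layers: int,
    is_rot_weight_used: bool,
) -> Iterator[Tuple[str, str]]:
    """Drop or relocate top-level "rot." entries, pass everything else through."""
    target_prefix = f"model.layers.{num_hidden_layers}.rot."
    for name, value in weights:
        if name.startswith("rot."):
            if not is_rot_weight_used:
                continue  # no rot module exists to receive this weight
            name = target_prefix + name[len("rot."):]
        yield name, value


# Strings stand in for tensors; 61 is an arbitrary illustrative layer count.
checkpoint = [("rot.weight", "R"), ("model.embed_tokens.weight", "E")]
print(list(remap_rot_weights(checkpoint, 61, is_rot_weight_used=True)))
print(list(remap_rot_weights(checkpoint, 61, is_rot_weight_used=False)))
```

Because both steps are lazy generators, the checkpoint is still streamed once and never materialized in full.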

Comment on lines +57 to +60
class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self, skip_prefixes=["rot."])
        return loader.load_weights(weights)

high

If the checkpoint contains model.layers.{num_hidden_layers}.rot.weight, skip_prefixes=["rot."] will not match it because it starts with "model.layers.". This will cause AutoWeightsLoader to fail with a ValueError due to unexpected keys. We should add f"model.layers.{self.config.num_hidden_layers}.rot." to skip_prefixes to ensure all variations of the rot weight are skipped.

Suggested change
-class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
-    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
-        loader = AutoWeightsLoader(self, skip_prefixes=["rot."])
-        return loader.load_weights(weights)
+class AscendGlmMoeDsaForCausalLM(GlmMoeDsaForCausalLM):
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        loader = AutoWeightsLoader(
+            self,
+            skip_prefixes=[
+                "rot.",
+                f"model.layers.{self.config.num_hidden_layers}.rot.",
+            ],
+        )
+        return loader.load_weights(weights)
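
The premise of this comment is that skipping is a plain string-prefix test on the full weight name. Taking that premise as an assumption (it is not verified here against `AutoWeightsLoader` itself), the difference between the two prefix lists can be shown with a small sketch; `is_skipped` and the layer count of 61 are hypothetical.

```python
from typing import Sequence


def is_skipped(name: str, skip_prefixes: Sequence[str]) -> bool:
    """Return True when the weight name starts with any of the given prefixes."""
    return any(name.startswith(prefix) for prefix in skip_prefixes)


NUM_HIDDEN_LAYERS = 61  # illustrative value only
qualified = f"model.layers.{NUM_HIDDEN_LAYERS}.rot.weight"

narrow = ["rot."]
wide = ["rot.", f"model.layers.{NUM_HIDDEN_LAYERS}.rot."]

print(is_skipped("rot.weight", narrow))  # bare name matches "rot."
print(is_skipped(qualified, narrow))     # qualified name does not start with "rot."
print(is_skipped(qualified, wide))       # the extra prefix covers it
```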

Comment thread vllm_ascend/utils.py
Comment on lines +1586 to +1595
def is_rot_weight_used(vllm_config: VllmConfig = None):
    global _IS_ROT_WEIGHT_USED
    if vllm_config is None:
        from vllm.config import get_current_vllm_config_or_none

        vllm_config = get_current_vllm_config_or_none()
    if _IS_ROT_WEIGHT_USED is None and vllm_config is not None:
        quant_description = getattr(vllm_config.quant_config, "quant_description", None)
        _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
    return _IS_ROT_WEIGHT_USED

high

If vllm_config is None (e.g., when called without config and get_current_vllm_config_or_none() also returns None), the function returns None instead of a boolean, which can cause falsy/truthy confusion or type errors. We should return False as a safe fallback.

Suggested change
-def is_rot_weight_used(vllm_config: VllmConfig = None):
-    global _IS_ROT_WEIGHT_USED
-    if vllm_config is None:
-        from vllm.config import get_current_vllm_config_or_none
-        vllm_config = get_current_vllm_config_or_none()
-    if _IS_ROT_WEIGHT_USED is None and vllm_config is not None:
-        quant_description = getattr(vllm_config.quant_config, "quant_description", None)
-        _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
-    return _IS_ROT_WEIGHT_USED
+def is_rot_weight_used(vllm_config: VllmConfig = None) -> bool:
+    global _IS_ROT_WEIGHT_USED
+    if vllm_config is None:
+        from vllm.config import get_current_vllm_config_or_none
+        vllm_config = get_current_vllm_config_or_none()
+    if _IS_ROT_WEIGHT_USED is None:
+        if vllm_config is not None:
+            quant_description = getattr(vllm_config.quant_config, "quant_description", None)
+            _IS_ROT_WEIGHT_USED = quant_description.get("is_rot_used", False) if quant_description is not None else False
+        else:
+            return False
+    return _IS_ROT_WEIGHT_USED

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangbj127 wangbj127 changed the title from "Delete patch files related to rotary quantization" to "[Refactor] Remove patch files related to rotary quantization, and add new mode files" Jun 25, 2026
@wangbj127 wangbj127 changed the title from "[Refactor] Remove patch files related to rotary quantization, and add new mode files" to "[Misc] Remove patch files related to rotary quantization, and register mtp/eagle3 models" Jun 25, 2026
@wangbj127 wangbj127 force-pushed the main branch 2 times, most recently from 71bc39b to f458198 June 25, 2026 01:51
@wangbj127 wangbj127 requested a review from MengqingCao as a code owner June 25, 2026 02:59
@wangbj127 wangbj127 changed the title from "[Misc] Remove patch files related to rotary quantization, and register mtp/eagle3 models" to "[Misc] Remove patch files related to rotary quantization, and register new mtp/eagle3 model classes" Jun 25, 2026
@Angazenn Angazenn added the ready enable e2e test for PR label Jun 25, 2026
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Wangbj127 <wangbj1207@126.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
