Add DeepSeekV3 support for spinquant#2465
carrot-o0o wants to merge 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
Code Review
This pull request introduces support for DeepseekV3 models within the spinquant modifier by adding a partial_hadamard transform for combined KV weights, updating mapping configurations, and dynamically building norm mappings based on model architecture. Review comments highlight several critical issues: a potential race condition due to non-thread-safe global state modification in PartialHadamardFactory, a Denial of Service vulnerability from maliciously large regexes in _build_deepseek_v3_mappings, and another Denial of Service vulnerability from possible ZeroDivisionErrors in apply_partial_transform_weight. Additionally, there are opportunities to improve code maintainability by refactoring duplicated TransformArgs creation and removing redundant assignments in PartialHadamardTransform's constructor.
```python
PartialHadamardFactory.qk_nope_head_dim = (
    state.model.config.qk_nope_head_dim
)
PartialHadamardFactory.v_head_dim = state.model.config.v_head_dim
```
The SpinQuantModifier.on_start method sets model-specific configuration values (qk_nope_head_dim, v_head_dim) as class-level attributes on PartialHadamardFactory. This introduces a race condition in multi-threaded environments, as these globally shared attributes can be overwritten, leading to incorrect transformations and corrupted model weights. This non-thread-safe global state modification is a significant design issue. Consider alternatives like a context manager or threading.local to avoid this.
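One way to avoid the shared class-level state is to route the head dimensions through thread-local storage. This is a minimal sketch of the idea, not the PR's actual code; the helper names `set_head_dims`/`get_head_dims` are hypothetical, and the real fix would more likely pass the dimensions through the factory constructor:

```python
import threading

# Hypothetical illustration: keep per-model head dims in thread-local
# storage instead of mutating PartialHadamardFactory class attributes.
_local = threading.local()

def set_head_dims(qk_nope_head_dim: int, v_head_dim: int) -> None:
    # Each thread sees its own copy, so concurrent pipelines configuring
    # different models cannot clobber each other's values.
    _local.qk_nope_head_dim = qk_nope_head_dim
    _local.v_head_dim = v_head_dim

def get_head_dims() -> tuple:
    return (_local.qk_nope_head_dim, _local.v_head_dim)
```

Passing the values as instance attributes at construction time would achieve the same isolation with less hidden state.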
```python
k = getattr(model.config, "first_k_dense_replace", 1)
# regex matching dense layers: 0, 1, ..., k-1
dense_re = "|".join(str(i) for i in range(k))
# regex matching MoE layers: k, k+1, ...
# matches any number that is NOT one of the dense layer indices
moe_re = rf"(?!({'|'.join(str(i) for i in range(k))})(?:\D|$))\d+"
```
The _build_deepseek_v3_mappings function constructs regular expressions by joining a range of integers up to k, where k is retrieved from the model configuration (model.config.first_k_dense_replace). If an attacker provides a model with a maliciously large value for this configuration parameter, it can lead to the creation of extremely large regular expressions. Compiling and matching such regexes can consume excessive CPU and memory, leading to a Denial of Service (DoS).
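A simple mitigation is to validate `first_k_dense_replace` before building the alternation. The sketch below assumes a bound of 512 (`MAX_DENSE_LAYERS` is an invented safety limit, not a value from the PR or the model configs):

```python
# Hypothetical guard: bound first_k_dense_replace before constructing the
# layer-index alternation regex, so an untrusted config cannot force an
# arbitrarily large pattern.
MAX_DENSE_LAYERS = 512  # assumed limit; no real model approaches this

def build_dense_regex(k: int) -> str:
    if not isinstance(k, int) or not (0 <= k <= MAX_DENSE_LAYERS):
        raise ValueError(f"first_k_dense_replace out of range: {k!r}")
    return "|".join(str(i) for i in range(k))
```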
```python
num_chunks = value.shape[0] // chunk_size
result = value.clone()

scale = torch.tensor(v_head_dim, dtype=torch.float64).sqrt()
for i in range(num_chunks):
    start_idx = i * chunk_size + qk_nope_head_dim
    end_idx = (i + 1) * chunk_size
    result[start_idx:end_idx, :] = (
        _multihead_matmul(transform_weight.T, value[start_idx:end_idx, :])
        / scale
    )
```
The apply_partial_transform_weight function performs division by chunk_size and scale, where chunk_size is the sum of qk_nope_head_dim and v_head_dim, and scale is the square root of v_head_dim. These values are retrieved from the model configuration. If an attacker provides a model with a configuration where qk_nope_head_dim + v_head_dim == 0 or v_head_dim == 0, it will cause a ZeroDivisionError, leading to a Denial of Service.
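A guard along these lines would reject degenerate head dimensions before any division happens. This is a sketch; `validate_head_dims` is a hypothetical helper, not a function in the PR:

```python
# Hypothetical validation of head dimensions read from an untrusted model
# config; rejects values that would make chunk_size or scale zero.
def validate_head_dims(qk_nope_head_dim: int, v_head_dim: int) -> int:
    if qk_nope_head_dim < 0 or v_head_dim <= 0:
        raise ValueError(
            f"invalid head dims: qk_nope_head_dim={qk_nope_head_dim}, "
            f"v_head_dim={v_head_dim}"
        )
    # chunk_size > 0 is now guaranteed, so the later // chunk_size and
    # / sqrt(v_head_dim) operations cannot divide by zero.
    return qk_nope_head_dim + v_head_dim
```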
```python
apply_list = []
apply_list.append(TransformArgs(
    targets=[
        self.mappings.embedding,
        self.mappings.attn_o,
        *self.mappings.mlp_out,
    ],
    location="weight_output",
))
if getattr(self.mappings, "attn_v_is_kv_combined", False):
    apply_list.append(TransformArgs(
        targets=[
            self.mappings.attn_q,
            self.mappings.attn_k,
            *self.mappings.mlp_in,
            self.mappings.lm_head,
        ],
        location="weight_input",
        inverse=True,
    ))
else:
    apply_list.append(TransformArgs(
        targets=[
            self.mappings.attn_q,
            self.mappings.attn_k,
            self.mappings.attn_v,
            *self.mappings.mlp_in,
            self.mappings.lm_head,
        ],
        location="weight_input",
        inverse=True,
    ))
```
This if/else block has significant code duplication for creating the TransformArgs. This can be refactored by building the targets list first, which will make the code more concise and easier to maintain.
```python
apply_list = [
    TransformArgs(
        targets=[
            self.mappings.embedding,
            self.mappings.attn_o,
            *self.mappings.mlp_out,
        ],
        location="weight_output",
    )
]
weight_input_targets = [
    self.mappings.attn_q,
    self.mappings.attn_k,
]
if not getattr(self.mappings, "attn_v_is_kv_combined", False):
    weight_input_targets.append(self.mappings.attn_v)
weight_input_targets.extend(self.mappings.mlp_in)
weight_input_targets.append(self.mappings.lm_head)
apply_list.append(
    TransformArgs(
        targets=weight_input_targets,
        location="weight_input",
        inverse=True,
    )
)
```

```python
super().__init__(weight, perm, scheme, args, module_type)
self.weight = weight
self.perm = perm
self.scheme = scheme
self.args = args
self.module_type = module_type
self.qk_nope_head_dim = qk_nope_head_dim
self.v_head_dim = v_head_dim
```
The assignments to self.weight, self.perm, self.scheme, self.args, and self.module_type are redundant as they are already handled by the super().__init__() call. You can remove them for cleaner and more maintainable code.
Suggested change:

```diff
 super().__init__(weight, perm, scheme, args, module_type)
-self.weight = weight
-self.perm = perm
-self.scheme = scheme
-self.args = args
-self.module_type = module_type
 self.qk_nope_head_dim = qk_nope_head_dim
 self.v_head_dim = v_head_dim
```
SUMMARY:

- Added support for `DeepseekV3` on spinquant `R1`, `R2`, `R4` with `transform_type="hadamard"`. If this PR seems appealing, I am willing to add support for R3 and the other transform types.
- Treats `DeepseekV3TopkRouter` as `nn.Linear` during transform, since `compressed_tensors` only supports `nn.Linear`.
- Adds `partial_hadamard`, which applies the transformation to only a "part" of a weight matrix, since `DeepseekV3` has the `kv_proj` weights merged and `R4` only needs to apply to `v_proj`.

TEST PLAN:

I tested this code snippet to ensure spinning R1, R2, R4 results in sane outputs. `kanana-2-30b-a3b-instruct-2601`'s `model_type` is `deepseek_v3` but it has fewer layers, making it faster to test.