[Feature]: Add AWQ quantization support for vllm-ascend #4378

@menogrey


🚀 The feature, motivation and pitch

Motivation

AWQ is a widely used quantization method, and many ready-to-use AWQ-quantized checkpoints are already published, such as the Qwen series. Today, vllm-ascend only supports models quantized with modelslim; quantizing a model that way takes a lot of time, and we cannot cover every model a user may want to run in quantized form.

Implementation
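
One possible direction, sketched loosely: hook AWQ into vLLM's quantization-config registry with an Ascend-specific linear method. Everything below is illustrative only, not a committed design; the `AscendAWQConfig` class and the `"ascend-awq"` key are hypothetical names, while `register_quantization_config` and `AWQConfig` are upstream vLLM's existing custom-quantization hook and AWQ config class.

```python
# Hypothetical sketch: register an Ascend-specific AWQ config through
# vLLM's custom-quantization hook. Names here are illustrative, not
# vllm-ascend's actual implementation.
import torch

from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.awq import AWQConfig


@register_quantization_config("ascend-awq")
class AscendAWQConfig(AWQConfig):
    """AWQ config whose linear method dispatches to Ascend NPU kernels."""

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Return a LinearMethodBase that unpacks the AWQ int4 weights
        # and dispatches the matmul to an Ascend kernel instead of the
        # CUDA path used upstream.
        ...
```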

Validation

| Type | Architecture | Models | Model Name | Aclgraph Mode | Accuracy | Performance vs. W8A8 |
|------|--------------|--------|------------|---------------|----------|----------------------|
| Text-only | DeepseekV3ForCausalLM | DeepSeek-V3 | | | | |
| Text-only | DeepseekV3ForCausalLM | DeepSeek-R1 | | | | |
| Text-only | Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2.5-32B-Instruct-AWQ, Qwen/QwQ-32B-AWQ | | | |
| Text-only | Qwen3ForCausalLM | Qwen3 | Qwen/Qwen3-32B-AWQ | | ceval: 0.85 | |
| Text-only | Qwen3MoeForCausalLM | Qwen3MoE | billy800/Qwen3-30B-A3B-Instruct-2507-AWQ | | ceval: 0.8403 | |
| Multimodal | Qwen2AudioForConditionalGeneration | Qwen2-Audio | | | | |
| Multimodal | Qwen2VLForConditionalGeneration | QVQ, Qwen2-VL | Qwen/Qwen2-VL-7B-Instruct-AWQ | | | |
| Multimodal | Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen/Qwen2.5-VL-32B-Instruct-AWQ | | ❌ (accuracy issue) | |
| Multimodal | Qwen3VLForConditionalGeneration | Qwen3-VL | tclf90/Qwen3-VL-32B-Instruct-AWQ | | | |
| Multimodal | Qwen3VLMoeForConditionalGeneration | Qwen3-VL-MOE | tclf90/Qwen3-VL-30B-A3B-Instruct-AWQ | | | |
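
For reference, a minimal sketch of how one of the validated checkpoints above would be loaded once this feature lands. The `quantization="awq"` argument is upstream vLLM's standard flag for AWQ checkpoints; running it on an Ascend NPU is the part this request would enable.

```python
from vllm import LLM, SamplingParams

# "Qwen/Qwen3-32B-AWQ" is one of the checkpoints validated above.
# quantization="awq" tells vLLM to use its AWQ weight loader.
llm = LLM(model="Qwen/Qwen3-32B-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Briefly explain AWQ quantization."], params)
print(outputs[0].outputs[0].text)
```

Equivalently, from the command line: `vllm serve Qwen/Qwen3-32B-AWQ --quantization awq`.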

Alternatives

No response

Additional context

No response
