🚀 The feature, motivation and pitch
motivation
AWQ is a commonly used quantization method, and many AWQ-quantized models, such as the Qwen family, are available for immediate use. vllm-ascend currently supports quantized models produced by modelslim, but quantizing a model that way takes a long time, and we cannot cover every model a user may want to run in quantized form.
implement
validation
| Type | Architecture | Models | Model Name | Aclgraph Mode | Accuracy | Performance | Compare to W8A8 |
|---|---|---|---|---|---|---|---|
| Text-only | DeepseekV3ForCausalLM | DeepSeek-V3 | | | | | |
| Text-only | DeepseekV3ForCausalLM | DeepSeek-R1 | | | | | |
| Text-only | Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2.5-32B-Instruct-AWQ, Qwen/QwQ-32B-AWQ | ✅ | | | |
| Text-only | Qwen3ForCausalLM | Qwen3 | Qwen/Qwen3-32B-AWQ | ✅ | ceval: 0.85 | | |
| Text-only | Qwen3MoeForCausalLM | Qwen3MoE | billy800/Qwen3-30B-A3B-Instruct-2507-AWQ | ✅ | ceval: 0.8403 | | |
| Multimodal | Qwen2AudioForConditionalGeneration | Qwen2-Audio | | | | | |
| Multimodal | Qwen2VLForConditionalGeneration | QVQ, Qwen2-VL | Qwen/Qwen2-VL-7B-Instruct-AWQ | ✅ | | | |
| Multimodal | Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen/Qwen2.5-VL-32B-Instruct-AWQ | ❌ (accuracy issue) | | | |
| Multimodal | Qwen3VLForConditionalGeneration | Qwen3-VL | tclf90/Qwen3-VL-32B-Instruct-AWQ | ✅ | | | |
| Multimodal | Qwen3VLMoeForConditionalGeneration | Qwen3-VL-MOE | tclf90/Qwen3-VL-30B-A3B-Instruct-AWQ | ✅ | | | |
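
For anyone who wants to reproduce a row of this table, a smoke test could look roughly like the sketch below. It is only an illustration and assumes that vLLM's offline `LLM` API with `quantization="awq"` behaves on vllm-ascend the same way it does upstream, and that `enforce_eager=True` is what switches aclgraph off. The helper names `awq_engine_args` and `smoke_test` are hypothetical, and the prompt and `max_model_len` values are arbitrary.

```python
from typing import Any


def awq_engine_args(model_name: str, use_graph_mode: bool = True,
                    max_model_len: int = 4096) -> dict[str, Any]:
    """Build keyword arguments for vllm.LLM (hypothetical helper)."""
    return {
        "model": model_name,
        "quantization": "awq",
        "max_model_len": max_model_len,
        # Assumption: running eagerly is what disables aclgraph on Ascend.
        "enforce_eager": not use_graph_mode,
    }


def smoke_test(model_name: str, prompt: str = "Hello, my name is") -> str:
    """Load the model and return one short greedy completion."""
    # Imported lazily so awq_engine_args can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**awq_engine_args(model_name))
    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    # Model name taken from the table above.
    print(smoke_test("Qwen/Qwen3-32B-AWQ"))
```

Passing `use_graph_mode=False` to `awq_engine_args` gives an eager-mode baseline, which may help separate graph-capture problems from quantization accuracy problems.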
Alternatives
No response
Additional context
No response