
Unable to quantize Qwen 2.5 #5

Description

@rwfsmith

(venv) G:\Projects\quark9\examples\torch\language_modeling\llm_ptq>python quantize_quark.py --model_dir "G:\Models\Qwen2.5-7B-Instruct" --output_dir "G:\Models\Qwen2.5-7B-Instruct-quark" --quant_scheme w_uint4_per_group_asym --num_calib_data 128 --quant_algo awq --dataset pileval_for_awq_benchmark --model_export hf_format --data_type float16 --exclude_layers --skip_evaluation --quant_algo_config_file_path models\qwen2\qwen2_7b_awq_config.json

[QUARK-INFO]: C++ kernel compilation check start.

[QUARK-INFO]: C++ kernel build directory C:\Users\<user>\AppData\Local\torch_extensions\torch_extensions\Cache\py312_cu128\kernel_ext

[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...

[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 1.7611 seconds
W0821 02:34:36.798000 26264 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.

[INFO]: Loading model ...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.63s/it]
Initializing tokenizer from G:\Models\Qwen2.5-7B-Instruct

[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 512). Running this sequence through the model will result in indexing errors
[INFO-Warning] For AWQ benchmark, batch_size should be 59. Changing batch_size to 59.

[QUARK-INFO]: Configuration checking start.

[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight only quantization.

[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
    global_quant_config=QuantizationConfig(
        input_tensors=None,
        output_tensors=None,
        weight=QuantizationSpec(
            dtype=Dtype.uint4,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerGroupMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_group,
            ch_axis=1,
            group_size=128,
            symmetric=False,
            round_method=RoundType.half_even,
            scale_type=ScaleType.float,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        bias=None,
        target_device=None,
    ),
    layer_type_quant_config={},
    layer_quant_config={},
    kv_cache_quant_config={},
    softmax_quant_spec=None,
    exclude=[],
    algo_config=AWQConfig(
        name="awq",
        scaling_layers=[{'prev_op': 'input_layernorm', 'layers': ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj'], 'inp': 'self_attn.q_proj', 'module2inspect': 'self_attn'}, {'prev_op': 'self_attn.v_proj', 'layers': ['self_attn.o_proj'], 'inp': 'self_attn.o_proj'}, {'prev_op': 'post_attention_layernorm', 'layers': ['mlp.gate_proj', 'mlp.up_proj'], 'inp': 'mlp.gate_proj', 'module2inspect': 'mlp'}, {'prev_op': 'mlp.up_proj', 'layers': ['mlp.down_proj'], 'inp': 'mlp.down_proj'}],
        model_decoder_layers="model.layers",
        num_attention_heads=28,
        num_key_value_heads=4,
    ),
    pre_quant_opt_config=[],
    quant_mode=QuantizationMode.eager_mode,
    log_severity_level=1,
    version="0.9",
)
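
For reference, the effective scheme above (`w_uint4_per_group_asym`) maps onto Quark's Python config API roughly as follows. This is a minimal sketch assuming the class and enum names shown in the printout; the import paths follow Quark's documented examples and may differ across versions:

```python
# Sketch only: import paths assumed from Quark's documented examples and not
# verified against this exact Quark build.
from quark.torch.quantization.config.config import (Config, QuantizationConfig,
                                                    QuantizationSpec)
from quark.torch.quantization.config.type import (Dtype, QSchemeType, RoundType,
                                                  ScaleType, ZeroPointType)
from quark.torch.quantization.observer.observer import PerGroupMinMaxObserver

# Weight-only uint4, asymmetric, per-group with group size 128 -- matching the
# printed QuantizationSpec above.
weight_spec = QuantizationSpec(
    dtype=Dtype.uint4,
    observer_cls=PerGroupMinMaxObserver,
    qscheme=QSchemeType.per_group,
    ch_axis=1,
    group_size=128,
    symmetric=False,
    round_method=RoundType.half_even,
    scale_type=ScaleType.float,
    zero_point_type=ZeroPointType.int32,
)
config = Config(global_quant_config=QuantizationConfig(weight=weight_spec))
```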

[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu128 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.

[QUARK-INFO]: In-place OPs replacement start.
100%|██████████████████████████████████████████████████████████████████████████████| 399/399 [00:00<00:00, 2949.01it/s]

[QUARK-INFO]: Module replacement for quantization summary:
| Original module | Number original | Number replaced |
| Conv2d | 0 | 0 |
| Linear | 197 | 197 |
| ConvTranspose2d | 0 | 0 |
| Embedding | 1 | 0 |
| EmbeddingBag | 0 | 0 |
| Qwen2ForCausalLM | 1 | 0 |
| Qwen2Model | 1 | 0 |
| ModuleList | 1 | 0 |
| Qwen2DecoderLayer | 28 | 0 |
| Qwen2Attention | 28 | 0 |
| Qwen2RotaryEmbedding | 29 | 0 |
| Qwen2MLP | 28 | 0 |
| SiLU | 28 | 0 |
| Qwen2RMSNorm | 57 | 0 |

[QUARK-INFO]: In-place OPs replacement end.
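
For context: the "in-place OPs replacement" step swaps each quantizable module for a quantization-aware wrapper, which is why only the 197 `Linear` layers are replaced while embeddings, norms, and rotary embeddings are left alone. A generic sketch of the pattern (the `make_quant_linear` factory is hypothetical, not Quark's API):

```python
import torch.nn as nn

def replace_linears(module: nn.Module, make_quant_linear) -> None:
    # Recursively swap every nn.Linear for a quantized wrapper; this is the
    # generic shape of the step that reports "Linear | 197 | 197" above.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, make_quant_linear(child))  # hypothetical factory
        else:
            replace_linears(child, make_quant_linear)
```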

[QUARK-INFO]: Advanced algorithm start.
AWQ: 96%|██████████████████████████████████████████████████████████████████████████▎ | 27/28 [07:47<00:17, 17.31s/it]
Traceback (most recent call last):
  File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 269, in <module>
    main(args)
  File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 104, in main
    model = quantizer.quantize_model(model, calib_dataloader)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 167, in quantize_model
    model = self._apply_advanced_quant_algo(model, dataloader)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 275, in _apply_advanced_quant_algo
    return apply_advanced_quant_algo(model, self.config, self._is_accelerate, dataloader)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\api.py", line 90, in apply_advanced_quant_algo
    quantizer.apply()
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 77, in apply
    scales_list = [self._search_best_scale(self.modules[i], **layer) for layer in module_config]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 145, in _search_best_scale
    best_scales = self._compute_best_scale(inp, w_max, x_max, module2inspect, layers, fp16_output, kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 202, in _compute_best_scale
    raise Exception
Exception
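
For context: the bare `raise Exception` at awq.py line 202 matches the end of the scale grid search in AutoAWQ-style `_compute_best_scale` implementations, which raise when no candidate ratio ever produced a finite loss. A minimal sketch of that pattern, assuming Quark's version mirrors AutoAWQ (the variable names and the `compute_loss` callback are illustrative, not Quark's actual code):

```python
import torch

def compute_best_scale_sketch(x_max: torch.Tensor, w_max: torch.Tensor,
                              compute_loss, n_grid: int = 20) -> float:
    """Grid-search the AWQ balance exponent between activation and weight
    magnitudes. Illustrative only; not Quark's actual implementation."""
    best_ratio, best_error = -1.0, float("inf")
    for i in range(n_grid):
        ratio = i / n_grid
        # Candidate per-channel scales; clamped to avoid divide-by-zero.
        scales = (x_max.pow(ratio) / w_max.pow(1 - ratio)).clamp(min=1e-4)
        scales = scales / (scales.max() * scales.min()).sqrt()
        loss = compute_loss(scales)  # e.g. MSE(fp16 output, quantized output)
        if loss < best_error:  # NaN compares False, so a NaN loss never wins
            best_error, best_ratio = loss, ratio
    if best_ratio == -1.0:
        # Every candidate loss was NaN/inf: this is the branch that surfaces
        # as the bare `raise Exception` in the traceback above.
        raise Exception
    return best_ratio
```

If Quark's code follows this pattern, failing on the last decoder layer (27/28 in the progress bar) would suggest non-finite values during float16 calibration rather than a problem with the CLI flags, though that would need confirming against the Quark source.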
