Description
(venv) G:\Projects\quark9\examples\torch\language_modeling\llm_ptq>python quantize_quark.py --model_dir "G:\Models\Qwen2.5-7B-Instruct" --output_dir "G:\Models\Qwen2.5-7B-Instruct-quark" --quant_scheme w_uint4_per_group_asym --num_calib_data 128 --quant_algo awq --dataset pileval_for_awq_benchmark --model_export hf_format --data_type float16 --exclude_layers --skip_evaluation --quant_algo_config_file_path models\qwen2\qwen2_7b_awq_config.json
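For context, the command above drives the Python API that appears in the traceback at the bottom (quark.torch.quantization.api, quantize_model). A minimal sketch of the equivalent programmatic call, assuming `ModelQuantizer` is exported from `quark.torch`; the calibration texts are stand-ins for the 128 pileval_for_awq_benchmark samples, and `config` is the Config object dumped further down (a construction sketch follows that dump):

```python
# Sketch of the code path the CLI runs; `quark.torch.ModelQuantizer` is an
# assumed entry point (the traceback confirms quantize_model lives in
# quark.torch.quantization.api).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from quark.torch import ModelQuantizer

model_dir = "G:/Models/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Stand-in calibration batches; per the log below, the script forces
# batch_size=59 for the AWQ benchmark dataset.
texts = ["example calibration text"] * 128
ids = tokenizer(texts, return_tensors="pt", padding=True, truncation=True,
                max_length=512)["input_ids"]
calib_dataloader = DataLoader(ids, batch_size=59)

quantizer = ModelQuantizer(config)                         # config: see dump below
model = quantizer.quantize_model(model, calib_dataloader)  # raises, see traceback
```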
[QUARK-INFO]: C++ kernel compilation check start.
[QUARK-INFO]: C++ kernel build directory C:\Users\<user>\AppData\Local\torch_extensions\torch_extensions\Cache\py312_cu128\kernel_ext
[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...
[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 1.7611 seconds
W0821 02:34:36.798000 26264 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[INFO]: Loading model ...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.63s/it]
Initializing tokenizer from G:\Models\Qwen2.5-7B-Instruct
[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 512). Running this sequence through the model will result in indexing errors
[INFO-Warning] For AWQ benchmark, batch_size should be 59. Changing batch_size to 59.
[QUARK-INFO]: Configuration checking start.
[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight only quantization.
[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
global_quant_config=QuantizationConfig(
input_tensors=None,
output_tensors=None,
weight=QuantizationSpec(
dtype=Dtype.uint4,
observer_cls=<class 'quark.torch.quantization.observer.observer.PerGroupMinMaxObserver'>,
is_dynamic=False,
qscheme=QSchemeType.per_group,
ch_axis=1,
group_size=128,
symmetric=False,
round_method=RoundType.half_even,
scale_type=ScaleType.float,
scale_format=None,
scale_calculation_mode=None,
qat_spec=None,
mx_element_dtype=None,
zero_point_type=ZeroPointType.int32,
is_scale_quant=False,
),
bias=None,
target_device=None,
),
layer_type_quant_config={},
layer_quant_config={},
kv_cache_quant_config={},
softmax_quant_spec=None,
exclude=[
],
algo_config=AWQConfig(
name="awq",
scaling_layers=[{'prev_op': 'input_layernorm', 'layers': ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj'], 'inp': 'self_attn.q_proj', 'module2inspect': 'self_attn'}, {'prev_op': 'self_attn.v_proj', 'layers': ['self_attn.o_proj'], 'inp': 'self_attn.o_proj'}, {'prev_op': 'post_attention_layernorm', 'layers': ['mlp.gate_proj', 'mlp.up_proj'], 'inp': 'mlp.gate_proj', 'module2inspect': 'mlp'}, {'prev_op': 'mlp.up_proj', 'layers': ['mlp.down_proj'], 'inp': 'mlp.down_proj'}],
model_decoder_layers="model.layers",
num_attention_heads=28,
num_key_value_heads=4,
),
pre_quant_opt_config=[
],
quant_mode=QuantizationMode.eager_mode,
log_severity_level=1,
version="0.9",
)
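The dump maps one-to-one onto Quark's config dataclasses. A sketch of building the same Config in code; every import path below is an assumption inferred from the repr, except the observer, whose full module path (quark.torch.quantization.observer.observer) is printed verbatim above:

```python
# Rebuilding the printed Config; module paths are assumptions from the repr.
from quark.torch.quantization.observer.observer import PerGroupMinMaxObserver
from quark.torch.quantization.config.config import (
    Config, QuantizationConfig, QuantizationSpec, AWQConfig)
from quark.torch.quantization.config.type import (
    Dtype, QSchemeType, RoundType, ScaleType, ZeroPointType)

weight_spec = QuantizationSpec(
    dtype=Dtype.uint4,                   # w_uint4_per_group_asym
    observer_cls=PerGroupMinMaxObserver,
    is_dynamic=False,
    qscheme=QSchemeType.per_group,
    ch_axis=1,
    group_size=128,
    symmetric=False,                     # asymmetric -> int32 zero points
    round_method=RoundType.half_even,
    scale_type=ScaleType.float,
    zero_point_type=ZeroPointType.int32,
)

config = Config(
    global_quant_config=QuantizationConfig(weight=weight_spec),
    # scaling_layers comes from models\qwen2\qwen2_7b_awq_config.json on the
    # command line; abbreviated here, the full value is in the dump above.
    algo_config=AWQConfig(name="awq", scaling_layers=[...]),
)
```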
[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu128 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.
[QUARK-INFO]: In-place OPs replacement start.
100%|██████████████████████████████████████████████████████████████████████████████| 399/399 [00:00<00:00, 2949.01it/s]
[QUARK-INFO]: Module replacement for quantization summary:
| Original module | Number original | Number replaced |
| Conv2d | 0 | 0 |
| Linear | 197 | 197 |
| ConvTranspose2d | 0 | 0 |
| Embedding | 1 | 0 |
| EmbeddingBag | 0 | 0 |
| Qwen2ForCausalLM | 1 | 0 |
| Qwen2Model | 1 | 0 |
| ModuleList | 1 | 0 |
| Qwen2DecoderLayer | 28 | 0 |
| Qwen2Attention | 28 | 0 |
| Qwen2RotaryEmbedding | 29 | 0 |
| Qwen2MLP | 28 | 0 |
| SiLU | 28 | 0 |
| Qwen2RMSNorm | 57 | 0 |
[QUARK-INFO]: In-place OPs replacement end.
[QUARK-INFO]: Advanced algorithm start.
AWQ: 96%|██████████████████████████████████████████████████████████████████████████▎ | 27/28 [07:47<00:17, 17.31s/it]
Traceback (most recent call last):
File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 269, in
main(args)
File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 104, in main
model = quantizer.quantize_model(model, calib_dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 167, in quantize_model
model = self._apply_advanced_quant_algo(model, dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 275, in _apply_advanced_quant_algo
return apply_advanced_quant_algo(model, self.config, self._is_accelerate, dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\api.py", line 90, in apply_advanced_quant_algo
quantizer.apply()
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 77, in apply
scales_list = [self._search_best_scale(self.modules[i], **layer) for layer in module_config]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 145, in _search_best_scale
best_scales = self._compute_best_scale(inp, w_max, x_max, module2inspect, layers, fp16_output, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 202, in _compute_best_scale
raise Exception
Exception
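The failing frame is a bare `raise Exception` at the end of AWQ's per-layer scale grid search, so the message itself carries no information. Assuming Quark's implementation follows the common AutoAWQ-style pattern (the names `compute_loss` and `n_grid` below are illustrative, not taken from Quark's source), the bare raise fires when no grid candidate ever beats the initial infinite best loss, which is exactly what happens when every candidate's loss is NaN or Inf, e.g. after a float16 overflow in the reference output of the inspected layer:

```python
# Illustration of the failure mode behind `raise Exception` at awq.py:202.
# This mirrors the widely used AutoAWQ-style grid search; it is a sketch,
# not Quark's actual source.
import torch

def compute_best_scale(x_max: torch.Tensor, w_max: torch.Tensor,
                       compute_loss, n_grid: int = 20) -> torch.Tensor:
    best_error = float("inf")
    best_ratio = -1.0
    best_scales = None
    history = []
    for i in range(n_grid):
        ratio = i / n_grid
        # Candidate scales interpolate between activation and weight
        # magnitudes; the clamp avoids division by zero.
        scales = (x_max.pow(ratio) / w_max.pow(1 - ratio)).clamp(min=1e-4)
        loss = compute_loss(scales)
        history.append(loss)
        # NaN < best_error is always False, so a NaN loss is never selected.
        if loss < best_error:
            best_error, best_ratio, best_scales = loss, ratio, scales
    if best_ratio == -1:
        # No candidate was ever selected: every loss was NaN/Inf. This is the
        # state that triggers the bare exception in the traceback (the real
        # code raises Exception with no message).
        raise Exception(f"no valid AWQ scale found; loss history: {history}")
    return best_scales
```

If that is what is happening here, the loss history at quark\torch\algorithm\awq\awq.py line 202 should be all-NaN for the last decoder layer (the progress bar dies at 27/28), which would point at a NaN/Inf appearing in the fp16_output computed just before _compute_best_scale is called.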