Description
(venv) G:\Projects\quark9\examples\torch\language_modeling\llm_ptq>python quantize_quark.py --model_dir "G:\Models\Qwen2.5-7B-Instruct" --output_dir "G:\Models\Qwen2.5-7B-Instruct-quark" --quant_scheme w_uint4_per_group_asym --num_calib_data 128 --quant_algo awq --dataset pileval_for_awq_benchmark --model_export hf_format --data_type float16 --exclude_layers --skip_evaluation --quant_algo_config_file_path models\qwen2\qwen2_7b_awq_config.json
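For context, the command above drives the Python API that appears in the traceback at the bottom (quark.torch.quantization.api, quantize_model). A minimal sketch of the equivalent programmatic call, assuming `ModelQuantizer` is exported from `quark.torch`; the calibration texts are stand-ins for the 128 pileval_for_awq_benchmark samples, and `config` is the Config object dumped further down (a construction sketch follows that dump):

```python
# Sketch of the code path the CLI runs; `quark.torch.ModelQuantizer` is an
# assumed entry point (the traceback confirms quantize_model lives in
# quark.torch.quantization.api).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from quark.torch import ModelQuantizer

model_dir = "G:/Models/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Stand-in calibration batches; per the log below, the script forces
# batch_size=59 for the AWQ benchmark dataset.
texts = ["example calibration text"] * 128
ids = tokenizer(texts, return_tensors="pt", padding=True, truncation=True,
                max_length=512)["input_ids"]
calib_dataloader = DataLoader(ids, batch_size=59)

quantizer = ModelQuantizer(config)                         # config: see dump below
model = quantizer.quantize_model(model, calib_dataloader)  # raises, see traceback
```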
[QUARK-INFO]: C++ kernel compilation check start.
[QUARK-INFO]: C++ kernel build directory C:\Users\<user>\AppData\Local\torch_extensions\torch_extensions\Cache\py312_cu128\kernel_ext
[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...
[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 1.7611 seconds
W0821 02:34:36.798000 26264 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[INFO]: Loading model ...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:06<00:00, 1.63s/it]
Initializing tokenizer from G:\Models\Qwen2.5-7B-Instruct
[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 512). Running this sequence through the model will result in indexing errors
[INFO-Warning] For AWQ benchmark, batch_size should be 59. Changing batch_size to 59.
[QUARK-INFO]: Configuration checking start.
[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight only quantization.
[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
global_quant_config=QuantizationConfig(
input_tensors=None,
output_tensors=None,
weight=QuantizationSpec(
dtype=Dtype.uint4,
observer_cls=<class 'quark.torch.quantization.observer.observer.PerGroupMinMaxObserver'>,
is_dynamic=False,
qscheme=QSchemeType.per_group,
ch_axis=1,
group_size=128,
symmetric=False,
round_method=RoundType.half_even,
scale_type=ScaleType.float,
scale_format=None,
scale_calculation_mode=None,
qat_spec=None,
mx_element_dtype=None,
zero_point_type=ZeroPointType.int32,
is_scale_quant=False,
),
bias=None,
target_device=None,
),
layer_type_quant_config={},
layer_quant_config={},
kv_cache_quant_config={},
softmax_quant_spec=None,
exclude=[
],
algo_config=AWQConfig(
name="awq",
scaling_layers=[{'prev_op': 'input_layernorm', 'layers': ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj'], 'inp': 'self_attn.q_proj', 'module2inspect': 'self_attn'}, {'prev_op': 'self_attn.v_proj', 'layers': ['self_attn.o_proj'], 'inp': 'self_attn.o_proj'}, {'prev_op': 'post_attention_layernorm', 'layers': ['mlp.gate_proj', 'mlp.up_proj'], 'inp': 'mlp.gate_proj', 'module2inspect': 'mlp'}, {'prev_op': 'mlp.up_proj', 'layers': ['mlp.down_proj'], 'inp': 'mlp.down_proj'}],
model_decoder_layers="model.layers",
num_attention_heads=28,
num_key_value_heads=4,
),
pre_quant_opt_config=[
],
quant_mode=QuantizationMode.eager_mode,
log_severity_level=1,
version="0.9",
)
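The dump maps one-to-one onto Quark's config dataclasses. A sketch of building the same Config in code; every import path below is an assumption inferred from the repr, except the observer, whose full module path (quark.torch.quantization.observer.observer) is printed verbatim above:

```python
# Rebuilding the printed Config; module paths are assumptions from the repr.
from quark.torch.quantization.observer.observer import PerGroupMinMaxObserver
from quark.torch.quantization.config.config import (
    Config, QuantizationConfig, QuantizationSpec, AWQConfig)
from quark.torch.quantization.config.type import (
    Dtype, QSchemeType, RoundType, ScaleType, ZeroPointType)

weight_spec = QuantizationSpec(
    dtype=Dtype.uint4,                   # w_uint4_per_group_asym
    observer_cls=PerGroupMinMaxObserver,
    is_dynamic=False,
    qscheme=QSchemeType.per_group,
    ch_axis=1,
    group_size=128,
    symmetric=False,                     # asymmetric -> int32 zero points
    round_method=RoundType.half_even,
    scale_type=ScaleType.float,
    zero_point_type=ZeroPointType.int32,
)

config = Config(
    global_quant_config=QuantizationConfig(weight=weight_spec),
    # scaling_layers comes from models\qwen2\qwen2_7b_awq_config.json on the
    # command line; abbreviated here, the full value is in the dump above.
    algo_config=AWQConfig(name="awq", scaling_layers=[...]),
)
```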
[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu128 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.
[QUARK-INFO]: In-place OPs replacement start.
100%|██████████████████████████████████████████████████████████████████████████████| 399/399 [00:00<00:00, 2949.01it/s]
[QUARK-INFO]: Module replacement for quantization summary:
| Original module | Number original | Number replaced |
| Conv2d | 0 | 0 |
| Linear | 197 | 197 |
| ConvTranspose2d | 0 | 0 |
| Embedding | 1 | 0 |
| EmbeddingBag | 0 | 0 |
| Qwen2ForCausalLM | 1 | 0 |
| Qwen2Model | 1 | 0 |
| ModuleList | 1 | 0 |
| Qwen2DecoderLayer | 28 | 0 |
| Qwen2Attention | 28 | 0 |
| Qwen2RotaryEmbedding | 29 | 0 |
| Qwen2MLP | 28 | 0 |
| SiLU | 28 | 0 |
| Qwen2RMSNorm | 57 | 0 |
[QUARK-INFO]: In-place OPs replacement end.
[QUARK-INFO]: Advanced algorithm start.
AWQ: 96%|██████████████████████████████████████████████████████████████████████████▎ | 27/28 [07:47<00:17, 17.31s/it]
Traceback (most recent call last):
File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 269, in
main(args)
File "G:\Projects\quark9\examples\torch\language_modeling\llm_ptq\quantize_quark.py", line 104, in main
model = quantizer.quantize_model(model, calib_dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 167, in quantize_model
model = self._apply_advanced_quant_algo(model, dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\quantization\api.py", line 275, in _apply_advanced_quant_algo
return apply_advanced_quant_algo(model, self.config, self._is_accelerate, dataloader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\api.py", line 90, in apply_advanced_quant_algo
quantizer.apply()
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 77, in apply
scales_list = [self._search_best_scale(self.modules[i], **layer) for layer in module_config]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\torch\utils_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 145, in _search_best_scale
best_scales = self._compute_best_scale(inp, w_max, x_max, module2inspect, layers, fp16_output, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\Projects\quark9\venv\Lib\site-packages\quark\torch\algorithm\awq\awq.py", line 202, in _compute_best_scale
raise Exception
Exception
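The failing frame is a bare `raise Exception` at the end of AWQ's per-layer scale grid search, so the message itself carries no information. Assuming Quark's implementation follows the common AutoAWQ-style pattern (the names `compute_loss` and `n_grid` below are illustrative, not taken from Quark's source), the bare raise fires when no grid candidate ever beats the initial infinite best loss, which is exactly what happens when every candidate's loss is NaN or Inf, e.g. after a float16 overflow in the reference output of the inspected layer:

```python
# Illustration of the failure mode behind `raise Exception` at awq.py:202.
# This mirrors the widely used AutoAWQ-style grid search; it is a sketch,
# not Quark's actual source.
import torch

def compute_best_scale(x_max: torch.Tensor, w_max: torch.Tensor,
                       compute_loss, n_grid: int = 20) -> torch.Tensor:
    best_error = float("inf")
    best_ratio = -1.0
    best_scales = None
    history = []
    for i in range(n_grid):
        ratio = i / n_grid
        # Candidate scales interpolate between activation and weight
        # magnitudes; the clamp avoids division by zero.
        scales = (x_max.pow(ratio) / w_max.pow(1 - ratio)).clamp(min=1e-4)
        loss = compute_loss(scales)
        history.append(loss)
        # NaN < best_error is always False, so a NaN loss is never selected.
        if loss < best_error:
            best_error, best_ratio, best_scales = loss, ratio, scales
    if best_ratio == -1:
        # No candidate was ever selected: every loss was NaN/Inf. This is the
        # state that triggers the bare exception in the traceback (the real
        # code raises Exception with no message).
        raise Exception(f"no valid AWQ scale found; loss history: {history}")
    return best_scales
```

If that is what is happening here, the loss history at quark\torch\algorithm\awq\awq.py line 202 should be all-NaN for the last decoder layer (the progress bar dies at 27/28), which would point at a NaN/Inf appearing in the fp16_output computed just before _compute_best_scale is called.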