When trying out ONNX graph export in the llm_ptq examples:

RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache

Is there planned feature development for ONNX export of models in the HF implementation? Thanks!
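For context, recent transformers causal-LM models return past_key_values as a DynamicCache object when use_cache is enabled, and torch.jit's output flattening rejects that type during tracing. A minimal sketch of what I believe reproduces the same error outside of Quark (the model path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model path; any recent HF causal LM should behave the same.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B").eval()
dummy_input = torch.ones(1, 8, dtype=torch.long)

# With use_cache enabled (the default), forward() returns past_key_values
# as a DynamicCache, which torch.jit's output flattening cannot handle,
# so tracing-based ONNX export fails with the error above.
torch.onnx.export(model, (dummy_input,), "model.onnx")
```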
Full trace:
(quark) ➜ llm_ptq git:(release/0.9) ✗ python quantize_quark.py --model_dir ../../../../../models/qwen/qwen1.5-0.5b \
--quant_scheme w_fp8_a_fp8 --kv_cache_dtype fp8 --num_calib_data 128 --model_export onnx --output_dir ./qwen1.5-0.5b
[QUARK-INFO]: C++ kernel compilation check start.
[QUARK-INFO]: C++ kernel build directory /home/karam/.cache/torch_extensions/py310_cu126/kernel_ext
[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 0.0249 seconds
[INFO]: Loading model ...
`torch_dtype` is deprecated! Use `dtype` instead!
Initializing tokenizer from ../../../../../models/qwen/qwen1.5-0.5b
[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.
[QUARK-INFO]: Configuration checking start.
[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight quantization and activation static quantization.
[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
    global_quant_config=QuantizationConfig(
        input_tensors=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        output_tensors=None,
        weight=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        bias=None,
        target_device=None,
    ),
    layer_type_quant_config={},
layer_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
kv_cache_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
    softmax_quant_spec=None,
    exclude=['lm_head'],
    algo_config=None,
    pre_quant_opt_config=[
    ],
    quant_mode=QuantizationMode.eager_mode,
    log_severity_level=1,
    version="0.9+1241e27",
)
[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu126 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.
[QUARK-INFO]: In-place OPs replacement start.
[QUARK-INFO]: Module exclusion from quantization summary:
| Exclude pattern | Number of modules excluded |
| lm_head | 1 |
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 7366.45it/s]
[QUARK-INFO]: Module replacement for quantization summary:
| Original module | Number original | Number replaced |
| Conv2d | 0 | 0 |
| Linear | 169 | 168 |
| ConvTranspose2d | 0 | 0 |
| Embedding | 1 | 0 |
| EmbeddingBag | 0 | 0 |
| Qwen2ForCausalLM | 1 | 0 |
| Qwen2Model | 1 | 0 |
| ModuleList | 1 | 0 |
| Qwen2DecoderLayer | 24 | 0 |
| Qwen2Attention | 24 | 0 |
| Qwen2MLP | 24 | 0 |
| SiLU | 24 | 0 |
| Qwen2RMSNorm | 49 | 0 |
| Qwen2RotaryEmbedding | 1 | 0 |
[QUARK-INFO]: In-place OPs replacement end.
[QUARK-INFO]: Calibration start.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:09<00:00, 13.69it/s]
[QUARK-INFO]: Calibration end.
[QUARK-INFO]: Model quantization has been completed.
[QUARK-INFO]: Freeze model start.
[QUARK-INFO]: Freeze model end.
[INFO]: Exporting onnx graph...
[QUARK-INFO]: Start exporting quantized onnx model ...
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/masking_utils.py:521: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask = torch.where(mask, torch.tensor(0.0, device=mask.device, dtype=dtype), min_dtype)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:92: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
self.keys = torch.tensor([], dtype=self.dtype, device=self.device)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:93: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
self.values = torch.tensor([], dtype=self.dtype, device=self.device)
Traceback (most recent call last):
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 269, in <module>
    main(args)
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 152, in main
    exporter.export_onnx_model(model, input_args, uint4_int4_flag=uint4_int4_flag)
  File "/home/karam/work/projects/19_quark/Quark/quark/torch/export/api.py", line 287, in export_onnx_model
    torch.onnx.export(model.eval(),
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/__init__.py", line 424, in export
    export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 522, in export
    _export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1457, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1080, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 964, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 871, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 1504, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, _out = torch._C._create_graph_by_tracing(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 132, in wrapper
    out_vars, _ = _flatten(outs)
RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache
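For what it's worth, I can get past the crash by exporting a small wrapper that disables the cache and returns only logits (rough sketch below; LogitsOnlyWrapper is my own name, not part of Quark), though this drops the KV-cache inputs/outputs that the fp8 kv-cache quantization is presumably meant to cover:

```python
import torch
from torch import nn

class LogitsOnlyWrapper(nn.Module):
    # Hypothetical wrapper (my own, not a Quark API): run the HF model with
    # use_cache=False so no DynamicCache is created, and return only the
    # logits tensor, which the JIT tracer can flatten.
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        out = self.model(input_ids=input_ids, use_cache=False)
        return out.logits

# Usage sketch: export the wrapper instead of the raw quantized model.
# wrapped = LogitsOnlyWrapper(quantized_model).eval()
# torch.onnx.export(wrapped, (dummy_input,), "model.onnx", opset_version=17)
```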