ONNX conversion support #8

@karam-nus

Description

When trying out ONNX graph export in the llm_ptq examples, I hit the following error:

RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache

Is there planned feature development for ONNX export of models using the HF implementation? Thanks!
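
For context, torch.onnx.export here goes through the TorchScript tracer, which can only flatten tensors, tuples, and lists (dicts and strings are accepted but discouraged); recent transformers versions return a `DynamicCache` object in the model outputs, which trips that check. A minimal sketch of a workaround I would expect to sidestep this, assuming a standard HF causal-LM (the wrapper class is my own, not Quark API):

```python
# Hypothetical wrapper (not part of Quark): make forward() return only plain
# tensors so the JIT tracer has nothing it cannot flatten.
import torch

class LogitsOnlyWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask=None):
        # use_cache=False keeps the HF model from returning past_key_values
        # (a DynamicCache); return_dict=False yields a plain tuple instead of
        # a ModelOutput, so outputs[0] is the logits tensor.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=False,
            return_dict=False,
        )
        return outputs[0]
```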

Full trace:

(quark) ➜  llm_ptq git:(release/0.9) ✗ python quantize_quark.py --model_dir ../../../../../models/qwen/qwen1.5-0.5b \
--quant_scheme w_fp8_a_fp8 --kv_cache_dtype fp8 --num_calib_data 128 --model_export onnx --output_dir ./qwen1.5-0.5b

[QUARK-INFO]: C++ kernel compilation check start.

[QUARK-INFO]: C++ kernel build directory /home/karam/.cache/torch_extensions/py310_cu126/kernel_ext

[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.

[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 0.0249 seconds

[INFO]: Loading model ...
`torch_dtype` is deprecated! Use `dtype` instead!
Initializing tokenizer from ../../../../../models/qwen/qwen1.5-0.5b

[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.

[QUARK-INFO]: Configuration checking start.

[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight quantization and activation static quantization.

[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
    global_quant_config=QuantizationConfig(
        input_tensors=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        output_tensors=None,
        weight=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        bias=None,
        target_device=None,
    ),
    layer_type_quant_config={},
    layer_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
    kv_cache_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
    softmax_quant_spec=None,
    exclude=['lm_head'],
    algo_config=None,
    pre_quant_opt_config=[
    ],
    quant_mode=QuantizationMode.eager_mode,
    log_severity_level=1,
    version="0.9+1241e27",
)
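
As a side note for anyone reproducing this, the dump above maps onto Quark's config dataclasses roughly as below. The observer's import path appears verbatim in the log; the other import paths are my guesses from the printed class names and may differ between releases:

```python
# Rough reconstruction of the logged config; import paths other than the
# observer's are assumptions, not verified against this Quark release.
from quark.torch.quantization.config.config import (
    Config, QuantizationConfig, QuantizationSpec)
from quark.torch.quantization.config.type import Dtype, QSchemeType
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver

# FP8 E4M3, per-tensor, static quantization, as shown in the dump; fields
# printed as None are left at their defaults here.
fp8_per_tensor = QuantizationSpec(
    dtype=Dtype.fp8_e4m3,
    observer_cls=PerTensorMinMaxObserver,
    is_dynamic=False,
    qscheme=QSchemeType.per_tensor,
)

config = Config(
    global_quant_config=QuantizationConfig(
        input_tensors=fp8_per_tensor,
        weight=fp8_per_tensor,
    ),
    exclude=["lm_head"],
)
```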

[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu126 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.

[QUARK-INFO]: In-place OPs replacement start.

[QUARK-INFO]: Module exclusion from quantization summary:
|      Exclude pattern       | Number of modules excluded |
|          lm_head           |             1              |

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 7366.45it/s]

[QUARK-INFO]: Module replacement for quantization summary:
|            Original module             |  Number original   |  Number replaced   |
|                 Conv2d                 |         0          |         0          |
|                 Linear                 |        169         |        168         |
|            ConvTranspose2d             |         0          |         0          |
|               Embedding                |         1          |         0          |
|              EmbeddingBag              |         0          |         0          |
|            Qwen2ForCausalLM            |         1          |         0          |
|               Qwen2Model               |         1          |         0          |
|               ModuleList               |         1          |         0          |
|           Qwen2DecoderLayer            |         24         |         0          |
|             Qwen2Attention             |         24         |         0          |
|                Qwen2MLP                |         24         |         0          |
|                  SiLU                  |         24         |         0          |
|              Qwen2RMSNorm              |         49         |         0          |
|          Qwen2RotaryEmbedding          |         1          |         0          |


[QUARK-INFO]: In-place OPs replacement end.

[QUARK-INFO]: Calibration start.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:09<00:00, 13.69it/s]

[QUARK-INFO]: Calibration end.

[QUARK-INFO]: Model quantization has been completed.

[QUARK-INFO]: Freeze model start.

[QUARK-INFO]: Freeze model end.

[INFO]: Exporting onnx graph...

[QUARK-INFO]: Start exporting quantized onnx model ...
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/masking_utils.py:521: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask = torch.where(mask, torch.tensor(0.0, device=mask.device, dtype=dtype), min_dtype)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:92: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  self.keys = torch.tensor([], dtype=self.dtype, device=self.device)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:93: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  self.values = torch.tensor([], dtype=self.dtype, device=self.device)
Traceback (most recent call last):
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 269, in <module>
    main(args)
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 152, in main
    exporter.export_onnx_model(model, input_args, uint4_int4_flag=uint4_int4_flag)
  File "/home/karam/work/projects/19_quark/Quark/quark/torch/export/api.py", line 287, in export_onnx_model
    torch.onnx.export(model.eval(),
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/__init__.py", line 424, in export
    export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 522, in export
    _export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1457, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1080, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 964, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 871, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 1504, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, _out = torch._C._create_graph_by_tracing(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 132, in wrapper
    out_vars, _ = _flatten(outs)
RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache
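
For completeness, a usage sketch of exporting through such a wrapper (shapes, names, and the output path are placeholders):

```python
# Placeholder shapes/paths: export the wrapped model so the traced graph only
# ever sees tensor inputs and outputs.
dummy_input = torch.ones((1, 8), dtype=torch.long)
torch.onnx.export(
    LogitsOnlyWrapper(model).eval(),
    (dummy_input,),
    "qwen1.5-0.5b.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```

Alternatively, if the KV cache must stay in the exported graph, transformers' `DynamicCache` has a `to_legacy_cache()` method that converts it back to the nested tuple format the tracer can flatten, though wiring that into Quark's export path would be a change on the library side.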
