ONNX conversion support #8

@karam-nus

Description

When trying out ONNX graph export in the llm_ptq examples, I hit the following error:

RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache

Is there planned feature development for ONNX export of models using the HF implementation? Thanks!
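
For context, torch.onnx.export here goes through the TorchScript tracer, which can only flatten tensors, tuples, and lists (dicts and strings are accepted but discouraged); recent transformers versions return a `DynamicCache` object in the model outputs, which trips that check. A minimal sketch of a workaround I would expect to sidestep this, assuming a standard HF causal-LM (the wrapper class is my own, not Quark API):

```python
# Hypothetical wrapper (not part of Quark): make forward() return only plain
# tensors so the JIT tracer has nothing it cannot flatten.
import torch

class LogitsOnlyWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask=None):
        # use_cache=False keeps the HF model from returning past_key_values
        # (a DynamicCache); return_dict=False yields a plain tuple instead of
        # a ModelOutput, so outputs[0] is the logits tensor.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=False,
            return_dict=False,
        )
        return outputs[0]
```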

Full trace:

(quark) ➜  llm_ptq git:(release/0.9) ✗ python quantize_quark.py --model_dir ../../../../../models/qwen/qwen1.5-0.5b \
--quant_scheme w_fp8_a_fp8 --kv_cache_dtype fp8 --num_calib_data 128 --model_export onnx --output_dir ./qwen1.5-0.5b

[QUARK-INFO]: C++ kernel compilation check start.

[QUARK-INFO]: C++ kernel build directory /home/karam/.cache/torch_extensions/py310_cu126/kernel_ext

[QUARK-INFO]: C++ kernel loading. First-time compilation may take a few minutes...
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 12:19:40.967000 957122 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.

[QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 0.0249 seconds

[INFO]: Loading model ...
`torch_dtype` is deprecated! Use `dtype` instead!
Initializing tokenizer from ../../../../../models/qwen/qwen1.5-0.5b

[INFO]: Loading dataset ...
Repo card metadata block was not found. Setting CardData to empty.

[QUARK-INFO]: Configuration checking start.

[QUARK-INFO]: Configuration checking end. The configuration is effective. This is weight quantization and activation static quantization.

[QUARK-INFO]: Quantizing with the quantization configuration:
Config(
    global_quant_config=QuantizationConfig(
        input_tensors=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        output_tensors=None,
        weight=QuantizationSpec(
            dtype=Dtype.fp8_e4m3,
            observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>,
            is_dynamic=False,
            qscheme=QSchemeType.per_tensor,
            ch_axis=None,
            group_size=None,
            symmetric=None,
            round_method=None,
            scale_type=None,
            scale_format=None,
            scale_calculation_mode=None,
            qat_spec=None,
            mx_element_dtype=None,
            zero_point_type=ZeroPointType.int32,
            is_scale_quant=False,
        ),
        bias=None,
        target_device=None,
    ),
    layer_type_quant_config={},
    layer_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
    kv_cache_quant_config={'*k_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None), '*v_proj': QuantizationConfig(input_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), output_tensors=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), weight=QuantizationSpec(dtype=<Dtype.fp8_e4m3: 'fp8_e4m3'>, observer_cls=<class 'quark.torch.quantization.observer.observer.PerTensorMinMaxObserver'>, is_dynamic=False, qscheme=<QSchemeType.per_tensor: 'per_tensor'>, ch_axis=None, group_size=None, symmetric=None, round_method=None, scale_type=None, scale_format=None, scale_calculation_mode=None, qat_spec=None, mx_element_dtype=None, zero_point_type=<ZeroPointType.int32: 'int32'>, is_scale_quant=False), bias=None, target_device=None)},
    softmax_quant_spec=None,
    exclude=['lm_head'],
    algo_config=None,
    pre_quant_opt_config=[
    ],
    quant_mode=QuantizationMode.eager_mode,
    log_severity_level=1,
    version="0.9+1241e27",
)
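
As a side note for anyone reproducing this, the dump above maps onto Quark's config dataclasses roughly as below. The observer's import path appears verbatim in the log; the other import paths are my guesses from the printed class names and may differ between releases:

```python
# Rough reconstruction of the logged config; import paths other than the
# observer's are assumptions, not verified against this Quark release.
from quark.torch.quantization.config.config import (
    Config, QuantizationConfig, QuantizationSpec)
from quark.torch.quantization.config.type import Dtype, QSchemeType
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver

# FP8 E4M3, per-tensor, static quantization, as shown in the dump; fields
# printed as None are left at their defaults here.
fp8_per_tensor = QuantizationSpec(
    dtype=Dtype.fp8_e4m3,
    observer_cls=PerTensorMinMaxObserver,
    is_dynamic=False,
    qscheme=QSchemeType.per_tensor,
)

config = Config(
    global_quant_config=QuantizationConfig(
        input_tensors=fp8_per_tensor,
        weight=fp8_per_tensor,
    ),
    exclude=["lm_head"],
)
```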

[QUARK-WARNING]: Lack of specific information of pre-optimization configuration. However, PyTorch version 2.8.0+cu126 detected. Only torch versions between 2.2 and 2.4 support auto generating algorithms configuration.

[QUARK-INFO]: In-place OPs replacement start.

[QUARK-INFO]: Module exclusion from quantization summary:
|      Exclude pattern       | Number of modules excluded |
|          lm_head           |             1              |

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319/319 [00:00<00:00, 7366.45it/s]

[QUARK-INFO]: Module replacement for quantization summary:
|            Original module             |  Number original   |  Number replaced   |
|                 Conv2d                 |         0          |         0          |
|                 Linear                 |        169         |        168         |
|            ConvTranspose2d             |         0          |         0          |
|               Embedding                |         1          |         0          |
|              EmbeddingBag              |         0          |         0          |
|            Qwen2ForCausalLM            |         1          |         0          |
|               Qwen2Model               |         1          |         0          |
|               ModuleList               |         1          |         0          |
|           Qwen2DecoderLayer            |         24         |         0          |
|             Qwen2Attention             |         24         |         0          |
|                Qwen2MLP                |         24         |         0          |
|                  SiLU                  |         24         |         0          |
|              Qwen2RMSNorm              |         49         |         0          |
|          Qwen2RotaryEmbedding          |         1          |         0          |


[QUARK-INFO]: In-place OPs replacement end.

[QUARK-INFO]: Calibration start.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:09<00:00, 13.69it/s]

[QUARK-INFO]: Calibration end.

[QUARK-INFO]: Model quantization has been completed.

[QUARK-INFO]: Freeze model start.

[QUARK-INFO]: Freeze model end.

[INFO]: Exporting onnx graph...

[QUARK-INFO]: Start exporting quantized onnx model ...
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/masking_utils.py:521: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask = torch.where(mask, torch.tensor(0.0, device=mask.device, dtype=dtype), min_dtype)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:92: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  self.keys = torch.tensor([], dtype=self.dtype, device=self.device)
/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/transformers/cache_utils.py:93: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  self.values = torch.tensor([], dtype=self.dtype, device=self.device)
Traceback (most recent call last):
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 269, in <module>
    main(args)
  File "/home/karam/work/projects/19_quark/Quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py", line 152, in main
    exporter.export_onnx_model(model, input_args, uint4_int4_flag=uint4_int4_flag)
  File "/home/karam/work/projects/19_quark/Quark/quark/torch/export/api.py", line 287, in export_onnx_model
    torch.onnx.export(model.eval(),
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/__init__.py", line 424, in export
    export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 522, in export
    _export(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1457, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 1080, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 964, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/onnx/utils.py", line 871, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 1504, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, _out = torch._C._create_graph_by_tracing(
  File "/home/karam/miniforge3/envs/quark/lib/python3.10/site-packages/torch/jit/_trace.py", line 132, in wrapper
    out_vars, _ = _flatten(outs)
RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache
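
For completeness, a usage sketch of exporting through such a wrapper (shapes, names, and the output path are placeholders):

```python
# Placeholder shapes/paths: export the wrapped model so the traced graph only
# ever sees tensor inputs and outputs.
dummy_input = torch.ones((1, 8), dtype=torch.long)
torch.onnx.export(
    LogitsOnlyWrapper(model).eval(),
    (dummy_input,),
    "qwen1.5-0.5b.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
```

Alternatively, if the KV cache must stay in the exported graph, transformers' `DynamicCache` has a `to_legacy_cache()` method that converts it back to the nested tuple format the tracer can flatten, though wiring that into Quark's export path would be a change on the library side.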
