Faster cli startup / helper #1460

IlyasMoutawwakil · 2025-10-06T14:54:31Z

What does this PR do?

(optimum-intel) (base) ilyas@hf-dgx-01:~/optimum-intel$ time optimum-cli export openvino -h
usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
                                   [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}]
                                   [--quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}]
                                   [--library {transformers,diffusers,timm,sentence_transformers,open_clip}] [--cache_dir CACHE_DIR]
                                   [--pad-token-id PAD_TOKEN_ID] [--variant VARIANT] [--ratio RATIO] [--sym] [--group-size GROUP_SIZE]
                                   [--backup-precision {none,int8_sym,int8_asym}] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation]
                                   [--gptq] [--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
                                   [--quantization-statistics-path QUANTIZATION_STATISTICS_PATH] [--num-samples NUM_SAMPLES] [--disable-stateful]
                                   [--disable-convert-tokenizer] [--smooth-quant-alpha SMOOTH_QUANT_ALPHA] [--model-kwargs MODEL_KWARGS]
                                   output

options:
  -h, --help            show this help message and exit

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store the generated OV model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred from the model's metadata or files.
                        For tasks that generate text, add the `xxx-with-past` suffix to export the model using past key values caching. Available
                        tasks depend on the model, but are among the following list: ['audio-classification', 'audio-frame-classification',
                        'audio-xvector', 'automatic-speech-recognition', 'depth-estimation', 'document-question-answering', 'feature-extraction',
                        'fill-mask', 'image-classification', 'image-segmentation', 'image-text-to-text', 'image-to-image', 'image-to-text',
                        'inpainting', 'keypoint-detection', 'mask-generation', 'masked-im', 'multiple-choice', 'object-detection', 'question-
                        answering', 'reinforcement-learning', 'semantic-segmentation', 'sentence-similarity', 'text-classification', 'text-
                        generation', 'text-to-audio', 'text-to-image', 'text2text-generation', 'time-series-forecasting', 'token-classification',
                        'visual-question-answering', 'zero-shot-image-classification', 'zero-shot-object-detection'].
  --framework {pt,tf}   The framework to use for the export. If not provided, will attempt to use the local checkpoint's original framework or
                        what is available in the environment.
  --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option should only be set for
                        repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code
                        present in the model repository.
  --weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}
                        The weight format of the exported model. Option 'cb4' represents a codebook with 16 fixed fp8 values in E4M3 format.
  --quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}
                        Quantization precision mode. This is used for applying full model quantization including activations.
  --library {transformers,diffusers,timm,sentence_transformers,open_clip}
                        The library used to load the model before export. If not provided, will attempt to infer the local checkpoint's library
  --cache_dir CACHE_DIR
                        The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
  --pad-token-id PAD_TOKEN_ID
                        This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
  --variant VARIANT     If specified load weights from variant filename.
  --ratio RATIO         A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to
                        0.8, 80% of the layers will be quantized to int4 while 20% will be quantized to int8. This helps to achieve better
                        accuracy at the sacrifice of the model size and inference latency. Default value is 1.0. Note: If dataset is provided,
                        and the ratio is less than 1.0, then data-aware mixed precision assignment will be applied.
  --sym                 Whether to apply symmetric quantization. This argument is related to integer-typed --weight-format and --quant-mode
                        options. In case of full or mixed quantization (--quant-mode) symmetric quantization will be applied to weights in any
                        case, so only activation quantization will be affected by --sym argument. For weight-only quantization (--weight-format)
                        --sym argument does not affect backup precision. Examples: (1) --weight-format int8 --sym => int8 symmetric quantization
                        of weights; (2) --weight-format int4 => int4 asymmetric quantization of weights; (3) --weight-format int4 --sym --backup-
                        precision int8_asym => int4 symmetric quantization of weights with int8 asymmetric backup precision; (4) --quant-mode
                        int8 --sym => weights and activations are quantized to int8 symmetric data type; (5) --quant-mode int8 => activations are
                        quantized to int8 asymmetric data type, weights -- to int8 symmetric data type; (6) --quant-mode int4_f8e5m2 --sym =>
                        activations are quantized to f8e5m2 data type, weights -- to int4 symmetric data type.
  --group-size GROUP_SIZE
                        The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
  --backup-precision {none,int8_sym,int8_asym}
                        Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit weight formats. If not provided,
                        backup precision is int8_asym. 'none' stands for original floating-point precision of the model weights, in this case
                        weights are retained in their original precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric
                        quantization without zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with zero points per each
                        quantization group.
  --dataset DATASET     The dataset used for data-aware compression or quantization with NNCF. For language models you can use the one from the
                        list ['auto','wikitext2','c4','c4-new']. With 'auto' the dataset will be collected from model's generations. For
                        diffusion models it should be on of ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-
                        wit']. For visual language models the dataset must be set to 'contextual'. Note: if none of the data-aware compression
                        algorithms are selected and ratio parameter is omitted or equals 1.0, the dataset argument will not have an effect on the
                        resulting model.Note: for text generation task, datasets with English texts such as 'wikitext2','c4' or 'c4-new' usually
                        work fine even for non-English models.
  --all-layers          Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied,
                        they are compressed to INT8.
  --awq                 Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs. If dataset is provided, a data-
                        aware activation-based version of the algorithm will be executed, which requires additional time. Otherwise, data-free
                        AWQ will be applied which relies on per-column magnitudes of weights instead of activations. Note: it is possible that
                        there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
  --scale-estimation    Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed
                        layers. Providing a dataset is required to run scale estimation. Please note, that applying scale estimation takes
                        additional memory and time.
  --gptq                Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise fashion to minimize the
                        difference between activations of a compressed and original layer. Please note, that applying GPTQ takes additional
                        memory and time.
  --lora-correction     Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces low-rank adaptation layers
                        in the model that can recover accuracy after weight compression at some cost of inference latency. Please note, that
                        applying LoRA Correction algorithm takes additional memory and time.
  --sensitivity-metric SENSITIVITY_METRIC
                        The sensitivity metric for assigning quantization precision to layers. It can be one of the following:
                        ['weight_quantization_error', 'hessian_input_activation', 'mean_activation_variance', 'max_activation_variance',
                        'mean_activation_magnitude'].
  --quantization-statistics-path QUANTIZATION_STATISTICS_PATH
                        Directory path to dump/load data-aware weight-only quantization statistics. This is useful when running data-aware
                        quantization multiple times on the same model and dataset to avoid recomputing statistics. This option is applicable
                        exclusively for weight-only quantization. Please note that the statistics depend on the dataset, so if you change the
                        dataset, you should also change the statistics path to avoid confusion.
  --num-samples NUM_SAMPLES
                        The maximum number of samples to take from the dataset for quantization.
  --disable-stateful    Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default
                        when this key is not used. In stateful models all kv-cache inputs and outputs are hidden in the model and are not exposed
                        as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference performance.
                        Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native
                        inference code that expects KV-cache inputs and outputs in the model.
  --disable-convert-tokenizer
                        Do not add converted tokenizer and detokenizer OpenVINO models.
  --smooth-quant-alpha SMOOTH_QUANT_ALPHA
                        SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers and reduces quantization
                        error. Valid only when activations quantization is enabled.
  --model-kwargs MODEL_KWARGS
                        Any kwargs passed to the model forward, or used to customize the export for a given model.

real    0m0,137s
user    0m0,113s
sys     0m0,024s
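
The `time` figures above come from a single shell run. As a minimal sketch for repeating that measurement from Python, the hypothetical helper `time_command` below (not part of optimum-intel) runs a command several times and reports the median wall-clock time. The stand-in command is an assumption chosen so the sketch runs anywhere; substitute the `optimum-cli` invocation where the package is installed.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Return the median wall-clock seconds needed to run `cmd` to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command (assumption) so the sketch runs without optimum-intel;
    # swap in ["optimum-cli", "export", "openvino", "-h"] to time the real CLI.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"median wall time: {elapsed:.3f}s")
```

The median is used rather than the mean so that one slow cold-cache run does not dominate the result.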

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

setup.py

echarlaix

LGTM, thanks @IlyasMoutawwakil !!

IlyasMoutawwakil added 3 commits October 6, 2025 16:53

faster cli

d52d172

only support pt framework

37ab1d3

style

2dbf456
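
The commit titles above do not show how startup was made faster, and the diff is not part of this page. One common technique for a fast `-h`, shown here only as an assumption and not as what this PR does, is to build the argument parser with lightweight imports and defer heavy imports until a command actually runs. In the sketch below, `demo-cli`, `build_parser`, `run_export`, and `main` are hypothetical names, and `json` stands in for a slow-to-import dependency.

```python
import argparse
import importlib


def build_parser():
    # Only argparse is needed to describe the arguments, so `-h` stays cheap.
    parser = argparse.ArgumentParser(prog="demo-cli")
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("output")
    return parser


def run_export(args):
    # The heavy dependency is imported only when the command executes.
    # `json` is a stand-in (assumption) for a slow-to-import library.
    heavy = importlib.import_module("json")
    return heavy.dumps({"model": args.model, "output": args.output})


def main(argv=None):
    # Parsing (and printing help) happens before any heavy import is triggered.
    args = build_parser().parse_args(argv)
    return run_export(args)


if __name__ == "__main__":
    print(main(["-m", "some-model", "out_dir"]))
```

With this layout, `demo-cli -h` exits inside `parse_args` and never reaches the deferred import.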

IlyasMoutawwakil commented Oct 7, 2025

setup.py (outdated review thread, resolved)

Apply suggestion from @IlyasMoutawwakil

d1cbfca

IlyasMoutawwakil requested a review from echarlaix October 7, 2025 12:13

echarlaix approved these changes Oct 7, 2025

echarlaix merged commit ec96471 into main Oct 7, 2025
29 checks passed

echarlaix deleted the cli-optimumazation branch October 7, 2025 13:43
