
OpenVINO GenAI LLM tips

Inference best practices

Optimum Intel

Optimum Intel is a collaboration of Hugging Face and Intel to accelerate transformers models on Intel hardware. With OpenVINO GenAI, we use Optimum Intel to export transformer models from the Hugging Face Hub to OpenVINO, using optimum-cli.

Install optimum-intel or upgrade to the latest version with:

pip install --upgrade --upgrade-strategy eager "optimum-intel[openvino]"

To get the version with the latest bugfixes and improvements, install optimum-intel from source with this command (this requires that Git is installed):

pip install --upgrade --upgrade-strategy eager "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git

See this installation guide for step-by-step instructions and tips.

Optimum is only used for exporting models. It is not needed for running inference.

Documentation: Optimum Intel model export
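
To illustrate the split between export and inference, here is a minimal sketch of inference with OpenVINO GenAI alone, without optimum-intel. It assumes the openvino-genai package is installed and that a model has already been exported to a local directory; the model path, function name, and prompt are hypothetical:

```python
def run_inference(model_dir: str, device: str = "CPU",
                  prompt: str = "What is OpenVINO?") -> str:
    """Run a prompt through an already-exported OpenVINO model.

    Only openvino-genai is needed here; optimum-intel was only used
    earlier, to export the model into `model_dir`.
    """
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline(model_dir, device)
    return pipe.generate(prompt, max_new_tokens=64)
```

For example, `run_inference("Llama-2-7b-chat-hf-ov-int4", "GPU")` would load the exported model on GPU and generate a response.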

INT4 weight quantization

The NNCF team maintains recommended quantization configs for popular models; see https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py#L44 . If you run optimum-cli with --weight-format int4 and no other weight-compression options, the model's default config is applied. These configs have been found to strike a good balance between accuracy and performance. They are not necessarily the best for every use case, but they are a good starting point. Note that NPU requires some modifications (see the NPU section below).

This documentation page explains the parameters for INT4 weight compression: https://docs.openvino.ai/2025/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.html . It is a generic page about weight-quantization options; optimum-cli export openvino --help shows the exact options to use with optimum-cli. For models that do not have default configs (yet), it often helps to use --awq --dataset wikitext2 for better accuracy.

Nightly

If you encounter issues with a particular model, it is often useful to try a nightly build:

pip install --pre --upgrade openvino-genai openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly

(Nightly builds can have their own issues, so this is not a recommendation to use nightly everywhere.)

NPU

For NPU, please refer to the GenAI NPU documentation or this step-by-step guide. Most importantly:

  • Make sure that the NPU driver is updated (Windows, Linux)
  • Use symmetric INT4 quantization (--weight-format int4 --sym) to export the model
    • for models > 4GB, use channel-wise quantization: --group-size -1. Group-wise quantization may also work, but model loading will be very slow.
    • for smaller models, both channel-wise (--group-size -1) and group-wise (--group-size 128) quantization are supported. Channel-wise quantization generally gives faster inference, faster model loading, and lower memory use, so it is a good method to start with. Try group-wise quantization if accuracy with channel-wise quantization is not acceptable (for accuracy, also see the note about --awq above).
  • Use OpenVINO GenAI 2025.1 or later (in general, the latest OpenVINO GenAI version is recommended).
  • Use model caching: set {"CACHE_DIR": "model_cache"} in pipeline_config and load the model with pipe = ov_genai.LLMPipeline(model_path, "NPU", **pipeline_config) (NOTE: see known issues section)
  • NPU pipeline_config options:
    • {"MAX_PROMPT_LEN": 2048, "MIN_RESPONSE_LEN": 512}. MAX_PROMPT_LEN defaults to 1024 and MIN_RESPONSE_LEN to 128. If your input is longer than 1024 tokens, or you expect more than 128 tokens of output, set these options. MIN_RESPONSE_LEN will not cause more tokens to be generated than config.max_new_tokens specifies, and generation still stops when an EOS token is encountered.
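
The NPU options above can be combined in a single pipeline_config dict. A minimal sketch, assuming the openvino-genai package; the model path and function name are hypothetical, and since loading requires an NPU plus an exported model, pipeline creation is wrapped in a function:

```python
# NPU pipeline options discussed above (defaults are 1024 / 128;
# CACHE_DIR enables model caching -- see the known issues section).
pipeline_config = {
    "MAX_PROMPT_LEN": 2048,
    "MIN_RESPONSE_LEN": 512,
    "CACHE_DIR": "model_cache",
}

def load_npu_pipeline(model_path: str):
    """Load an LLMPipeline on NPU with the config above.

    `model_path` points at a model exported with optimum-cli using
    symmetric INT4 quantization, as described in the bullets above.
    """
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(model_path, "NPU", **pipeline_config)
```

A pipeline loaded this way can then be used as usual, e.g. pipe.generate(prompt, max_new_tokens=256).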

Example optimum-cli command to export an NPU friendly model (which also works on other devices):

optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset wikitext2 Llama-2-7b-chat-hf-ov-int4

Use optimum-cli export openvino --help to see all options.

Known issues

2025.3

  • For NPU, CACHE_DIR does not speed up model loading in 2025.3. This is fixed in nightly (see above). On 2025.3, an alternative is to use NPUW_CACHE_DIR instead of CACHE_DIR (this works only for NPU and gives less speedup than CACHE_DIR).
  • On GPU, output from the first inference may differ from subsequent inferences; this is expected. If the first inference's output is worse than the second's, please report an issue. To avoid this variability, it can be useful to add a warmup inference: pipe.generate("hello", max_new_tokens=1).

Earlier versions

  • With OpenVINO GenAI 2025.1 and 2025.2, system prompts are ignored on CPU and GPU when using .start_chat(). This is fixed in 2025.3.
  • With OpenVINO 2025.1 (and earlier), inference on per-channel quantized INT4 models (exported with optimum-cli export openvino --group-size -1) on Meteor Lake GPU generates nonsense. This issue is fixed in 2025.2.

OpenVINO models on Hugging Face hub

The Hugging Face hub hosts many popular LLMs in OpenVINO format. Check out OpenVINO's and LLMWare's models. LLMWare also has several NPU-friendly models, recognizable by npu-ov in the name; these work well on both GPU and NPU.

To download these models, use:

huggingface-cli download model_id --local-dir local-model-dir

For example:

huggingface-cli download llmware/llama-3.2-3b-instruct-npu-ov --local-dir llama-3.2-3b-instruct-npu-ov

huggingface-cli is installed with pip install "huggingface-hub[cli]". It is already available if optimum-intel is installed in your Python environment.
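
Downloads can also be scripted from Python via the huggingface_hub API. A sketch, assuming huggingface-hub is installed; the function name is hypothetical, and the model id matches the CLI example above:

```python
def download_ov_model(model_id: str, local_dir: str) -> str:
    """Download a model repo from the Hugging Face hub to local_dir.

    Returns the path to the downloaded files, e.g.
    download_ov_model("llmware/llama-3.2-3b-instruct-npu-ov",
                      "llama-3.2-3b-instruct-npu-ov")
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=model_id, local_dir=local_dir)
```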