2025.4.1 is a minor release with bug fixes and improvements based on OpenVINO 2025.4.1.
Preview:
Added preview support for the GPT-OSS agentic use case.
As of 2025.4.1, the best accuracy is achieved with:
- --pipeline_type LM (without continuous batching and concurrency)
- --target_device GPU (this configuration was validated on Lunar Lake, Arrow Lake-H, and Intel Arc Battlemage dGPU with >=16 GB VRAM)
- INT4 model precision

A launch sketch with these settings follows below.
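A minimal sketch of a server launched with the recommended preview settings. Only --pipeline_type LM and --target_device GPU come from this release note; the model name, mounted path, and port are illustrative placeholders, so adjust them to your deployment.

```
# Sketch only: model name, mount path, and port are placeholders.
# --pipeline_type LM and --target_device GPU match the recommended
# preview configuration from this release.
docker run --rm -it --device /dev/dri -p 8000:8000 \
  -v $(pwd)/models:/models \
  openvino/model_server:2025.4.1-gpu \
  --model_name gpt-oss --model_path /models/gpt-oss \
  --rest_port 8000 \
  --target_device GPU \
  --pipeline_type LM
```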
Bug fixes:
- Fixed escaping of whitespace characters in string arguments for the qwen3coder tool-call parser.
- Changed handling of requests to the chat/completions endpoint with streaming and usage tracking for LLM pipelines without continuous batching. Such pipelines do not track generated tokens, and previously the last chunk was not delivered to the client, which could result in a missing token in the response. The last chunk is now delivered with token usage set to 0, which should be ignored (see the example request after this list).
- Minor documentation and demos fixes
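For illustration, a streamed request with usage reporting might look like the sketch below, assuming the OpenAI-compatible /v3/chat/completions REST endpoint; the model name and port are placeholders.

```
# Placeholder model name and port; streamed chat completion with usage.
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss",
        "stream": true,
        "stream_options": {"include_usage": true},
        "messages": [{"role": "user", "content": "Hello"}]
      }'
# On LLM pipelines without continuous batching, the final chunk now
# arrives with usage values set to 0, which clients should ignore.
```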
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2025.4.1 - CPU device support with image based on Ubuntu 24.04
docker pull openvino/model_server:2025.4.1-gpu - GPU, NPU and CPU device support with image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the suffix _python_on include Python support.
There is also an additional distribution channel at https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2025.4.1/