Enhanced support for Qwen3-MOE models and gpt-oss-20b
These models now deliver improved performance, accuracy, and robust concurrent request handling with continuous batching. They are available in pre-optimized OpenVINO™ format directly on the Hugging Face hub, making them easy to deploy. Check the demos to learn how to use them (a minimal client sketch follows the list below):
- Integration with agentic frameworks
- Integration with Visual Studio Code
- Integration with OpenWebUI
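As a minimal sketch of using one of these pre-optimized models, the snippet below sends a chat completion request to a running model server through its OpenAI-compatible REST API with the official `openai` Python client. The base URL, port, and served model name are assumptions that depend on how your server instance was started.

```python
from openai import OpenAI

# Assumed address of a locally running OpenVINO Model Server instance
# exposing an OpenAI-compatible REST API; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="Qwen3-30B-A3B",  # placeholder: name under which the model is served
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the server handles concurrent requests with continuous batching, several such clients can send requests in parallel against the same served model.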
Added support for Qwen3-VL
This model family provides function calling capabilities, enabling this vision language model to be used in agentic scenarios. Usage examples are included in the demos mentioned above.
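The sketch below illustrates, under the same assumptions about the server address and served model name, how a function call could be requested from a Qwen3-VL model using the standard OpenAI `tools` format; the `get_weather` tool definition is purely hypothetical.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")  # assumed address

# Hypothetical tool definition following the OpenAI "tools" schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-VL-8B",  # placeholder: name under which the model is served
    messages=[{"role": "user", "content": "What is the weather like in Warsaw?"}],
    tools=tools,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":  # the model decided to call a tool
    call = choice.message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```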
Extended /image endpoint to support inpainting and outpainting capabilities.
It is now possible to pass an input image along with a mask to edit selected parts of the image (inpainting) or to extend it beyond its original borders (outpainting).
Check the image generation demo to see how to use these capabilities.
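As a rough illustration, and assuming the extended endpoint follows the OpenAI images edit request format, an inpainting call could look like the sketch below; the server address, served model name, and file names are assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")  # assumed address

# The mask marks the region of the input image that should be regenerated.
with open("input.png", "rb") as image, open("mask.png", "rb") as mask:
    response = client.images.edit(
        model="FLUX.1-dev",  # placeholder: name under which the image model is served
        image=image,
        mask=mask,
        prompt="Replace the masked area with a wooden bridge",
        size="1024x1024",
    )

# Assuming the response carries base64-encoded image data, as in the OpenAI API.
with open("edited.png", "wb") as out:
    out.write(base64.b64decode(response.data[0].b64_json))
```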
Other improvements and fixes:
- Server logs now report the current KV cache allocation alongside current usage metrics. With dynamic cache size (the default setting), the allocation automatically scales at runtime based on request concurrency and the processed context length.
- Generation request cancellation is now supported for NPU devices, where requests from disconnected clients will be cancelled.
- The finish reason now returns `tool_calls` when the model generates a function call, in line with OpenAI API standards.
- Corrected token usage reporting in the last streaming event of text generation with NPU execution.
- Added an extra streaming event right after the first token is generated, in line with the OpenAI API. This corrects TTFT metric benchmarking with tools that rely on streaming events (see the streaming sketch after this list).
- Enhanced error handling for Hugging Face Hub model pulling and downloading now includes retry and resume capabilities to address network connectivity issues with large model files. Download operations can recover from earlier network connectivity errors, and failures are reported in the logs when recovery is not possible.
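The sketch below shows how a streaming-based benchmarking tool can now measure TTFT from the first streamed token and read token usage from the last streaming event. It assumes the same server address and served model name as above, and that the server honors the OpenAI `stream_options` field.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")  # assumed address

start = time.perf_counter()
first_token_time = None

stream = client.chat.completions.create(
    model="Qwen3-30B-A3B",  # placeholder: name under which the model is served
    messages=[{"role": "user", "content": "Write a haiku about batching."}],
    stream=True,
    stream_options={"include_usage": True},  # request token usage in the final event
)

for event in stream:
    if event.choices and event.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # TTFT: time until the first token arrives
        print(event.choices[0].delta.content, end="", flush=True)
    if event.usage is not None:  # usage is attached to the last streaming event
        print(f"\ncompletion tokens: {event.usage.completion_tokens}")

print(f"TTFT: {first_token_time - start:.3f} s")
```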
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
- `docker pull openvino/model_server:2026.1` - CPU device support with image based on Ubuntu 24.04
- `docker pull openvino/model_server:2026.1-gpu` - GPU, NPU and CPU device support with image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the `_python_on` suffix include Python support.
There is also an additional distribution channel via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2026.1.0/
Check the instructions on how to install the binary package. The prebuilt image is also available on the Red Hat Ecosystem Catalog.