Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens on top of Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes as well as conversational and agentic paradigms.
Pull the vLLM release image from Docker Hub:
```bash
docker pull vllm/vllm-openai:v0.17.0-cu130  # CUDA 13.0
docker pull vllm/vllm-openai:v0.17.0        # Other CUDA versions
```

Verified on 8×H200 GPUs:

```bash
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.0-cu130 moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --compilation_config.pass_config.fuse_allreduce_rms true \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```

NVIDIA Blackwell (e.g., GB200) is also supported via the aarch64 image:
```bash
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.0-aarch64-cu130 moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --compilation_config.pass_config.fuse_allreduce_rms true \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```

Alternatively, install vLLM in a local environment with uv:

```bash
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
```

Use the following command to deploy Kimi-K2.5 with the vLLM inference server. The configuration below has been verified on 8×H200 GPUs.
```bash
vllm serve moonshotai/Kimi-K2.5 -tp 8 \
  --mm-encoder-tp-mode data \
  --compilation_config.pass_config.fuse_allreduce_rms true \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```

- The `--reasoning-parser` flag specifies the reasoning parser used to extract reasoning content from the model output.
- `--async-scheduling` is turned on by default to improve overall system performance by overlapping scheduling overhead with the decoding process. If you run into issues with this feature, try turning it off and file a bug report to vLLM.
- Specifying `--mm-encoder-tp-mode data` deploys the vision encoder in a data-parallel fashion for better performance: the vision encoder is very small, so tensor parallelism brings little gain but incurs significant communication overhead. Enabling this feature does consume additional memory and may require adjusting `--gpu-memory-utilization`.
- If your workload involves mostly unique multimodal inputs, it is recommended to pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, specifying `--mm-processor-cache-type shm` enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which performs better at high TP settings.
- vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out Expert Parallelism Deployment for more details.
- You can use `benchmark_moe` to tune the MoE Triton kernels for your hardware.
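Because the server is launched with `--enable-auto-tool-choice` and `--tool-call-parser kimi_k2`, clients can pass OpenAI-style function schemas and the parser will turn the model's tool calls into structured `tool_calls` entries on the response message. A minimal sketch of such a schema — the `get_weather` function and its fields are illustrative placeholders, not part of Kimi-K2.5 or vLLM:

```python
import json

# Illustrative OpenAI-style tool schema; the function name and parameters
# are hypothetical examples for demonstration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# This list is passed as the `tools` argument of
# client.chat.completions.create(...); with auto tool choice enabled, the
# server decides when to emit a tool call.
print(json.dumps(tools, indent=2))
```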
Once the server for the moonshotai/Kimi-K2.5 model is running, open another terminal and run the benchmark client:
```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model moonshotai/Kimi-K2.5 \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --request-rate 20
```

You can query the running server with a multimodal request using the OpenAI-compatible Python client:

```python
import time

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Read all the text in the image.",
            },
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048,
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
```

For more usage examples, check out the vLLM user guide for multimodal models and the official Kimi-K2.5 Hugging Face page!
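Local images can also be sent without hosting them at a URL: the OpenAI-compatible `image_url` field accepts base64-encoded data URIs. A sketch of building one — the bytes below are a placeholder standing in for a real image file read from disk:

```python
import base64

# Placeholder bytes standing in for an image file, e.g.:
#   with open("receipt.png", "rb") as f: image_bytes = f.read()
image_bytes = b"\x89PNG\r\n\x1a\n"  # PNG magic bytes only, for illustration

# Encode into a data URI usable in the image_url field
b64 = base64.b64encode(image_bytes).decode("utf-8")
data_uri = f"data:image/png;base64,{b64}"

# Drop-in replacement for the remote URL in the messages payload above
image_part = {"type": "image_url", "image_url": {"url": data_uri}}
print(data_uri[:40])
```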