
moonshotai/Kimi-K2.5 Usage Guide

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes as well as conversational and agentic paradigms.

Use vLLM with Docker

Pull the vLLM release image from Docker Hub:

docker pull vllm/vllm-openai:v0.17.0-cu130 # CUDA 13.0
docker pull vllm/vllm-openai:v0.17.0       # Other CUDA versions

Hopper (x86_64)

Verified on 8×H200 GPUs:

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.0-cu130 moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --mm-encoder-tp-mode data \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --enable-auto-tool-choice \
    --trust-remote-code

Blackwell (aarch64)

NVIDIA Blackwell (e.g., GB200) is also supported via the aarch64 image:

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.0-aarch64-cu130 moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --enable-auto-tool-choice \
    --trust-remote-code
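With either container, loading the weights for a model this size can take several minutes before the endpoint accepts traffic. A minimal readiness poll, as a sketch: it assumes the server listens on localhost:8000, and `wait_for_server` is our own helper name, not part of vLLM.

```python
import json
import time
import urllib.request

def wait_for_server(base_url="http://localhost:8000", timeout_s=1800):
    """Poll /v1/models until the vLLM server responds or the timeout expires.

    Returns the list of served model IDs once the endpoint is up.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return [m["id"] for m in json.load(resp)["data"]]
        except OSError:
            time.sleep(10)  # server still loading weights; retry
    raise TimeoutError(f"no response from {base_url} within {timeout_s}s")
```

Once `wait_for_server()` returns (it should include `moonshotai/Kimi-K2.5`), the server is ready for the client examples below.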

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto

Running Kimi-K2.5 with vLLM

Use the following command to deploy Kimi-K2.5 with the vLLM inference server. The configuration below has been verified on 8×H200 GPUs.

vllm serve moonshotai/Kimi-K2.5 -tp 8 \
    --mm-encoder-tp-mode data \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --enable-auto-tool-choice \
    --trust-remote-code

The --reasoning-parser flag specifies the reasoning parser to use for extracting reasoning content from the model output.
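With the parser enabled, the OpenAI-compatible endpoint returns the extracted reasoning in a reasoning_content field alongside the final answer. A stdlib-only sketch of reading both, assuming the server is on localhost:8000 (the helper names are ours):

```python
import json
import urllib.request

def build_chat_payload(prompt, model="moonshotai/Kimi-K2.5", max_tokens=1024):
    """Assemble an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask_with_reasoning(prompt, base_url="http://localhost:8000/v1"):
    """Return (reasoning, answer) from one chat completion.

    reasoning_content is populated by --reasoning-parser kimi_k2 and
    may be None when the model answers in instant mode.
    """
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")
```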

Configuration Tips

  • --async-scheduling is turned on by default to improve overall system performance by overlapping scheduling overhead with the decoding process. If you run into issues with this feature, try turning it off and file a bug report with vLLM.
  • Specifying --mm-encoder-tp-mode data deploys the vision encoder in a data-parallel fashion for better performance. Because the vision encoder is very small, tensor parallelism brings little gain but incurs significant communication overhead. Enabling this feature does consume additional memory and may require adjusting --gpu-memory-utilization.
  • If your workload consists mostly of unique multimodal inputs, pass --mm-processor-cache-gb 0 to avoid caching overhead. Otherwise, specifying --mm-processor-cache-type shm enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which performs better at high TP settings.
  • vLLM supports Expert Parallelism (EP) via --enable-expert-parallel, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out Expert Parallelism Deployment for more details.
  • You can use benchmark_moe to perform MoE Triton kernel tuning for your hardware.

Benchmark on VisionArena-Chat Dataset

Once the server for the moonshotai/Kimi-K2.5 model is running, open another terminal and run the benchmark client:

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model moonshotai/Kimi-K2.5 \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --request-rate 20

Consume the OpenAI API Compatible Server

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
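Because the server was launched with --enable-auto-tool-choice and --tool-call-parser kimi_k2, the same endpoint also returns structured tool calls. The sketch below assumes a server on localhost:8000; the get_weather schema and the helper name are hypothetical placeholders for your own functions.

```python
import json
import urllib.request

# Hypothetical tool schema for illustration; swap in your own functions.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def request_tool_call(prompt, base_url="http://localhost:8000/v1"):
    """Ask the model to pick a tool.

    --tool-call-parser kimi_k2 converts the model's output into the
    structured tool_calls field of the OpenAI-compatible response.
    """
    body = {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    if calls:
        fn = calls[0]["function"]
        return fn["name"], json.loads(fn["arguments"])
    return None, message.get("content")  # model answered without a tool
```

A prompt like "What's the weather in Beijing?" would typically come back as `("get_weather", {"city": "Beijing"})`, which you can dispatch to your own implementation and feed back as a tool message.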

For more usage examples, check out the vLLM user guide for multimodal models and the official Kimi-K2.5 Hugging Face page!