Kimi Reverse Proxy is a lightweight HTTP reverse proxy for Kimi K2.5 and K2.6 models that automatically adjusts sampling parameters (temperature, top_p) and thinking mode based on whether a thinking or non-thinking model is being used. It sits between your application and the backend LLM server (e.g., vLLM). It also provides /tokenize and /v1/models endpoints with full virtual model support.
This proxy's primary purpose is to:
- Accept requests for three virtual model names (configured via
-instant-model,-thinking-model, and-preserve-thinking-model), rejecting all other model names with HTTP 400 - Set appropriate sampling parameters automatically based on the model type (Kimi K2.5/K2.6 recommended values):
- Thinking mode:
temperature=1.0,top_p=0.95 - Instant mode:
temperature=0.6,top_p=0.95 - Preserve-thinking mode:
temperature=1.0,top_p=0.95(same as thinking mode, withpreserve_thinkingenabled)
- Thinking mode:
- Configure thinking mode by setting
chat_template_kwargs.thinking:thinking=truefor thinking modelthinking=falsefor instant modelthinking=true+preserve_thinking=truefor preserve-thinking model
- Rewrite the model name to the actual backend model name before forwarding to vLLM
- Fix vLLM response bugs where non-thinking, non-streaming responses incorrectly place content in
reasoning_contentorreasoningfields instead ofcontent - Enrich
/v1/modelsendpoint by fetching backend models and exposing 3 virtual models with the same metadata - Provide a
/tokenizeendpoint that replaces virtual model names with the backend model name before forwarding to vLLM's/tokenize
Requirements: Go 1.24.2 or later
go build -o kimi-rp ../kimi-rp \
-target "http://127.0.0.1:8000" \
-served-model "your-backend-model-name" \
-instant-model "kimi-k2.6-instant" \
-thinking-model "kimi-k2.6-thinking" \
-preserve-thinking-model "kimi-k2.6-thinking-preserve"Or using environment variables:
export KIMIRP_TARGET="http://127.0.0.1:8000"
export KIMIRP_SERVED_MODEL_NAME="your-backend-model-name"
export KIMIRP_INSTANT_MODEL_NAME="kimi-k2.6-instant"
export KIMIRP_THINKING_MODEL_NAME="kimi-k2.6-thinking"
export KIMIRP_PRESERVE_THINKING_MODEL_NAME="kimi-k2.6-thinking-preserve"
./kimi-rpConfigure the proxy using command-line flags or environment variables:
| Flag | Environment Variable | Default | Description |
|---|---|---|---|
-listen |
KIMIRP_LISTEN |
0.0.0.0 |
IP address to listen on |
-port |
KIMIRP_PORT |
9000 |
Port to listen on |
-target |
KIMIRP_TARGET |
http://127.0.0.1:8000 |
Backend target URL |
-loglevel |
KIMIRP_LOGLEVEL |
INFO |
Log level (COMPLETE, DEBUG, INFO, WARN, ERROR) |
-served-model |
KIMIRP_SERVED_MODEL_NAME |
(required) | Backend model name to use in outgoing requests |
-instant-model |
KIMIRP_INSTANT_MODEL_NAME |
(required) | Name of the instant model (e.g., kimi-k2.6-instant, kimi-k2.5-instant) |
-thinking-model |
KIMIRP_THINKING_MODEL_NAME |
(required) | Name of the thinking model (e.g., kimi-k2.6-thinking, kimi-k2.5-thinking) |
-preserve-thinking-model |
KIMIRP_PRESERVE_THINKING_MODEL_NAME |
(required) | Name of the preserve-thinking model (e.g., kimi-k2.6-thinking-preserve) |
-enforce-sampling-params |
KIMIRP_ENFORCE_SAMPLING_PARAMS |
false |
Enforce sampling parameters, overriding client-provided values |
By default, the proxy only sets sampling parameters if they are not already present in the request. When -enforce-sampling-params is enabled, the proxy will always override client-provided sampling parameters with the predefined values for the detected mode.
The preserve-thinking virtual model behaves like the thinking model but also injects preserve_thinking: true into chat_template_kwargs. This preserves full reasoning content across multi-turn interactions and enhances performance in coding agent scenarios. See the Kimi K2.6 model card for details.
GET /v1/models: Enriched (fetches backend models, validates served model, exposes 3 virtual models)POST /v1/chat/completions: Transformed (sampling params + thinking mode applied)POST /v1/completions: Model name validated and swapped (no sampling params or thinking mode — raw prompt completions bypass the chat template)POST /tokenize: Replaces virtual model names with backend model name and forwards to vLLM's/tokenize- All other paths: Passed through unchanged to the backend
For full functionality with thinking mode and tool calls using the Chat Completions API, the vLLM backend should be started with the following flags:
--reasoning-parser=qwen3 # Required for thinking/reasoning mode
--enable-auto-tool-choice --tool-call-parser=qwen3_coder # Required for tool/function callsThe proxy provides a /tokenize endpoint that forwards tokenization requests to vLLM's /tokenize. The proxy replaces virtual model names with the backend served model name, then forwards the request body unchanged. Two modes:
{"prompt": "..."}— raw text tokenization, forwarded as-is. No chat template is applied.{"messages": [...], "tools": [...]}— vLLM applies the model's chat template (apply_chat_template) then tokenizes the result. Messages and tools must be in Chat Completions API format.
GET /health: Returns{"status":"healthy"}for Docker health checks
The proxy supports the following log levels:
| Level | Description |
|---|---|
COMPLETE |
Most verbose - includes full HTTP request/response dumps |
DEBUG |
Debug information including parameter application details |
INFO |
General operational information |
WARN |
Warning messages |
ERROR |
Error messages only |
When set to COMPLETE, the proxy will log full HTTP request and response bodies, which is useful for debugging but very verbose.
COMPLETE log level will expose all this data in plaintext. Only enable it in secure, non-production environments or ensure logs are properly secured and retained temporarily.
The proxy includes native systemd support for production deployments:
- Type:
notify- The proxy signals readiness to systemd automatically - Status Updates: Sends periodic status updates to systemd showing processed request counts
- Graceful Shutdown: Properly signals systemd when stopping
- Journald Logging: Structured logging output is compatible with journald
Example systemd unit file:
[Unit]
Description=Kimi Reverse Proxy
After=network.target
[Service]
Type=notify
User=kimi-rp
Group=kimi-rp
ExecStart=/usr/local/bin/kimi-rp -served-model "your-backend-model" -instant-model "kimi-k2.6-instant" -thinking-model "kimi-k2.6-thinking" -preserve-thinking-model "kimi-k2.6-thinking-preserve"
Restart=on-failure
Environment=KIMIRP_LOGLEVEL=INFO
[Install]
WantedBy=multi-user.targetkimi-rp). Never run as root. Create the user with:
sudo useradd --system --no-create-home --shell /usr/sbin/nologin kimi-rp
sudo chown kimi-rp:kimi-rp /usr/local/bin/kimi-rpThe server supports graceful shutdown with a 3-minute timeout to allow in-flight requests to complete. Send SIGINT or SIGTERM to initiate shutdown. When running under systemd, the proxy will automatically signal the service manager when ready and during shutdown.
MIT License - see LICENSE file for details.