kimi-rp

Kimi Reverse Proxy is a lightweight HTTP reverse proxy for Kimi K2.5 and K2.6 models that automatically adjusts sampling parameters (temperature, top_p) and thinking mode based on whether a thinking or non-thinking model is being used. It sits between your application and the backend LLM server (e.g., vLLM). It also provides /tokenize and /v1/models endpoints with full virtual model support.

Core Functionality

This proxy's primary purpose is to:

Accept requests for three virtual model names (configured via -instant-model, -thinking-model, and -preserve-thinking-model), rejecting all other model names with HTTP 400
Set appropriate sampling parameters automatically based on the model type (Kimi K2.5/K2.6 recommended values):
- Thinking mode: temperature=1.0, top_p=0.95
- Instant mode: temperature=0.6, top_p=0.95
- Preserve-thinking mode: temperature=1.0, top_p=0.95 (same as thinking mode, with preserve_thinking enabled)
Configure thinking mode by setting chat_template_kwargs.thinking:
- thinking=true for thinking model
- thinking=false for instant model
- thinking=true + preserve_thinking=true for preserve-thinking model
Rewrite the model name to the actual backend model name before forwarding to vLLM
Fix vLLM response bugs where non-thinking, non-streaming responses incorrectly place content in reasoning_content or reasoning fields instead of content
Enrich /v1/models endpoint by fetching backend models and exposing 3 virtual models with the same metadata
Provide a /tokenize endpoint that replaces virtual model names with the backend model name before forwarding to vLLM's /tokenize

Installation

Requirements: Go 1.24.2 or later

go build -o kimi-rp .

Usage

./kimi-rp \
  -target "http://127.0.0.1:8000" \
  -served-model "your-backend-model-name" \
  -instant-model "kimi-k2.6-instant" \
  -thinking-model "kimi-k2.6-thinking" \
  -preserve-thinking-model "kimi-k2.6-thinking-preserve"

Or using environment variables:

export KIMIRP_TARGET="http://127.0.0.1:8000"
export KIMIRP_SERVED_MODEL_NAME="your-backend-model-name"
export KIMIRP_INSTANT_MODEL_NAME="kimi-k2.6-instant"
export KIMIRP_THINKING_MODEL_NAME="kimi-k2.6-thinking"
export KIMIRP_PRESERVE_THINKING_MODEL_NAME="kimi-k2.6-thinking-preserve"
./kimi-rp

Configuration

Configure the proxy using command-line flags or environment variables:

Flag	Environment Variable	Default	Description
`-listen`	`KIMIRP_LISTEN`	`0.0.0.0`	IP address to listen on
`-port`	`KIMIRP_PORT`	`9000`	Port to listen on
`-target`	`KIMIRP_TARGET`	`http://127.0.0.1:8000`	Backend target URL
`-loglevel`	`KIMIRP_LOGLEVEL`	`INFO`	Log level (COMPLETE, DEBUG, INFO, WARN, ERROR)
`-served-model`	`KIMIRP_SERVED_MODEL_NAME`	(required)	Backend model name to use in outgoing requests
`-instant-model`	`KIMIRP_INSTANT_MODEL_NAME`	(required)	Name of the instant model (e.g., `kimi-k2.6-instant`, `kimi-k2.5-instant`)
`-thinking-model`	`KIMIRP_THINKING_MODEL_NAME`	(required)	Name of the thinking model (e.g., `kimi-k2.6-thinking`, `kimi-k2.5-thinking`)
`-preserve-thinking-model`	`KIMIRP_PRESERVE_THINKING_MODEL_NAME`	(required)	Name of the preserve-thinking model (e.g., `kimi-k2.6-thinking-preserve`)
`-enforce-sampling-params`	`KIMIRP_ENFORCE_SAMPLING_PARAMS`	`false`	Enforce sampling parameters, overriding client-provided values

Enforce Sampling Parameters

By default, the proxy only sets sampling parameters if they are not already present in the request. When -enforce-sampling-params is enabled, the proxy will always override client-provided sampling parameters with the predefined values for the detected mode.

Preserve Thinking Model

The preserve-thinking virtual model behaves like the thinking model but also injects preserve_thinking: true into chat_template_kwargs. This preserves full reasoning content across multi-turn interactions and enhances performance in coding agent scenarios. See the Kimi K2.6 model card for details.

Request Routing

GET /v1/models: Enriched (fetches backend models, validates served model, exposes 3 virtual models)
POST /v1/chat/completions: Transformed (sampling params + thinking mode applied)
POST /v1/completions: Model name validated and swapped (no sampling params or thinking mode — raw prompt completions bypass the chat template)
POST /tokenize: Replaces virtual model names with backend model name and forwards to vLLM's /tokenize
All other paths: Passed through unchanged to the backend

vLLM Backend Requirements

For full functionality with thinking mode and tool calls using the Chat Completions API, the vLLM backend should be started with the following flags:

--reasoning-parser=qwen3                                  # Required for thinking/reasoning mode
--enable-auto-tool-choice --tool-call-parser=qwen3_coder  # Required for tool/function calls

Tokenize API

The proxy provides a /tokenize endpoint that forwards tokenization requests to vLLM's /tokenize. The proxy replaces virtual model names with the backend served model name, then forwards the request body unchanged. Two modes:

{"prompt": "..."} — raw text tokenization, forwarded as-is. No chat template is applied.
{"messages": [...], "tools": [...]} — vLLM applies the model's chat template (apply_chat_template) then tokenizes the result. Messages and tools must be in Chat Completions API format.

Health Check

GET /health: Returns {"status":"healthy"} for Docker health checks

Log Levels

The proxy supports the following log levels:

Level	Description
`COMPLETE`	Most verbose - includes full HTTP request/response dumps
`DEBUG`	Debug information including parameter application details
`INFO`	General operational information
`WARN`	Warning messages
`ERROR`	Error messages only

When set to COMPLETE, the proxy will log full HTTP request and response bodies, which is useful for debugging but very verbose.

⚠️ Privacy Warning: LLM requests often contain sensitive or personal data (conversation history, personal information, confidential content). The COMPLETE log level will expose all this data in plaintext. Only enable it in secure, non-production environments or ensure logs are properly secured and retained temporarily.

systemd Integration

The proxy includes native systemd support for production deployments:

Type: notify - The proxy signals readiness to systemd automatically
Status Updates: Sends periodic status updates to systemd showing processed request counts
Graceful Shutdown: Properly signals systemd when stopping
Journald Logging: Structured logging output is compatible with journald

Example systemd unit file:

[Unit]
Description=Kimi Reverse Proxy
After=network.target

[Service]
Type=notify
User=kimi-rp
Group=kimi-rp
ExecStart=/usr/local/bin/kimi-rp -served-model "your-backend-model" -instant-model "kimi-k2.6-instant" -thinking-model "kimi-k2.6-thinking" -preserve-thinking-model "kimi-k2.6-thinking-preserve"
Restart=on-failure
Environment=KIMIRP_LOGLEVEL=INFO

[Install]
WantedBy=multi-user.target

⚠️ Security Best Practice: Always run the proxy under a dedicated, unprivileged user account (e.g., kimi-rp). Never run as root. Create the user with:

sudo useradd --system --no-create-home --shell /usr/sbin/nologin kimi-rp
sudo chown kimi-rp:kimi-rp /usr/local/bin/kimi-rp

Graceful Shutdown

The server supports graceful shutdown with a 3-minute timeout to allow in-flight requests to complete. Send SIGINT or SIGTERM to initiate shutdown. When running under systemd, the proxy will automatically signal the service manager when ready and during shutdown.

License

MIT License - see LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
completions.go		completions.go
config.go		config.go
go.mod		go.mod
go.sum		go.sum
helpers.go		helpers.go
main.go		main.go
models.go		models.go
passthrough.go		passthrough.go
tokenize.go		tokenize.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kimi-rp

Core Functionality

Installation

Usage

Configuration

Enforce Sampling Parameters

Preserve Thinking Model

Request Routing

vLLM Backend Requirements

Tokenize API

Health Check

Log Levels

systemd Integration

Graceful Shutdown

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kimi-rp

Core Functionality

Installation

Usage

Configuration

Enforce Sampling Parameters

Preserve Thinking Model

Request Routing

vLLM Backend Requirements

Tokenize API

Health Check

Log Levels

systemd Integration

Graceful Shutdown

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages