This document describes all configuration options available in the llm-d KV Cache libraries. All configurations are JSON-serializable.
This package consists of two components:
- KV Cache Indexer: Manages the KV cache index, allowing efficient retrieval of cached blocks.
- KV Event Processing: Handles events from vLLM to update the cache index.
See the Architecture Overview for a high-level view of how these components work and interact.
The two components are configured separately, but they share the index backend (which stores KV-block localities) and the token processor (which converts tokens into blocks).
The token processor is configured via the `tokenProcessorConfig` field in the main configuration.
The index backend is configured via the `kvBlockIndexConfig` field in the KV Cache Indexer configuration.
The main configuration structure for the llm-d KV Cache system:

```json
{
  "indexerConfig": { ... },
  "kvEventsConfig": { ... },
  "tokenProcessorConfig": { ... }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `indexerConfig` | `IndexerConfig` | Configuration for the KV Cache Indexer module | See defaults |
| `kvEventsConfig` | `KVEventsConfig` | Configuration for the KV Event Processing pool | See defaults |
| `tokenProcessorConfig` | `TokenProcessorConfig` | Configuration for token processing | See defaults |
The indexer configuration structure for the KV Cache Indexer module:

```json
{
  "kvBlockIndexConfig": { ... },
  "tokenizersPoolConfig": { ... },
  "kvCacheBackendConfigs": { ... }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `kvBlockIndexConfig` | `IndexConfig` | Configuration for KV block indexing | See defaults |
| `tokenizersPoolConfig` | `Config` | Configuration for the tokenization pool | See defaults |
| `kvCacheBackendConfigs` | `KVCacheBackendConfig` | Configuration for KV Cache Device Backends | See defaults |
Here's a complete configuration example with all options:
```json
{
  "kvBlockIndexConfig": {
    "inMemoryConfig": {
      "size": 100000000,
      "podCacheSize": 10
    },
    "enableMetrics": true,
    "metricsLoggingInterval": "1m0s"
  },
  "tokenizersPoolConfig": {
    "modelName": "namespace/model-name",
    "workersCount": 8,
    "hf": {
      "huggingFaceToken": "your_hf_token_here",
      "tokenizersCacheDir": "/tmp/tokenizers"
    },
    "local": {
      "autoDiscoveryDir": "/mnt/models",
      "autoDiscoveryTokenizerFileName": "tokenizer.json"
    }
  },
  "kvCacheBackendConfigs": [
    {
      "name": "gpu",
      "weight": 1.0
    },
    {
      "name": "cpu",
      "weight": 0.8
    }
  ]
}
```

Configures the KV-block index backend. Multiple backends can be configured, but only the first available one is used.
```json
{
  "inMemoryConfig": { ... },
  "costAwareMemoryConfig": { ... },
  "redisConfig": { ... },
  "enableMetrics": false
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `inMemoryConfig` | `InMemoryIndexConfig` | In-memory index configuration | See defaults |
| `costAwareMemoryConfig` | `CostAwareMemoryIndexConfig` | Cost-aware memory index configuration | `null` |
| `redisConfig` | `RedisIndexConfig` | Redis index configuration | `null` |
| `enableMetrics` | boolean | Enable admissions/evictions/hits/misses recording | `false` |
| `metricsLoggingInterval` | string (duration) | Interval at which metrics are logged (e.g., `"1m0s"`). If zero or omitted, metrics logging is disabled. Requires `enableMetrics` to be `true`. | `"0s"` |
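Metrics recording and metrics logging are gated separately: periodic log output requires both `enableMetrics: true` and a nonzero `metricsLoggingInterval`. A minimal sketch (the 30-second interval is an illustrative value, not a recommendation):

```json
{
  "inMemoryConfig": { "size": 100000000, "podCacheSize": 10 },
  "enableMetrics": true,
  "metricsLoggingInterval": "30s"
}
```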
Configures the in-memory KV block index implementation.
```json
{
  "size": 100000000,
  "podCacheSize": 10
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | integer | Maximum number of keys that can be stored | `100000000` |
| `podCacheSize` | integer | Maximum number of pod entries per key | `10` |
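Together these two limits bound the index's worst case: up to `size` keys, each holding up to `podCacheSize` pod entries. A quick back-of-the-envelope sketch (per-entry byte overhead depends on the implementation and is not computed here):

```python
# Worst-case entry count for the default in-memory index settings
size = 100_000_000     # maximum number of keys (default)
pod_cache_size = 10    # maximum pod entries per key (default)

max_pod_entries = size * pod_cache_size
print(max_pod_entries)  # 1000000000
```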
Configures the cost-aware memory-based KV block index implementation using Ristretto cache.
```json
{
  "size": "2GiB"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | string | Maximum memory size for the cache. Supports human-readable formats like `"2GiB"`, `"500MiB"`, `"1GB"`, etc. | `"2GiB"` |
Configures the Redis-backed KV block index implementation.
```json
{
  "address": "redis://127.0.0.1:6379"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `address` | string | Redis server address (can include auth: `redis://user:pass@host:port/db`) | `"redis://127.0.0.1:6379"` |
| `backendType` | string | Backend type: `"redis"` or `"valkey"` (optional, mainly for documentation) | `"redis"` |
| `enableRDMA` | boolean | Enable RDMA transport for Valkey (experimental, requires Valkey with RDMA support) | `false` |
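An authenticated example, following the URL form documented above (the hostname, credentials, and database number are placeholders):

```json
{
  "redisConfig": {
    "address": "redis://cacheuser:changeme@redis.example.svc:6379/0"
  }
}
```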
Configures the Valkey-backed KV block index implementation. Valkey is a Redis-compatible, open-source alternative that supports RDMA for improved latency.
```json
{
  "address": "valkey://127.0.0.1:6379",
  "backendType": "valkey",
  "enableRDMA": false
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `address` | string | Valkey server address. Supports `valkey://`, `valkeys://` (SSL), `redis://`, or a plain address | `"valkey://127.0.0.1:6379"` |
| `backendType` | string | Should be `"valkey"` for Valkey instances | `"valkey"` |
| `enableRDMA` | boolean | Enable RDMA transport (requires a Valkey server with RDMA support) | `false` |
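Putting the Valkey fields together, an RDMA-enabled sketch (the hostname is a placeholder, and `enableRDMA` still requires a Valkey server built with RDMA support):

```json
{
  "address": "valkey://valkey.example.svc:6379",
  "backendType": "valkey",
  "enableRDMA": true
}
```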
Note: Both Redis and Valkey configurations use the same RedisIndexConfig structure since Valkey is API-compatible with Redis.
Configures the tokenization worker pool and cache utilization strategy.

```json
{
  "modelName": "namespace/model-name",
  "workersCount": 5,
  "hf": {
    "enabled": true,
    "huggingFaceToken": "",
    "tokenizersCacheDir": ""
  },
  "local": {
    "autoDiscoveryDir": "/mnt/models",
    "autoDiscoveryTokenizerFileName": "tokenizer.json",
    "modelTokenizerMap": {
      "my-model": "/path/to/custom-model/tokenizer.json"
    }
  }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `modelName` | string | Base model name for the tokenizer | |
| `workersCount` | integer | Number of tokenization worker goroutines | `5` |
| `hf` | `HFTokenizerConfig` | HuggingFace tokenizer config | |
| `local` | `LocalTokenizerConfig` | Local tokenizer config | |
Configures loading tokenizers from local files. Useful for air-gapped environments or when models are pre-loaded.
```json
{
  "autoDiscoveryDir": "/mnt/models",
  "autoDiscoveryTokenizerFileName": "tokenizer.json",
  "modelTokenizerMap": {
    "my-model": "/path/to/custom-model/tokenizer.json"
  }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `autoDiscoveryDir` | string | Directory to recursively scan for tokenizer files. Can be set via the `LOCAL_TOKENIZER_DIR` environment variable. | `"/mnt/models"` |
| `autoDiscoveryTokenizerFileName` | string | Filename to search for during auto-discovery. Can be set via the `LOCAL_TOKENIZER_FILENAME` environment variable. | `"tokenizer.json"` |
| `modelTokenizerMap` | map[string]string | Manual mapping from model name to tokenizer file path. Overrides auto-discovered model mappings. | `{}` |
Auto-Discovery Behavior:

When `autoDiscoveryDir` is set, the system recursively scans the directory for files matching `autoDiscoveryTokenizerFileName`. It supports two directory structure patterns:

1. HuggingFace cache structure (automatically detected):

   ```
   ~/.cache/huggingface/hub/
     models--Qwen--Qwen3-0.6B/snapshots/{hash}/tokenizer.json               → Model name: "Qwen/Qwen3-0.6B"
     models--meta-llama--Llama-2-7b-chat-hf/snapshots/{hash}/tokenizer.json → Model name: "meta-llama/Llama-2-7b-chat-hf"
   ```

2. Custom directory structure (arbitrary nesting):

   ```
   /mnt/models/
     llama-7b/tokenizer.json       → Model name: "llama-7b"
     Qwen/Qwen3/tokenizer.json     → Model name: "Qwen/Qwen3"
     org/team/model/tokenizer.json → Model name: "org/team/model"
   ```

Environment Variables:
- `LOCAL_TOKENIZER_DIR`: Overrides the default auto-discovery directory
- `LOCAL_TOKENIZER_FILENAME`: Overrides the default tokenizer filename
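The same settings can be supplied through the environment instead of the configuration file; a sketch (the directory path is illustrative):

```shell
# Override auto-discovery settings via environment variables
export LOCAL_TOKENIZER_DIR=/opt/tokenizers
export LOCAL_TOKENIZER_FILENAME=tokenizer.json
```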
Configures the HuggingFace tokenizer backend for downloading tokenizers from HuggingFace Hub.
```json
{
  "enabled": true,
  "huggingFaceToken": "",
  "tokenizersCacheDir": "./bin"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `enabled` | boolean | Enable the HuggingFace tokenizer backend | `true` |
| `huggingFaceToken` | string | HuggingFace API token for accessing private models | `""` |
| `tokenizersCacheDir` | string | Local directory for caching downloaded tokenizers | `"./bin"` |
Note: The system uses a composite tokenizer by default that tries local tokenizers first, then falls back to HuggingFace tokenizers if enabled and the model is not found locally.
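Since the composite tokenizer only falls back to HuggingFace when that backend is enabled, an air-gapped deployment can disable it and rely on local tokenizers alone. A hedged sketch (paths are illustrative):

```json
{
  "hf": { "enabled": false },
  "local": {
    "autoDiscoveryDir": "/mnt/models",
    "autoDiscoveryTokenizerFileName": "tokenizer.json"
  }
}
```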
Configures the available device backends that store the KV cache blocks; the backend weights are used in scoring.

```json
[
  {
    "name": "gpu",
    "weight": 1.0
  },
  {
    "name": "cpu",
    "weight": 0.8
  }
]
```

Configures the ZMQ event processing pool for handling KV cache events. The pool supports two modes:
- Static Endpoint Mode: Connects to a single ZMQ endpoint
- Auto-Discovery Mode (default): Automatically discovers and subscribes to per-pod ZMQ endpoints
```json
{
  "topicFilter": "kv@",
  "concurrency": 16,
  "discoverPods": true
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `zmqEndpoint` | string | ZMQ address to connect to | `""` |
| `topicFilter` | string | ZMQ subscription filter | `"kv@"` |
| `concurrency` | integer | Number of parallel workers | `4` |
| `engineType` | string | Inference engine adapter type (`"vllm"` or `"sglang"`) | `"vllm"` |
| `discoverPods` | boolean | Enable the Kubernetes pod reconciler for automatic per-pod subscriber management | `true` |
| `podDiscoveryConfig` | `PodDiscoveryConfig` | Configuration for the pod reconciler (only used when `discoverPods` is `true`) | `null` |
For connecting to a single ZMQ endpoint:
```json
{
  "zmqEndpoint": "tcp://indexer:5557",
  "topicFilter": "kv@",
  "concurrency": 8,
  "engineType": "vllm",
  "discoverPods": false
}
```

The `zmqEndpoint` field specifies the local ZMQ socket address to bind to.
For automatic Kubernetes pod discovery:
```json
{
  "topicFilter": "kv@",
  "concurrency": 8,
  "discoverPods": true,
  "podDiscoveryConfig": {
    "podLabelSelector": "llm-d.ai/inferenceServing=true",
    "podNamespace": "inference",
    "socketPort": 5557
  }
}
```

Configures the Kubernetes pod reconciler for automatic per-pod ZMQ subscriber management. The reconciler watches Kubernetes pods and dynamically creates/removes ZMQ subscribers based on pod lifecycle.
```json
{
  "podLabelSelector": "llm-d.ai/inferenceServing=true",
  "podNamespace": "",
  "socketPort": 5556
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `podLabelSelector` | string | Label selector for filtering which pods to watch. Examples: `"app=vllm"`, `"app=vllm,tier=gpu"` | `"llm-d.ai/inferenceServing=true"` |
| `podNamespace` | string | Namespace to watch pods in. If empty, watches all namespaces (requires cluster-wide RBAC) | `""` (all namespaces) |
| `socketPort` | integer | Port number where vLLM pods expose their ZMQ socket | `5557` |
For the reconciler to create a subscriber for a pod, the pod must meet all of these conditions:

- Match label selector: pod labels must match the configured `podLabelSelector`
- Running state: `pod.Status.Phase == Running`
- Has IP address: `pod.Status.PodIP != ""`
- Ready condition: the pod has condition `PodReady == ConditionTrue`

When any of these conditions becomes false, the subscriber is automatically removed.
When using the pod reconciler, ensure the service account has appropriate RBAC permissions.

Namespace-scoped (when `podNamespace` is set):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kv-cache-manager
  namespace: inference
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```

Cluster-wide (when `podNamespace` is empty):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kv-cache-manager
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```

For the ZMQ event processing pool:
```json
{
  "zmqEndpoint": "tcp://indexer:5557",
  "topicFilter": "kv@",
  "concurrency": 8
}
```

Configures how tokens are converted to KV-block keys.
```json
{
  "blockSize": 16,
  "hashSeed": ""
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `blockSize` | integer | Number of tokens per block | `16` |
| `hashSeed` | string | Seed for hash generation (should align with vLLM's `PYTHONHASHSEED`) | `""` |
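Because the seed must match vLLM's `PYTHONHASHSEED`, the two are typically set from the same value. A hedged sketch (the seed value is illustrative):

```json
{
  "tokenProcessorConfig": {
    "blockSize": 16,
    "hashSeed": "12345"
  }
}
```

This would pair with vLLM pods started with `PYTHONHASHSEED=12345`, so both sides hash token blocks identically.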
- Hash Seed Alignment: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
- Memory Considerations:
  - The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
  - The `size` parameter in `CostAwareMemoryIndexConfig` controls the maximum memory footprint and supports human-readable formats (e.g., "2GiB", "500MiB", "1GB").
- Performance Tuning:
  - Increase `workersCount` in the tokenization config for higher tokenization throughput.
  - Adjust `concurrency` in event processing for better event-handling performance.
  - Tune cache sizes based on available memory and expected workload.
- Cache Directories: If used, ensure `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
- Redis Configuration: When using the Redis backend, ensure the Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.