This document describes all configuration options available in the llm-d KV Cache libraries. All configurations are JSON-serializable.
This package consists of two components:
- KV Cache Indexer: Manages the KV cache index, allowing efficient retrieval of cached blocks.
- KV Event Processing: Handles events from vLLM to update the cache index.
See the Architecture Overview for a high-level view of how these components work and interact.
The two components are configured separately, but they share the index backend (which stores KV-block localities) and the token processor (which converts tokens into blocks).
The token processor is configured via the `tokenProcessorConfig` field in the main configuration.
The index backend is configured via the `kvBlockIndexConfig` field in the KV Cache Indexer configuration.
The main configuration structure for the llm-d KV Cache system:

```json
{
  "indexerConfig": { ... },
  "kvEventsConfig": { ... },
  "tokenProcessorConfig": { ... }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `indexerConfig` | `IndexerConfig` | Configuration for the KV Cache Indexer module | See defaults |
| `kvEventsConfig` | `KVEventsConfig` | Configuration for the KV Event Processing pool | See defaults |
| `tokenProcessorConfig` | `TokenProcessorConfig` | Configuration for token processing | See defaults |
The indexer configuration structure for the KV Cache Indexer module:

```json
{
  "kvBlockIndexConfig": { ... },
  "tokenizersPoolConfig": { ... },
  "kvCacheBackendConfigs": { ... }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `kvBlockIndexConfig` | `IndexConfig` | Configuration for KV block indexing | See defaults |
| `tokenizersPoolConfig` | `Config` | Configuration for the tokenization pool | See defaults |
| `kvCacheBackendConfigs` | `KVCacheBackendConfig` | Configuration for KV Cache Device Backends | See defaults |
Here's a complete configuration example with all options:
```json
{
  "kvBlockIndexConfig": {
    "inMemoryConfig": {
      "size": 100000000,
      "podCacheSize": 10
    },
    "enableMetrics": true,
    "metricsLoggingInterval": "1m0s"
  },
  "tokenizersPoolConfig": {
    "modelName": "namespace/model-name",
    "workersCount": 8,
    "hf": {
      "huggingFaceToken": "your_hf_token_here",
      "tokenizersCacheDir": "/tmp/tokenizers"
    },
    "local": {
      "autoDiscoveryDir": "/mnt/models",
      "autoDiscoveryTokenizerFileName": "tokenizer.json"
    }
  },
  "kvCacheBackendConfigs": [
    {
      "name": "gpu",
      "weight": 1.0
    },
    {
      "name": "cpu",
      "weight": 0.8
    }
  ]
}
```

Configures the KV-block index backend. Multiple backends can be configured, but only the first available one is used.
```json
{
  "inMemoryConfig": { ... },
  "costAwareMemoryConfig": { ... },
  "redisConfig": { ... },
  "enableMetrics": false
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `inMemoryConfig` | `InMemoryIndexConfig` | In-memory index configuration | See defaults |
| `costAwareMemoryConfig` | `CostAwareMemoryIndexConfig` | Cost-aware memory index configuration | `null` |
| `redisConfig` | `RedisIndexConfig` | Redis index configuration | `null` |
| `enableMetrics` | boolean | Enable admissions/evictions/hits/misses recording | `false` |
| `metricsLoggingInterval` | string (duration) | Interval at which metrics are logged (e.g., `"1m0s"`). If zero or omitted, metrics logging is disabled. Requires `enableMetrics` to be `true`. | `"0s"` |
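Metrics recording and metrics logging are gated separately: periodic log output requires both `enableMetrics: true` and a nonzero `metricsLoggingInterval`. A minimal sketch (the 30-second interval is an illustrative value, not a recommendation):

```json
{
  "inMemoryConfig": { "size": 100000000, "podCacheSize": 10 },
  "enableMetrics": true,
  "metricsLoggingInterval": "30s"
}
```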
Configures the in-memory KV block index implementation.
```json
{
  "size": 100000000,
  "podCacheSize": 10
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | integer | Maximum number of keys that can be stored | `100000000` |
| `podCacheSize` | integer | Maximum number of pod entries per key | `10` |
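Together these two limits bound the index's worst case: up to `size` keys, each holding up to `podCacheSize` pod entries. A quick back-of-the-envelope sketch (per-entry byte overhead depends on the implementation and is not computed here):

```python
# Worst-case entry count for the default in-memory index settings
size = 100_000_000     # maximum number of keys (default)
pod_cache_size = 10    # maximum pod entries per key (default)

max_pod_entries = size * pod_cache_size
print(max_pod_entries)  # 1000000000
```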
Configures the cost-aware memory-based KV block index implementation using Ristretto cache.
```json
{
  "size": "2GiB"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | string | Maximum memory size for the cache. Supports human-readable formats like `"2GiB"`, `"500MiB"`, `"1GB"`, etc. | `"2GiB"` |
Configures the Redis-backed KV block index implementation.
```json
{
  "address": "redis://127.0.0.1:6379"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `address` | string | Redis server address (can include auth: `redis://user:pass@host:port/db`) | `"redis://127.0.0.1:6379"` |
| `backendType` | string | Backend type: `"redis"` or `"valkey"` (optional, mainly for documentation) | `"redis"` |
| `enableRDMA` | boolean | Enable RDMA transport for Valkey (experimental, requires Valkey with RDMA support) | `false` |
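An authenticated example, following the URL form documented above (the hostname, credentials, and database number are placeholders):

```json
{
  "redisConfig": {
    "address": "redis://cacheuser:changeme@redis.example.svc:6379/0"
  }
}
```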
Configures the Valkey-backed KV block index implementation. Valkey is a Redis-compatible, open-source alternative that supports RDMA for improved latency.
```json
{
  "address": "valkey://127.0.0.1:6379",
  "backendType": "valkey",
  "enableRDMA": false
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `address` | string | Valkey server address. Supports `valkey://`, `valkeys://` (SSL), `redis://`, or a plain address | `"valkey://127.0.0.1:6379"` |
| `backendType` | string | Should be `"valkey"` for Valkey instances | `"valkey"` |
| `enableRDMA` | boolean | Enable RDMA transport (requires a Valkey server with RDMA support) | `false` |
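Putting the Valkey fields together, an RDMA-enabled sketch (the hostname is a placeholder, and `enableRDMA` still requires a Valkey server built with RDMA support):

```json
{
  "address": "valkey://valkey.example.svc:6379",
  "backendType": "valkey",
  "enableRDMA": true
}
```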
Note: Both Redis and Valkey configurations use the same RedisIndexConfig structure since Valkey is API-compatible with Redis.
Configures the tokenization worker pool and cache utilization strategy.

```json
{
  "modelName": "namespace/model-name",
  "workersCount": 5,
  "hf": {
    "enabled": true,
    "huggingFaceToken": "",
    "tokenizersCacheDir": ""
  },
  "local": {
    "autoDiscoveryDir": "/mnt/models",
    "autoDiscoveryTokenizerFileName": "tokenizer.json",
    "modelTokenizerMap": {
      "my-model": "/path/to/custom-model/tokenizer.json"
    }
  }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `modelName` | string | Base model name for the tokenizer | |
| `workersCount` | integer | Number of tokenization worker goroutines | `5` |
| `hf` | `HFTokenizerConfig` | HuggingFace tokenizer config | |
| `local` | `LocalTokenizerConfig` | Local tokenizer config | |
Configures loading tokenizers from local files. Useful for air-gapped environments or when models are pre-loaded.
```json
{
  "autoDiscoveryDir": "/mnt/models",
  "autoDiscoveryTokenizerFileName": "tokenizer.json",
  "modelTokenizerMap": {
    "my-model": "/path/to/custom-model/tokenizer.json"
  }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `autoDiscoveryDir` | string | Directory to recursively scan for tokenizer files. Can be set via the `LOCAL_TOKENIZER_DIR` environment variable. | `"/mnt/models"` |
| `autoDiscoveryTokenizerFileName` | string | Filename to search for during auto-discovery. Can be set via the `LOCAL_TOKENIZER_FILENAME` environment variable. | `"tokenizer.json"` |
| `modelTokenizerMap` | map[string]string | Manual mapping from model name to tokenizer file path. Overrides auto-discovered model mappings. | `{}` |
Auto-Discovery Behavior:

When `autoDiscoveryDir` is set, the system recursively scans the directory for files matching `autoDiscoveryTokenizerFileName`. It supports two directory structure patterns:

1. HuggingFace cache structure (automatically detected):

   ```
   ~/.cache/huggingface/hub/
     models--Qwen--Qwen3-0.6B/snapshots/{hash}/tokenizer.json               → Model name: "Qwen/Qwen3-0.6B"
     models--meta-llama--Llama-2-7b-chat-hf/snapshots/{hash}/tokenizer.json → Model name: "meta-llama/Llama-2-7b-chat-hf"
   ```

2. Custom directory structure (arbitrary nesting):

   ```
   /mnt/models/
     llama-7b/tokenizer.json       → Model name: "llama-7b"
     Qwen/Qwen3/tokenizer.json     → Model name: "Qwen/Qwen3"
     org/team/model/tokenizer.json → Model name: "org/team/model"
   ```

Environment Variables:
- `LOCAL_TOKENIZER_DIR`: Overrides the default auto-discovery directory
- `LOCAL_TOKENIZER_FILENAME`: Overrides the default tokenizer filename
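The same settings can be supplied through the environment instead of the configuration file; a sketch (the directory path is illustrative):

```shell
# Override auto-discovery settings via environment variables
export LOCAL_TOKENIZER_DIR=/opt/tokenizers
export LOCAL_TOKENIZER_FILENAME=tokenizer.json
```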
Configures the HuggingFace tokenizer backend for downloading tokenizers from HuggingFace Hub.
```json
{
  "enabled": true,
  "huggingFaceToken": "",
  "tokenizersCacheDir": "./bin"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `enabled` | boolean | Enable the HuggingFace tokenizer backend | `true` |
| `huggingFaceToken` | string | HuggingFace API token for accessing private models | `""` |
| `tokenizersCacheDir` | string | Local directory for caching downloaded tokenizers | `"./bin"` |
Note: The system uses a composite tokenizer by default that tries local tokenizers first, then falls back to HuggingFace tokenizers if enabled and the model is not found locally.
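Since the composite tokenizer only falls back to HuggingFace when that backend is enabled, an air-gapped deployment can disable it and rely on local tokenizers alone. A hedged sketch (paths are illustrative):

```json
{
  "hf": { "enabled": false },
  "local": {
    "autoDiscoveryDir": "/mnt/models",
    "autoDiscoveryTokenizerFileName": "tokenizer.json"
  }
}
```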
Configures the available device backends that store the KV cache blocks; the backend weights are used in scoring.

```json
[
  {
    "name": "gpu",
    "weight": 1.0
  },
  {
    "name": "cpu",
    "weight": 0.8
  }
]
```

Configures the ZMQ event processing pool for handling KV cache events. The pool supports two modes:
- Static Endpoint Mode: Connects to a single ZMQ endpoint
- Auto-Discovery Mode (default): Automatically discovers and subscribes to per-pod ZMQ endpoints
```json
{
  "topicFilter": "kv@",
  "concurrency": 16,
  "discoverPods": true
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `zmqEndpoint` | string | ZMQ address to connect to | `""` |
| `topicFilter` | string | ZMQ subscription filter | `"kv@"` |
| `concurrency` | integer | Number of parallel workers | `4` |
| `engineType` | string | Inference engine adapter type (`"vllm"` or `"sglang"`) | `"vllm"` |
| `discoverPods` | boolean | Enable the Kubernetes pod reconciler for automatic per-pod subscriber management | `true` |
| `podDiscoveryConfig` | `PodDiscoveryConfig` | Configuration for the pod reconciler (only used when `discoverPods` is `true`) | `null` |
For connecting to a single ZMQ endpoint:
```json
{
  "zmqEndpoint": "tcp://indexer:5557",
  "topicFilter": "kv@",
  "concurrency": 8,
  "engineType": "vllm",
  "discoverPods": false
}
```

The `zmqEndpoint` field specifies the local ZMQ socket address to bind to.
For automatic Kubernetes pod discovery:
```json
{
  "topicFilter": "kv@",
  "concurrency": 8,
  "discoverPods": true,
  "podDiscoveryConfig": {
    "podLabelSelector": "llm-d.ai/inferenceServing=true",
    "podNamespace": "inference",
    "socketPort": 5557
  }
}
```

Configures the Kubernetes pod reconciler for automatic per-pod ZMQ subscriber management. The reconciler watches Kubernetes pods and dynamically creates/removes ZMQ subscribers based on pod lifecycle.
```json
{
  "podLabelSelector": "llm-d.ai/inferenceServing=true",
  "podNamespace": "",
  "socketPort": 5556
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `podLabelSelector` | string | Label selector for filtering which pods to watch. Examples: `"app=vllm"`, `"app=vllm,tier=gpu"` | `"llm-d.ai/inferenceServing=true"` |
| `podNamespace` | string | Namespace to watch pods in. If empty, watches all namespaces (requires cluster-wide RBAC) | `""` (all namespaces) |
| `socketPort` | integer | Port number where vLLM pods expose their ZMQ socket | `5557` |
For the reconciler to create a subscriber for a pod, the pod must meet all of these conditions:

- Match label selector: pod labels must match the configured `podLabelSelector`
- Running state: `pod.Status.Phase == Running`
- Has IP address: `pod.Status.PodIP != ""`
- Ready condition: the pod has condition `PodReady == ConditionTrue`

When any of these conditions becomes false, the subscriber is automatically removed.
When using the pod reconciler, ensure the service account has appropriate RBAC permissions.

Namespace-scoped (when `podNamespace` is set):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kv-cache-manager
  namespace: inference
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```

Cluster-wide (when `podNamespace` is empty):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kv-cache-manager
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```

For the ZMQ event processing pool:
```json
{
  "zmqEndpoint": "tcp://indexer:5557",
  "topicFilter": "kv@",
  "concurrency": 8
}
```

Configures how tokens are converted to KV-block keys.
```json
{
  "blockSize": 16,
  "hashSeed": ""
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `blockSize` | integer | Number of tokens per block | `16` |
| `hashSeed` | string | Seed for hash generation (should align with vLLM's `PYTHONHASHSEED`) | `""` |
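Because the seed must match vLLM's `PYTHONHASHSEED`, the two are typically set from the same value. A hedged sketch (the seed value is illustrative):

```json
{
  "tokenProcessorConfig": {
    "blockSize": 16,
    "hashSeed": "12345"
  }
}
```

This would pair with vLLM pods started with `PYTHONHASHSEED=12345`, so both sides hash token blocks identically.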
- Hash Seed Alignment: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
- Memory Considerations:
  - The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
  - The `size` parameter in `CostAwareMemoryIndexConfig` controls the maximum memory footprint and supports human-readable formats (e.g., "2GiB", "500MiB", "1GB").
- Performance Tuning:
  - Increase `workersCount` in the tokenization config for higher tokenization throughput.
  - Adjust `concurrency` in event processing for better event-handling performance.
  - Tune cache sizes based on available memory and expected workload.
- Cache Directories: If used, ensure `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
- Redis Configuration: When using the Redis backend, ensure the Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.