This document describes all configuration options available in the llm-d KV Cache Manager. All configurations are JSON-serializable.
This package consists of two components:
- KV Cache Indexer: Manages the KV cache index, allowing efficient retrieval of cached blocks.
- KV Event Processing: Handles events from vLLM to update the cache index.
See the Architecture Overview for a high-level view of how these components work and interact.
The two components are configured separately but share the index backend for storing KV block localities. This shared backend is configured via the `kvBlockIndexConfig` field in the KV Cache Indexer configuration.
### KV Cache Indexer Configuration

The main configuration structure for the KV Cache Indexer module:

```json
{
  "prefixStoreConfig": { ... },
  "tokenProcessorConfig": { ... },
  "kvBlockIndexConfig": { ... },
  "tokenizersPoolConfig": { ... }
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `prefixStoreConfig` | `LRUStoreConfig` | Configuration for the prefix store | See defaults |
| `tokenProcessorConfig` | `TokenProcessorConfig` | Configuration for token processing | See defaults |
| `kvBlockIndexConfig` | `IndexConfig` | Configuration for KV block indexing | See defaults |
| `tokenizersPoolConfig` | `Config` | Configuration for the tokenization pool | See defaults |
Here's a complete configuration example with all options:

```json
{
  "prefixStoreConfig": {
    "cacheSize": 500000,
    "blockSize": 256
  },
  "tokenProcessorConfig": {
    "blockSize": 16,
    "hashSeed": "12345"
  },
  "kvBlockIndexConfig": {
    "inMemoryConfig": {
      "size": 100000000,
      "podCacheSize": 10
    },
    "enableMetrics": true,
    "metricsLoggingInterval": "1m0s"
  },
  "tokenizersPoolConfig": {
    "workersCount": 8,
    "minPrefixOverlapRatio": 0.85,
    "huggingFaceToken": "your_hf_token_here",
    "tokenizersCacheDir": "/tmp/tokenizers"
  }
}
```

### `IndexConfig`

Configures the KV-block index backend. Multiple backends can be configured, but only the first available one will be used.

```json
{
  "inMemoryConfig": { ... },
  "costAwareMemoryConfig": { ... },
  "redisConfig": { ... },
  "enableMetrics": false
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `inMemoryConfig` | `InMemoryIndexConfig` | In-memory index configuration | See defaults |
| `costAwareMemoryConfig` | `CostAwareMemoryIndexConfig` | Cost-aware memory index configuration | `null` |
| `redisConfig` | `RedisIndexConfig` | Redis index configuration | `null` |
| `enableMetrics` | boolean | Enable admissions/evictions/hits/misses recording | `false` |
| `metricsLoggingInterval` | string (duration) | Interval at which metrics are logged (e.g., `"1m0s"`). If zero or omitted, metrics logging is disabled. Requires `enableMetrics` to be `true`. | `"0s"` |
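The "only the first available backend is used" rule behaves like a simple fallback chain. The sketch below illustrates that selection logic in Python; the priority order is assumed from the field order shown above, and the real package implements this in Go:

```python
# Illustrative sketch: pick the first configured (non-null) index backend,
# mirroring the "only the first available one will be used" rule.
# The priority order here is assumed from the JSON field order above.
BACKEND_PRIORITY = ["inMemoryConfig", "costAwareMemoryConfig", "redisConfig"]

def select_backend(kv_block_index_config: dict) -> str:
    for name in BACKEND_PRIORITY:
        if kv_block_index_config.get(name) is not None:
            return name
    return "inMemoryConfig"  # assumed fallback to the in-memory default

print(select_backend({"redisConfig": {"address": "redis://127.0.0.1:6379"}}))
# -> redisConfig
```

If both `inMemoryConfig` and `redisConfig` are present, the in-memory backend wins and the Redis settings are ignored.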
### `InMemoryIndexConfig`

Configures the in-memory KV block index implementation.

```json
{
  "size": 100000000,
  "podCacheSize": 10
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | integer | Maximum number of keys that can be stored | `100000000` |
| `podCacheSize` | integer | Maximum number of pod entries per key | `10` |
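The two limits interact: `size` caps the number of block keys, while `podCacheSize` caps the pod entries stored under each key. A minimal sketch of those semantics (a simplified illustration, not the package's actual Go implementation):

```python
from collections import OrderedDict

# Simplified sketch of the in-memory index limits: at most `size` keys,
# each holding at most `podCacheSize` pod entries, with LRU-style eviction.
class InMemoryIndex:
    def __init__(self, size: int, pod_cache_size: int):
        self.size = size
        self.pod_cache_size = pod_cache_size
        self.index: "OrderedDict[str, OrderedDict[str, None]]" = OrderedDict()

    def add(self, block_key: str, pod: str) -> None:
        pods = self.index.setdefault(block_key, OrderedDict())
        pods[pod] = None
        pods.move_to_end(pod)
        if len(pods) > self.pod_cache_size:  # evict oldest pod entry for this key
            pods.popitem(last=False)
        self.index.move_to_end(block_key)
        if len(self.index) > self.size:      # evict least-recently-used key
            self.index.popitem(last=False)

    def lookup(self, block_key: str) -> list:
        return list(self.index.get(block_key, ()))

idx = InMemoryIndex(size=2, pod_cache_size=2)
idx.add("blk1", "pod-a")
idx.add("blk1", "pod-b")
idx.add("blk1", "pod-c")   # pod-a evicted: per-key cap is 2
print(idx.lookup("blk1"))  # ['pod-b', 'pod-c']
```

This is why memory usage scales with both parameters: the worst case holds `size × podCacheSize` pod entries.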
### `CostAwareMemoryIndexConfig`

Configures the cost-aware memory-based KV block index implementation using the Ristretto cache.

```json
{
  "size": "2GiB"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `size` | string | Maximum memory size for the cache. Supports human-readable formats like `"2GiB"`, `"500MiB"`, `"1GB"`, etc. | `"2GiB"` |
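Note that binary units (`GiB`, `MiB`) and decimal units (`GB`, `MB`) differ: `2GiB` is 2 × 2³⁰ bytes, not 2 × 10⁹. A sketch of a parser covering the documented examples (the exact grammar the package accepts is assumed):

```python
# Illustrative parser for human-readable size strings ("2GiB", "500MiB",
# "1GB", ...). The exact grammar accepted by the package is assumed; this
# sketch only covers the documented examples.
UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}

def parse_size(s: str) -> int:
    s = s.strip().upper()
    for unit in sorted(UNITS, key=len, reverse=True):  # try "GIB" before "B"
        if s.endswith(unit):
            return int(float(s[: -len(unit)]) * UNITS[unit])
    return int(s)  # bare number of bytes

print(parse_size("2GiB"))    # 2147483648
print(parse_size("500MiB"))  # 524288000
```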
### `RedisIndexConfig`

Configures the Redis-backed KV block index implementation.

```json
{
  "address": "redis://127.0.0.1:6379"
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `address` | string | Redis server address (can include auth: `redis://user:pass@host:port/db`) | `"redis://127.0.0.1:6379"` |
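A Redis URL is a standard URL, so each part of the `redis://user:pass@host:port/db` form maps to a well-defined component. The hostname below is a hypothetical example:

```python
from urllib.parse import urlsplit

# The `address` field accepts a full Redis URL; standard URL parsing shows
# which parts carry authentication and database selection.
# "cache.internal" is a hypothetical hostname for illustration.
url = urlsplit("redis://user:pass@cache.internal:6379/2")
print(url.username, url.password)  # user pass
print(url.hostname, url.port)      # cache.internal 6379
print(url.path.lstrip("/"))        # 2  (database number)
```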
### `TokenProcessorConfig`

Configures how tokens are converted to KV-block keys.

```json
{
  "blockSize": 16,
  "hashSeed": ""
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `blockSize` | integer | Number of tokens per block | `16` |
| `hashSeed` | string | Seed for hash generation (should align with vLLM's `PYTHONHASHSEED`) | `""` |
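To see why the seed must match vLLM's, consider chained block hashing: each block's key depends on the previous block's key and the seed, so a seed mismatch makes every key diverge. The scheme below is a simplified sketch for illustration, not vLLM's exact algorithm:

```python
import hashlib

def block_keys(tokens: list, block_size: int = 16, seed: str = "") -> list:
    """Chunk tokens into full blocks and chain-hash them (simplified sketch)."""
    keys, parent = [], seed
    # Only complete blocks get keys; the trailing partial block is dropped.
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        chunk = tokens[i : i + block_size]
        digest = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()[:16]
        keys.append(digest)
        parent = digest  # chain: next key depends on this one
    return keys

a = block_keys(list(range(40)), block_size=16, seed="12345")
b = block_keys(list(range(40)), block_size=16, seed="54321")
print(len(a))        # 2: 40 tokens yield two full 16-token blocks
print(a[0] != b[0])  # True: a different seed changes every block key
```

Because the keys are chained, two deployments with different seeds produce entirely disjoint key spaces even for identical prompts, defeating cache lookups.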
### `LRUStoreConfig`

Configures the LRU-based prefix token store.

```json
{
  "cacheSize": 500000,
  "blockSize": 256
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `cacheSize` | integer | Maximum number of blocks the LRU cache can store | `500000` |
| `blockSize` | integer | Number of characters per block in the tokenization prefix-cache | `256` |
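Note that this `blockSize` counts characters of the prompt text, not tokens. A sketch of how a prompt might be chunked into fixed-size character blocks for prefix lookups (an assumed simplification for illustration):

```python
def char_blocks(prompt: str, block_size: int = 256) -> list:
    """Split a prompt into full, fixed-size character blocks (sketch).

    Keeping only complete blocks means lookups hit the same block
    boundaries regardless of where a given prompt happens to end.
    """
    n_full = len(prompt) // block_size
    return [prompt[i * block_size : (i + 1) * block_size] for i in range(n_full)]

blocks = char_blocks("x" * 600, block_size=256)
print(len(blocks))  # 2: 600 chars -> two full 256-char blocks, 88 chars left over
```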
### `TokenizersPoolConfig`

Configures the tokenization worker pool and cache utilization strategy.

```json
{
  "workersCount": 5,
  "minPrefixOverlapRatio": 0.8,
  "huggingFaceToken": "",
  "tokenizersCacheDir": ""
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `workersCount` | integer | Number of tokenization worker goroutines | `5` |
| `minPrefixOverlapRatio` | float64 | Minimum overlap ratio required to reuse cached prefix tokens (0.0–1.0) | `0.8` |
| `huggingFaceToken` | string | HuggingFace authentication token | `""` |
| `tokenizersCacheDir` | string | Directory for caching tokenizers | `""` |
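The `minPrefixOverlapRatio` check can be pictured as follows: if the longest cached prefix covers at least that fraction of the incoming prompt, the cached tokens are reused and only the remainder is tokenized. This is an illustrative sketch of the documented behavior, not the package's code:

```python
def use_cached_prefix(prompt_len: int, cached_prefix_len: int,
                      min_overlap_ratio: float = 0.8) -> bool:
    """Decide whether a cached prefix is long enough to reuse (sketch)."""
    if prompt_len == 0:
        return False
    return cached_prefix_len / prompt_len >= min_overlap_ratio

print(use_cached_prefix(1000, 850))       # True  (0.85 >= 0.8)
print(use_cached_prefix(1000, 700))       # False (0.70 <  0.8)
print(use_cached_prefix(1000, 700, 0.6))  # True: a lower ratio accepts shorter prefixes
```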
### HuggingFace Tokenizer Configuration

Configures the HuggingFace tokenizer backend.

```json
{
  "huggingFaceToken": "",
  "tokenizersCacheDir": ""
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `huggingFaceToken` | string | HuggingFace API token for accessing models | `""` |
| `tokenizersCacheDir` | string | Local directory for caching downloaded tokenizers | `"./bin"` |
### KV Event Processing Configuration

Configures the ZMQ event processing pool for handling KV cache events.

```json
{
  "zmqEndpoint": "tcp://*:5557",
  "topicFilter": "kv@",
  "concurrency": 4
}
```

An example configuration for the ZMQ event processing pool:

```json
{
  "zmqEndpoint": "tcp://indexer:5557",
  "topicFilter": "kv@",
  "concurrency": 8
}
```

| Field | Type | Description | Default |
|---|---|---|---|
| `zmqEndpoint` | string | ZMQ address to connect to | `"tcp://*:5557"` |
| `topicFilter` | string | ZMQ subscription filter | `"kv@"` |
| `concurrency` | integer | Number of parallel workers | `4` |
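ZMQ SUB sockets match subscription filters against message topics by prefix, so `topicFilter: "kv@"` receives every message whose topic begins with `kv@`. The matching rule itself is simple; the topic names below are hypothetical examples, not the package's actual topic scheme:

```python
# ZeroMQ subscription filters are prefix matches: a SUB socket subscribed
# to "kv@" receives every message whose topic starts with "kv@".
def matches(topic_filter: str, topic: str) -> bool:
    return topic.startswith(topic_filter)

print(matches("kv@", "kv@pod-1"))     # True  (hypothetical per-pod topic)
print(matches("kv@", "metrics@pod"))  # False (different prefix, filtered out)
```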
- **Hash Seed Alignment**: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
- **Memory Considerations**:
  - The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
  - The `size` parameter in `CostAwareMemoryIndexConfig` controls the maximum memory footprint and supports human-readable formats (e.g., `"2GiB"`, `"500MiB"`, `"1GB"`).
- **Performance Tuning**:
  - Increase `workersCount` in the tokenization config for higher tokenization throughput.
  - Adjust `minPrefixOverlapRatio`: lower values accept shorter cached prefixes, reducing full tokenization overhead.
  - Adjust `concurrency` in event processing for better event handling performance.
  - Tune cache sizes based on available memory and expected workload.
- **Cache Directories**: If used, ensure the `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
- **Redis Configuration**: When using the Redis backend, ensure the Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.