KV Cache Event Synchronization is a feature that enables multiple vLLM instances to share key-value cache states through ZMQ-based event publishing. This improves prefix cache hit rates and reduces redundant computation by allowing the AIBrix gateway to make intelligent routing decisions based on real-time cache state.
The KV event synchronization system consists of:
- vLLM Instances: Publish KV cache events via ZMQ pub/sub pattern
- AIBrix Cache: Manages subscriptions and processes events
- Sync Prefix Cache Indexer: Maintains global prefix cache state
- Gateway Router: Uses cache state for intelligent routing decisions
```
vLLM Pod 1 ─────┐
                ├─── ZMQ Events ───► KV Event Manager ───► Sync Indexer ───► Gateway Router
vLLM Pod N ─────┘                        (in Cache)
```
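As a rough illustration of the last stage, the routing decision can be sketched as a longest-prefix match over the shared cache index. The data layout and `pickPod` function below are hypothetical simplifications, not the actual AIBrix router:

```go
package main

import "fmt"

// pickPod chooses the pod holding the most blocks along the request's
// prefix. index maps a block hash to the set of pods caching that block.
// Matching stops at the first block no pod has cached.
func pickPod(requestBlocks []int64, index map[int64]map[string]bool) string {
	counts := map[string]int{}
	for _, h := range requestBlocks {
		pods, ok := index[h]
		if !ok {
			break // prefix match ends at the first miss
		}
		for pod := range pods {
			counts[pod]++
		}
	}
	best, bestLen := "", 0
	for pod, n := range counts {
		if n > bestLen {
			best, bestLen = pod, n
		}
	}
	return best
}

func main() {
	index := map[int64]map[string]bool{
		101: {"vllm-pod-1": true, "vllm-pod-2": true},
		102: {"vllm-pod-1": true},
	}
	// vllm-pod-1 matches two blocks of the prefix, vllm-pod-2 only one.
	fmt.Println(pickPod([]int64{101, 102, 103}, index)) // vllm-pod-1
}
```

The real router also weighs load and pod health; this sketch only shows why a global prefix index enables better placement than per-pod heuristics.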
The system uses a two-stage initialization:
- Cache Initialization: performed via the `InitWithOptions` pattern with `EnableKVSync=true`
- KV Event Manager: created automatically once the prerequisites are met
Prerequisites:
- vLLM version 0.7.0 or later with KV cache events support
- AIBrix gateway-plugins built with ZMQ support (`-tags="zmq"`)
- ZMQ library (`libzmq3-dev`) installed on gateway nodes
- Remote tokenizer enabled (strict prerequisite)
- Redis client configured (for production deployments)
> **Important:** KV event sync has a strict dependency on the remote tokenizer to ensure consistent tokenization between the gateway and vLLM instances. The system will not initialize if the remote tokenizer is disabled.
| Variable | Default | Description |
|---|---|---|
| `AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED` | `false` | Enable KV event synchronization |
| `AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER` | `false` | Must be `true` for KV sync |
| `AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT` | | vLLM service endpoint |
| `AIBRIX_PREFIX_CACHE_LOCAL_ROUTER_METRICS_ENABLED` | `false` | Enable prefix cache metrics |
| Label | Value | Description |
|---|---|---|
| `model.aibrix.ai/kv-events-enabled` | `true` | Enable KV events for this pod |
| `model.aibrix.ai/lora-id` | string | LoRA adapter ID (optional) |
Add these arguments to your vLLM container:

```yaml
args:
- --enable-kv-cache-events
- --kv-events-publisher=zmq
- --kv-events-endpoint=tcp://*:5557
- --kv-events-replay-endpoint=tcp://*:5558
- --kv-events-buffer-steps=10000
```

Add the corresponding container ports:

```yaml
ports:
- name: kv-events
  containerPort: 5557
  protocol: TCP
- name: kv-replay
  containerPort: 5558
  protocol: TCP
```

Enable Remote Tokenizer (mandatory prerequisite):
```bash
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER=true \
  AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT=http://vllm-service:8000
```
Enable KV Event Sync:
```bash
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=true
```
Enable Prefix Cache Metrics (optional but recommended):
```bash
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_LOCAL_ROUTER_METRICS_ENABLED=true
```
Deploy vLLM with KV Events:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-model
spec:
  template:
    metadata:
      labels:
        model.aibrix.ai/name: "llama-7b"
        model.aibrix.ai/kv-events-enabled: "true"
    spec:
      containers:
      - name: vllm
        args:
        - --enable-kv-cache-events
        - --kv-events-publisher=zmq
        - --kv-events-endpoint=tcp://*:5557
        - --kv-events-replay-endpoint=tcp://*:5558
```
AIBrix uses conditional compilation to manage the ZMQ dependency.

Components requiring ZMQ support:
- `gateway-plugins`: main component for KV event sync
- `kvcache-watcher`: optional component for cache monitoring

Build commands:

```bash
# Build with ZMQ support
go build -tags="zmq" ./cmd/plugins/main.go

# Docker build with ZMQ (automatically includes ZMQ)
make docker-build-gateway-plugins
```

Components that do NOT require ZMQ:
- `controller-manager`: uses the default build
- `metadata-service`: uses the default build
- `runtime`: Python component, no ZMQ needed
Published when new KV cache blocks are stored:

```go
type BlockStoredEvent struct {
    BlockHashes     []int64  // Hash values of stored blocks
    TokenIDs        [][]byte // Token IDs for each block (each token is a big-endian uint32)
    ModelName       string   // Model identifier
    LoraID          int64    // LoRA adapter ID (-1 if none)
    SourcePod       string   // Source pod name
    ParentBlockHash *int64   // Hash value of the parent block, or nil
}
```

Published when blocks are removed from the cache:

```go
type BlockRemovedEvent struct {
    BlockHashes []int64 // Hash values of removed blocks
    ModelName   string  // Model identifier
    LoraID      int64   // LoRA adapter ID
    SourcePod   string  // Source pod name
}
```

Check initialization logs:

```bash
kubectl logs deployment/aibrix-gateway-plugins -n aibrix-system | grep -E "KV event|initialize cache"
```
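The event structs above pin down two details worth illustrating: each token in `TokenIDs` is packed as a big-endian uint32, and `ParentBlockHash` links a block to its prefix parent. A small Go sketch of a consumer (the chain-tracking logic is hypothetical, not the AIBrix indexer itself):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeTokens unpacks one block's token payload: each token is a
// big-endian uint32, per the BlockStoredEvent schema.
func decodeTokens(block []byte) []uint32 {
	tokens := make([]uint32, 0, len(block)/4)
	for i := 0; i+4 <= len(block); i += 4 {
		tokens = append(tokens, binary.BigEndian.Uint32(block[i:i+4]))
	}
	return tokens
}

// parentOf records ParentBlockHash links observed in BlockStored events.
var parentOf = map[int64]int64{}

func recordBlock(hash int64, parent *int64) {
	if parent != nil {
		parentOf[hash] = *parent
	}
}

// prefixChain walks parent links from a block back to its root, so the
// full cached prefix for a block can be reconstructed.
func prefixChain(hash int64) []int64 {
	chain := []int64{hash}
	for {
		p, ok := parentOf[hash]
		if !ok {
			break
		}
		chain = append([]int64{p}, chain...)
		hash = p
	}
	return chain
}

func main() {
	// Two tokens, 17 and 258 (0x0102), encoded big-endian.
	fmt.Println(decodeTokens([]byte{0, 0, 0, 17, 0, 0, 1, 2})) // [17 258]

	root := int64(101)
	recordBlock(101, nil)
	recordBlock(102, &root)
	fmt.Println(prefixChain(102)) // [101 102]
}
```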
Verify the remote tokenizer:

```bash
# Both variables must be enabled
kubectl get deployment/aibrix-gateway-plugins -n aibrix-system -o yaml | grep -A2 "REMOTE_TOKENIZER\|KV_EVENT_SYNC"
```
Check vLLM logs:

```bash
kubectl logs deployment/vllm-model | grep "KV cache events"
```
Verify ZMQ connectivity:

```bash
kubectl exec -it <gateway-pod> -n aibrix-system -- nc -zv <vllm-pod-ip> 5557
```
Check ZMQ build support:

```bash
kubectl exec <gateway-pod> -n aibrix-system -- ldd /app/gateway-plugin | grep zmq
```
Verify pod labels:

```bash
kubectl get pods -l model.aibrix.ai/kv-events-enabled=true
```
Check network policies:
- Ensure ports 5557-5558 are accessible
- No blocking NetworkPolicies
Validate the tokenizer:

```bash
kubectl exec <gateway-pod> -- curl http://tokenizer:8080/health
```
- High Memory Usage: Reduce buffer steps in vLLM
- Event Processing Lag: Adjust batch size and polling timeout
- Network Overhead: ~1MB/s per pod at high load
Add labels:

```bash
kubectl label deployment vllm-model model.aibrix.ai/kv-events-enabled=true
```
Update deployment with KV event args (see Configuration section)
Restart pods:

```bash
kubectl rollout restart deployment vllm-model
```
To disable KV event sync:

```bash
# Disable in the gateway
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=false

# Remove the label from vLLM deployments
kubectl label deployment vllm-model model.aibrix.ai/kv-events-enabled-
```
- Deployment Order:
- Enable remote tokenizer first and verify it's working
- Deploy vLLM with KV events configuration
- Enable KV sync in gateway last
- Monitoring:
- Enable prefix cache metrics for visibility
- Monitor ZMQ connection status in logs
- Track prefix cache hit rates in Grafana
- Resource Planning:
- ZMQ traffic: ~1MB/s per vLLM pod at high load
- Memory: Sync indexer uses ~64 bytes per prefix entry
- CPU: Minimal overhead (<1% per pod)
- Production Considerations:
- Use dedicated network for ZMQ traffic if possible
- Configure appropriate timeouts based on network latency
- Plan for graceful degradation if KV sync fails
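The ~64 bytes per prefix entry figure above makes indexer sizing a quick back-of-envelope calculation; a small sketch (the helper function is illustrative only):

```go
package main

import "fmt"

// indexerMemMB estimates sync-indexer memory from the ~64 bytes per
// prefix entry figure quoted in the resource-planning notes.
func indexerMemMB(entries int) float64 {
	const bytesPerEntry = 64
	return float64(entries*bytesPerEntry) / (1024 * 1024)
}

func main() {
	// One million cached prefix entries is roughly 61 MB.
	fmt.Printf("%.1f MB for 1,000,000 entries\n", indexerMemMB(1_000_000))
}
```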