End-to-end guide: deploy vLLM + Prometheus + Grafana in K8s, port-forward to view dashboards locally, and validate with a benchmark.
```shell
# Create HF token secret and PVCs
export HF_TOKEN=<your-token>
kubectl create secret generic hf-token --from-literal=HF_TOKEN="$HF_TOKEN"
kubectl apply -f docs/deployment/vllm-pvc.yaml
```
```shell
# Deploy vLLM
kubectl apply -f docs/deployment/vllm-storage.yaml
```

Wait for the pod to become ready (model download + startup takes a few minutes):

```shell
kubectl wait --for=condition=ready pod -l app=vllm-storage --timeout=600s
```

```shell
# Prometheus: scrapes vLLM metrics every 15s
kubectl apply -f docs/deployment/monitoring/prometheus.yaml
```
```shell
# Grafana dashboard ConfigMap (must be applied before Grafana)
kubectl apply -f docs/deployment/monitoring/grafana-dashboard-configmap.yaml

# Grafana: pre-configured with Prometheus datasource and the dashboard
kubectl apply -f docs/deployment/monitoring/grafana.yaml
```

Open two terminals:
```shell
# Terminal 1: Grafana UI on http://localhost:3000
kubectl port-forward svc/grafana-svc 3000:3000
```

```shell
# Terminal 2: Prometheus UI on http://localhost:9090 (optional, for ad-hoc queries)
kubectl port-forward svc/prometheus-svc 9090:9090
```

Open the dashboard directly at:
http://localhost:3000/d/vllm-kv-offload/vllm-kv-offload-dashboard
Anonymous access is enabled so no login is needed.
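If you port-forwarded Prometheus, ad-hoc queries can also be scripted against its standard `/api/v1/query` HTTP endpoint. A minimal sketch using only the standard library; the PromQL metric name in the comment is an assumption, so check `/metrics` for the exact names your vLLM build exports:

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for Prometheus's standard /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

def instant_query(base_url: str, promql: str) -> list:
    """Run an ad-hoc PromQL query and return the result vector."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# With the port-forward active, something like
#   instant_query("http://localhost:9090", "rate(vllm:generation_tokens_total[1m])")
# returns a list of {"metric": {...}, "value": [timestamp, "<rate>"]} samples.
print(build_query_url("http://localhost:9090", "up"))
```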
Port-forward the vLLM service and run the benchmark:
```shell
# Terminal 3: vLLM API on http://localhost:8000
kubectl port-forward svc/vllm-storage-svc 8000:8000
```

Run two benchmark iterations to test KV cache offload (write) and retrieval (read):
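For a sense of scale, the benchmark parameters (100 prefixes of 16384 tokens each) translate into a rough KV-cache footprint. The model dimensions below are assumptions for Qwen/Qwen3-32B, so verify them against the model's `config.json`; the arithmetic, not the exact numbers, is the point:

```python
# Rough KV-cache sizing for the prefix_repetition benchmark.
# Model dimensions are ASSUMED for Qwen/Qwen3-32B -- check config.json.
NUM_LAYERS = 64        # assumed hidden layer count
NUM_KV_HEADS = 8       # assumed GQA key/value head count
HEAD_DIM = 128         # assumed per-head dimension
DTYPE_BYTES = 2        # bf16/fp16 KV cache

PREFIX_LEN = 16384     # --prefix-repetition-prefix-len
NUM_PREFIXES = 100     # --prefix-repetition-num-prefixes

# Per token: one K and one V vector for every layer and KV head.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
total_tokens = PREFIX_LEN * NUM_PREFIXES
total_gib = bytes_per_token * total_tokens / 2**30

print(f"{bytes_per_token} bytes/token")     # 262144 (256 KiB)
print(f"{total_tokens} prefix tokens")      # 1638400
print(f"~{total_gib:.0f} GiB of KV cache")  # ~400 GiB
```

If the assumed dimensions are close, the unique-prefix working set far exceeds a single GPU's memory, which is exactly what pushes the cache to spill to storage and drives the offload traffic on the dashboard.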
Run 1: KV Cache Write/Offload Test

```shell
vllm bench serve \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-32B \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 16384 \
  --prefix-repetition-suffix-len 0 \
  --prefix-repetition-num-prefixes 100 \
  --prefix-repetition-output-len 5 \
  --num-prompts 100 \
  --max-concurrency 40 \
  --request-rate 40 \
  --burstiness 1 \
  --ignore-eos \
  --seed 42
```

Run 2: KV Cache Read/Retrieval Test

The command is identical; because Run 1 already cached the prefixes, this run reads the KV blocks back from offload storage instead of recomputing them.
```shell
vllm bench serve \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-32B \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 16384 \
  --prefix-repetition-suffix-len 0 \
  --prefix-repetition-num-prefixes 100 \
  --prefix-repetition-output-len 5 \
  --num-prompts 100 \
  --max-concurrency 40 \
  --request-rate 40 \
  --burstiness 1 \
  --ignore-eos \
  --seed 42
```

Watch the Grafana dashboard: you should see KV offload metrics (throughput, transfer rates, bytes offloaded) once the cache starts spilling to storage during the benchmark.
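The same counters can also be pulled programmatically by parsing the Prometheus text exposition format served at `/metrics`. A minimal sketch run against a hypothetical sample; the real metric names and labels come from your vLLM build's `/metrics` endpoint:

```python
import re

def parse_metrics(text: str, substring: str) -> dict:
    """Extract `name{labels} value` samples whose metric name contains
    substring, from Prometheus text exposition format (comments skipped)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$", line)
        if m and substring in m.group(1):
            samples[m.group(1) + (m.group(2) or "")] = float(m.group(3))
    return samples

# Hypothetical sample output -- metric names are illustrative only.
sample = """\
# HELP vllm:kv_offload_bytes_total Bytes of KV cache offloaded
# TYPE vllm:kv_offload_bytes_total counter
vllm:kv_offload_bytes_total{direction="write"} 1.073741824e+09
vllm:kv_offload_bytes_total{direction="read"} 5.36870912e+08
"""
for name, value in parse_metrics(sample, "kv_offload").items():
    print(name, value)
```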
To confirm the offload path is active, check the counters on the vLLM metrics endpoint:

```shell
curl -s http://localhost:8000/metrics | grep kv_offload
```

Remove all monitoring and vLLM resources:
```shell
# Monitoring stack
kubectl delete -f docs/deployment/monitoring/grafana.yaml
kubectl delete -f docs/deployment/monitoring/grafana-dashboard-configmap.yaml
kubectl delete -f docs/deployment/monitoring/prometheus.yaml

# vLLM
kubectl delete -f docs/deployment/vllm-storage.yaml
kubectl delete -f docs/deployment/vllm-pvc.yaml
kubectl delete secret hf-token
```