This playbook provides comprehensive guidance for deploying, operating, and troubleshooting LLM-D (Distributed LLM Inference) on Red Hat OpenShift AI.
LLM-D enables intelligent routing and distributed inference for Large Language Models. It provides significant performance improvements over naive load balancing through:
- Prefix-aware routing: Routes requests to replicas with cached prefixes, improving KV cache hit rates from ~25% to 90%+
- Prefill/Decode disaggregation: Separates compute-intensive prefill from memory-bandwidth-bound decode phases
- Load-aware scheduling: Balances traffic based on real-time metrics from vLLM instances
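The combination of prefix-aware and load-aware routing can be illustrated with a small sketch (hypothetical replica and cache names; this is not the actual EPP implementation): a scheduler prefers the replica whose KV cache already holds the longest matching prefix of the incoming prompt, and breaks ties by current load.

```python
# Hypothetical sketch of prefix-aware replica selection (not the real EPP code).
# Each replica tracks token prefixes it has cached; the scheduler routes a new
# request to the replica with the longest cached prefix, using load to break ties.

def longest_common_prefix(a: list[str], b: list[str]) -> int:
    """Length of the shared leading token run between two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_replica(prompt_tokens: list[str], replicas: dict) -> str:
    """replicas: name -> {"cached": [tokens...], "load": in-flight requests}."""
    def score(name: str):
        r = replicas[name]
        prefix = longest_common_prefix(prompt_tokens, r["cached"])
        # Prefix match dominates; lower load wins ties.
        return (prefix, -r["load"])
    return max(replicas, key=score)

replicas = {
    "vllm-0": {"cached": ["You", "are", "a", "helpful"], "load": 3},
    "vllm-1": {"cached": ["Translate", "the"], "load": 1},
}
print(pick_replica(["You", "are", "a", "pirate"], replicas))  # -> vllm-0
```

A request sharing a long system-prompt prefix lands on the replica that already computed those KV blocks, which is what lifts cache hit rates well above what round-robin balancing achieves.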
This playbook is self-contained with all deployment artifacts included.
| Guide | Description |
|---|---|
| Pre-flight Validation | Verify cluster readiness and prerequisites |
| Quick Start | Connected environment deployment in minutes |
| Advanced Deployment | Bare metal, MetalLB, custom configurations |
| Automated Deployment | GitOps and automation patterns |
| Disconnected Installs | Air-gapped and restricted network deployments |
| Running Benchmarks | Performance testing with GuideLLM |
| Performance Debugging | Diagnosing and resolving performance issues |
| Directory | Contents |
|---|---|
| `gitops/operators/` | Operator installation manifests (MetalLB, Service Mesh, RHOAI, etc.) |
| `gitops/instance/` | Instance configurations (LLM-D, Gateway, monitoring, GuideLLM) |
| `gitops/ocp-4.19/` | OCP 4.19 prerequisites and configs |
| `gitops/ocp-4.18/` | OCP 4.18 prerequisites (experimental) |
| `gitops/disconnected/` | ImageSetConfigurations for air-gapped installs |
| `monitoring/` | Prometheus and Grafana stack for metrics |
| `vllm/` | Vanilla vLLM deployment for baseline comparison |
| `llm-d/` | LLM-D deployment configurations |
| `guidellm/` | GuideLLM benchmark configurations and overlays |
| `benchmark-job/` | Kubernetes Job templates for benchmarking |
| `assets/` | Screenshots and images for documentation |
- OpenShift: 4.19+
- OpenShift AI: 2.25+ (3.0+ recommended)
- GPU: NVIDIA GPU with appropriate drivers
- Role: `cluster-admin`
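A quick way to sanity-check the version prerequisites is a small helper (a hypothetical sketch, not part of the playbook's tooling) that compares a reported version string against the minimums above:

```python
# Hypothetical pre-flight sketch: compare OpenShift/RHOAI versions against the
# minimums listed above. Not part of the playbook's shipped tooling.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '4.19.10' into (4, 19, 10); ignores a trailing '+'."""
    return tuple(int(p) for p in v.rstrip("+").split("."))

def meets_minimum(actual: str, minimum: str) -> bool:
    return parse_version(actual) >= parse_version(minimum)

# On a live cluster you would fetch the actual version with the CLI, e.g.:
#   oc get clusterversion -o jsonpath='{.items[0].status.desired.version}'
print(meets_minimum("4.19.7", "4.19"))   # True: OCP minimum satisfied
print(meets_minimum("2.19.0", "2.25"))   # False: RHOAI too old
```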
Install in this order:
- Cert Manager
- MetalLB (bare metal only)
- Service Mesh 3
- Connectivity Link (RHOAI 3.0+)
- Red Hat OpenShift AI
- Node Feature Discovery
- NVIDIA GPU Operator
- LeaderWorkerSet: Required only for large MoE models with expert parallelism
```text
┌─────────────────────────────────────────────────────────────────┐
│                          Gateway API                            │
│                   (openshift-ai-inference)                      │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                        EPP (Scheduler)                          │
│  - Prefix-aware scoring                                         │
│  - Load-aware routing                                           │
│  - KV cache utilization                                         │
└─────────────────────────────┬───────────────────────────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│  vLLM Replica 1   │ │  vLLM Replica 2   │ │  vLLM Replica N   │
│   (KV Cache)      │ │   (KV Cache)      │ │   (KV Cache)      │
└───────────────────┘ └───────────────────┘ └───────────────────┘
```
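The load signals the scheduler consumes come from each vLLM replica's Prometheus `/metrics` endpoint. A minimal sketch of scraping them (the metric names `vllm:num_requests_waiting` and `vllm:gpu_cache_usage_perc` follow recent vLLM releases; verify them against your deployed version):

```python
# Hedged sketch: parse the gauge values a load-aware scheduler might read from
# vLLM's Prometheus /metrics text output. Metric names are assumptions based on
# recent vLLM releases; confirm against your version's /metrics output.

def scrape_gauges(metrics_text: str, wanted: set[str]) -> dict[str, float]:
    """Minimal Prometheus text-format parser for single-value gauge lines."""
    out = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2:
            name = parts[0].split("{")[0]  # strip any label set
            if name in wanted:
                out[name] = float(parts[1])
    return out

sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 4.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.62
"""
gauges = scrape_gauges(sample, {"vllm:num_requests_waiting",
                                "vllm:gpu_cache_usage_perc"})
print(gauges)
```

In production you would scrape these over HTTP from each replica; queue depth and KV cache utilization together give the scheduler the picture it needs to avoid overloaded replicas.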
- Red Hat OpenShift AI Documentation
- LLM-D GitHub Repository
- Gateway API Inference Extension
- vLLM Documentation
This playbook consolidates lessons from real-world LLM-D implementations. Please contribute updates as the tooling evolves or new lessons emerge.