This guide covers advanced deployment scenarios including bare metal installations, MetalLB configuration, custom Gateway setups, and prefill/decode disaggregation.
All artifacts referenced in this guide are included in the playbook's gitops/ directory.
LLM-D requires the Gateway API, which is normally exposed through Service objects of `type: LoadBalancer`. Cloud environments provision external IPs for these automatically, but bare metal clusters require MetalLB to provide the same functionality.
Note: While it's possible to use `type: ClusterIP` with manual exposure methods, this is not recommended and would require a Support Exception.
Coming Soon: OpenShift Route capability for bare metal users is planned for RHOAI 3.2 (RHOAIENG-41558).
# Install MetalLB operator
oc apply -k gitops/operators/metallb-operator
# Wait for operator to be ready
oc wait --for=condition=ready pod -l control-plane=controller-manager -n metallb-system --timeout=300s

# Apply base MetalLB configuration
oc apply -k gitops/instance/metallb-operator/base

The base configuration creates:
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
name: metallb
namespace: metallb-system
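Before defining address pools, it is worth confirming the MetalLB instance actually came up. A quick check, assuming the operator created the standard `controller` Deployment and `speaker` DaemonSet in `metallb-system`:

```bash
# Confirm the MetalLB controller and speakers are rolled out
oc rollout status deployment/controller -n metallb-system --timeout=120s
oc rollout status daemonset/speaker -n metallb-system --timeout=120s
```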
Create an IP pool with addresses available on your network:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: llm-d-pool
namespace: metallb-system
spec:
addresses:
- 192.168.1.240-192.168.1.250 # Adjust for your network

For simple L2 (layer 2) networks:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: llm-d-l2-advertisement
namespace: metallb-system
spec:
ipAddressPools:
- llm-d-pool

For environments using BGP routing:
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
name: llm-d-bgp-peer
namespace: metallb-system
spec:
myASN: 64500
peerASN: 64501
peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
name: llm-d-bgp-advertisement
namespace: metallb-system
spec:
ipAddressPools:
- llm-d-pool

# Check MetalLB pods
oc get pods -n metallb-system
# Check IP pools
oc get ipaddresspool -n metallb-system
# Verify Gateway gets external IP
oc get svc -n openshift-ingress | grep openshift-ai-inference

When a Gateway is configured with `allowedRoutes.namespaces.from: All`, any namespace can create HTTPRoutes that attach to the Gateway. This creates a security risk:
How the Attack Works:
- LLM-D automatically creates HTTPRoutes using the `/<namespace>/<service>` prefix convention
- An attacker in a different namespace can create their own HTTPRoute with the same prefix
- The HTTPRoute created first generally takes precedence for routing
- Alternatively, more specific prefixes take precedence even if created later, so an HTTPRoute with `/<namespace>/<service>/v1` will override `/<namespace>/<service>`
Example Attack:
- Legitimate service: `demo-llm/my-model` creates an HTTPRoute with prefix `/demo-llm/my-model`
- Attacker in `evil-ns` creates an HTTPRoute with prefix `/demo-llm/my-model/v1/chat/completions`
- All chat completion requests now route to the attacker's endpoint
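To spot this kind of conflict, you can audit the HTTPRoutes attached across namespaces and compare their path prefixes; for example:

```bash
# List all HTTPRoutes cluster-wide with their path prefixes to spot overlapping matches
oc get httproute -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PATHS:.spec.rules[*].matches[*].path.value'
```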
Mitigations:
- Limit AllowedRoutes (recommended) - Restrict which namespaces can use the Gateway
- Per-Namespace Gateways - Create separate Gateway instances for each tenant
Restrict which namespaces can use the Gateway to prevent HTTPRoute hijacking:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: openshift-ai-inference
spec:
controllerName: openshift.io/gateway-controller/v1
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: openshift-ai-inference
namespace: openshift-ingress
spec:
gatewayClassName: openshift-ai-inference
listeners:
- name: http
port: 80
protocol: HTTP
allowedRoutes:
namespaces:
from: Selector
selector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: In
values:
- openshift-ingress
- redhat-ods-applications
- demo-llm # Add your namespaces here

The same namespace restriction can be applied to an HTTPS listener with TLS termination:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
labels:
istio.io/rev: openshift-gateway
name: openshift-ai-inference
namespace: openshift-ingress
spec:
gatewayClassName: openshift-ai-inference
listeners:
- name: https
port: 443
protocol: HTTPS
hostname: inference-gateway.apps.example.com
allowedRoutes:
namespaces:
from: Selector
selector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: In
values:
- openshift-ingress
- redhat-ods-applications
- demo-llm
tls:
mode: Terminate
certificateRefs:
- group: ''
kind: Secret
name: gateway-tls-secret
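The `gateway-tls-secret` referenced above must exist in the Gateway's namespace before the HTTPS listener can serve traffic. Assuming you already have a certificate and key for `inference-gateway.apps.example.com` (file paths below are placeholders):

```bash
# Create the TLS secret referenced by the HTTPS listener
oc create secret tls gateway-tls-secret \
  --cert=tls.crt --key=tls.key \
  -n openshift-ingress
```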
For multi-tenant environments, create separate Gateways per namespace:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: tenant-a-gateway
namespace: tenant-a
spec:
gatewayClassName: openshift-ai-inference
listeners:
- name: http
port: 80
protocol: HTTP
allowedRoutes:
namespaces:
from: Same

Reference this Gateway in the LLMInferenceService:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model
namespace: tenant-a
spec:
router:
gateway:
ref:
- name: tenant-a-gateway
namespace: tenant-a
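After creating the per-tenant Gateway, you can confirm it was accepted and programmed before pointing the LLMInferenceService at it; for example:

```bash
# The Gateway should report Programmed=True and (on bare metal) an address from the MetalLB pool
oc get gateway tenant-a-gateway -n tenant-a
oc get gateway tenant-a-gateway -n tenant-a \
  -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}'
```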
Prefill/Decode (P/D) disaggregation separates the compute-intensive prefill phase from the memory-bandwidth-bound decode phase for improved performance.

- High-speed networking: InfiniBand or RoCE recommended
- Multiple GPUs: Separate pools for prefill and decode
- RHOAI 2.25+: P/D support included
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model-pd
namespace: demo-llm
spec:
replicas: 2 # Decode replicas
model:
uri: oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3-1-8b
name: meta-llama/Llama-3.1-8B-Instruct
router:
gateway: {}
route: {}
scheduler: {}
# Main template becomes "Decode" instances
template:
containers:
- name: main
env:
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
- name: VLLM_NIXL_SIDE_CHANNEL_HOST
valueFrom:
fieldRef:
fieldPath: status.podIP
resources:
limits:
nvidia.com/gpu: '1'
requests:
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
# Prefill instances
prefill:
replicas: 2
template:
containers:
- name: main
env:
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
- name: VLLM_NIXL_SIDE_CHANNEL_HOST
valueFrom:
fieldRef:
fieldPath: status.podIP
resources:
limits:
nvidia.com/gpu: '1'
requests:
nvidia.com/gpu: '1'
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
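Once the resource reconciles, you should see separate decode and prefill pods. A quick way to check (pod names are derived from the LLMInferenceService name, so the exact suffixes may differ):

```bash
# Expect two decode pods and two prefill pods for my-model-pd
oc get pods -n demo-llm | grep my-model-pd
```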
For optimal KV cache transfer performance:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model-pd-ib
namespace: demo-llm
annotations:
k8s.v1.cni.cncf.io/networks: roce-p2 # Your RoCE network attachment
spec:
template:
containers:
- name: main
env:
- name: KSERVE_INFER_ROCE
value: "true"
- name: UCX_PROTO_INFO
value: "y" # Enable debug logging
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
- name: VLLM_NIXL_SIDE_CHANNEL_HOST
valueFrom:
fieldRef:
fieldPath: status.podIP
resources:
limits:
nvidia.com/gpu: '1'
rdma/roce_gdr: 1
requests:
nvidia.com/gpu: '1'
rdma/roce_gdr: 1
prefill:
template:
containers:
- name: main
env:
- name: KSERVE_INFER_ROCE
value: "true"
- name: UCX_PROTO_INFO
value: "y"
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
- name: VLLM_NIXL_SIDE_CHANNEL_HOST
valueFrom:
fieldRef:
fieldPath: status.podIP
resources:
limits:
nvidia.com/gpu: '1'
rdma/roce_gdr: 1
requests:
nvidia.com/gpu: '1'
rdma/roce_gdr: 1

Warning: Without InfiniBand/RoCE, KV cache transfer falls back to TCP, resulting in significantly degraded performance.
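Before relying on RoCE, confirm that your nodes actually advertise the `rdma/roce_gdr` resource requested above; one way to check (assuming `jq` is available):

```bash
# Show which nodes expose RDMA resources such as rdma/roce_gdr
oc get nodes -o json | jq '.items[] | {node: .metadata.name, rdma: (.status.allocatable | with_entries(select(.key | startswith("rdma/"))))}'
```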
For RHOAI 2.25/3.0, you must manually configure the EndpointPicker for P/D:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model-pd
spec:
router:
scheduler:
endpointPickerConfig: |
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: pd-profile-handler
config:
threshold: 500 # Requests below this use decode-only
- type: prefill-filter
- type: decode-filter
- type: prefix-cache-scorer
- type: load-aware-scorer
- type: max-score-picker
schedulingProfiles:
- name: prefill
plugins:
- pluginRef: prefill-filter
- pluginRef: load-aware-scorer
weight: 1.0
- pluginRef: max-score-picker
- name: decode
plugins:
- pluginRef: decode-filter
- pluginRef: prefix-cache-scorer
weight: 2.0
- pluginRef: load-aware-scorer
weight: 1.0
- pluginRef: max-score-picker

The EndpointPicker controls how the LLM-D scheduler routes requests to vLLM instances. Understanding these plugins is essential for optimizing routing behavior.
If no endpointPickerConfig is provided, the default is:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: prefix-cache-scorer
- type: load-aware-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
plugins:
- pluginRef: prefix-cache-scorer
weight: 2.0
- pluginRef: load-aware-scorer
weight: 1.0
- pluginRef: max-score-picker

Note: In RHOAI 2.25/3.0, this default is static and not appropriate for P/D disaggregation. Future releases will configure the default dynamically based on the presence of `spec.prefill`.
Plugins follow a three-phase scheduling flow: Filter → Score → Pick
Handlers determine which scheduling profile to use.
| Plugin | Description | Use Case |
|---|---|---|
| `single-profile-handler` | Uses a single profile named `default` for all requests | Standard deployments without P/D |
| `pd-profile-handler` | Selects prefill or decode profiles based on request | P/D disaggregation. Supports `threshold` config for small requests |
| `prefill-header-handler` | Sets prefill profile based on request header | Advanced P/D routing |
Filters exclude endpoints that don't meet requirements.
| Plugin | Description | Use Case |
|---|---|---|
| `prefill-filter` | Only allows prefill-capable endpoints | P/D disaggregation prefill profile |
| `decode-filter` | Only allows decode-capable endpoints | P/D disaggregation decode profile |
| `by-label-selector` | Filters pods using Kubernetes labels | Custom endpoint selection |
Scorers rank eligible endpoints. Higher scores are preferred.
| Plugin | Description | Use Case |
|---|---|---|
| `prefix-cache-scorer` | Scores based on prompt prefix cache presence | Multi-turn conversations, RAG |
| `precise-prefix-cache-scorer` | Real-time KV-cache state tracking (more accurate) | High-throughput with strict SLOs |
| `load-aware-scorer` | Scores based on current load metrics | Even load distribution |
| `kv-cache-utilization-scorer` | Scores based on available KV cache capacity | Long-context workloads |
| `queue-scorer` | Scores based on queue depth/wait time | Latency-sensitive workloads |
| `active-request-scorer` | Scores based on active request count | Simple load balancing |
| `session-affinity-scorer` | Scores based on session history | Stateful conversations |
| `no-hit-lru-scorer` | LRU scoring for cache misses | Even cache distribution |
| `lora-affinity-scorer` | Scores based on loaded LoRA adapters | Multi-adapter deployments |
Pickers make the final endpoint selection.
| Plugin | Description | Use Case |
|---|---|---|
| `max-score-picker` | Selects highest-scoring endpoint | Deterministic "best wins" |
| `random-picker` | Random selection from eligible set | Testing, baseline comparison |
| `weighted-random-picker` | Random selection weighted by scores | Softer optimization |
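As an illustration of composing the three phases, the sketch below follows the same schema as the default config shown earlier and uses only plugin names from the tables above; the plugin mix and weights are examples for illustration, not recommendations:

```yaml
# Sketch: favor endpoints with a warm prefix cache and short queues,
# then pick probabilistically in proportion to the combined score
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: single-profile-handler
  - type: prefix-cache-scorer
  - type: queue-scorer
  - type: weighted-random-picker
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2.0
      - pluginRef: queue-scorer
        weight: 1.0
      - pluginRef: weighted-random-picker
```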
The kv-cache-utilization-scorer plugin requires a workaround due to incorrect default metric configuration (RHOAIENG-41868):
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
spec:
router:
scheduler:
template:
containers:
- name: scheduler
args:
- --kv-cache-usage-percentage-metric
- vllm:kv_cache_usage_perc
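To confirm which KV cache usage metric your vLLM server actually exposes, you can scrape its Prometheus endpoint directly. This assumes the container serves metrics on port 8000 (as in the probe examples later in this guide) and has `curl` available; the pod name is a placeholder:

```bash
# Check the KV cache usage metric name exposed by the vLLM server
oc exec -n demo-llm <vllm-pod> -- curl -s localhost:8000/metrics | grep kv_cache_usage
```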
The `precise-prefix-cache-scorer` requires the scheduler to pull the tokenizer from HuggingFace, which may not work in disconnected environments.

A complete example combining the scheduler metric workaround with a custom EndpointPickerConfig:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model
spec:
router:
scheduler:
template:
containers:
- name: scheduler
args:
- --kv-cache-usage-percentage-metric
- vllm:kv_cache_usage_perc
endpointPickerConfig: |
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: prefix-cache-scorer
- type: load-aware-scorer
- type: kv-cache-utilization-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
plugins:
- pluginRef: prefix-cache-scorer
weight: 2.0
- pluginRef: load-aware-scorer
weight: 1.0
- pluginRef: kv-cache-utilization-scorer
weight: 1.5
- pluginRef: max-score-picker

For more details, see:
For P/D disaggregation and multi-node deployments, high-speed networking is critical for KV cache transfer performance.
Reference: For detailed RoCE configuration on OpenShift, see the (PSAP) Guide to RoCE on OCP for llm-d.
Critical: When customizing `spec.template`, some fields are additive (merged with defaults) while others completely replace the defaults. Misunderstanding this can break your deployment.
The spec.template section overrides values from the kserve-config-llm-template LLMInferenceServiceConfig in the redhat-ods-applications namespace.
Additive (Merged) Fields:
- `env` - Environment variables are merged; your vars are added to the defaults
- `volumes` and `volumeMounts` - Added to the existing mounts

Replacement Fields:
- `args` - Completely replaces the default entrypoint arguments
- `command` - Completely replaces the default command
Why VLLM_ADDITIONAL_ARGS Exists:
Because args is a replacement field, you cannot simply add arguments to the vLLM command line via spec.template.containers[].args - doing so would replace all default arguments and break the startup.
Instead, use the VLLM_ADDITIONAL_ARGS environment variable, which is read by the entrypoint script and appended to the default arguments:
# ✅ CORRECT - Use env var
spec:
template:
containers:
- name: main
env:
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --max-model-len=32768"
# ❌ WRONG - This replaces ALL default args
spec:
template:
containers:
- name: main
args:
- "--max-model-len=32768" # Breaks startup!Large models may require extended startup times:
Large models may require extended startup times:

spec:
template:
containers:
- name: main
livenessProbe:
initialDelaySeconds: 10
periodSeconds: 30
timeoutSeconds: 30
failureThreshold: 5
startupProbe:
httpGet:
path: /health
port: 8000
scheme: HTTPS
initialDelaySeconds: 15
timeoutSeconds: 10
periodSeconds: 10
failureThreshold: 60 # Allow up to 10 minutes for startup

To pass additional vLLM arguments, set `VLLM_ADDITIONAL_ARGS` as described above:

spec:
template:
containers:
- name: main
env:
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --max-model-len=32768"Warning: When applying additional args through the RHOAI Dashboard UI, they will be incorrectly added to the
argssection instead ofVLLM_ADDITIONAL_ARGS, which will break the model server. Always use YAML manifests directly for custom vLLM arguments.
LLM-D does not currently support autoscaling vLLM replicas. Manual replica counts are specified via spec.replicas and spec.prefill.replicas.
Coming Soon: Autoscaling support is planned for RHOAI 3.4.
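Until autoscaling lands, replica changes are a manual edit or patch of the resource. For example, assuming the `llminferenceservice` resource name registered by the CRD:

```bash
# Scale decode replicas to 3 and prefill replicas to 2 by patching the spec directly
oc patch llminferenceservice my-model-pd -n demo-llm --type merge \
  -p '{"spec": {"replicas": 3, "prefill": {"replicas": 2}}}'
```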
spec:
template:
containers:
- name: main
env:
- name: VLLM_ADDITIONAL_ARGS
value: "--disable-uvicorn-access-log --tensor-parallel-size=4"
resources:
limits:
nvidia.com/gpu: '4'
requests:
nvidia.com/gpu: '4'

To require authentication, set the security.opendatahub.io/enable-auth annotation:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model
annotations:
security.opendatahub.io/enable-auth: 'true'

# Get token
TOKEN=$(oc whoami --show-token)
# Make authenticated request
curl -s http://${INFERENCE_URL}/demo-llm/my-model/v1/models \
-H "Authorization: Bearer ${TOKEN}" | jqapiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model
annotations:
security.opendatahub.io/enable-auth: 'false'

Warning: Authentication is broken in RHOAI 3.0 (RHOAIENG-39326) and should be resolved in 3.2. If Connectivity Link is not installed, you must explicitly set `enable-auth: 'false'`. If the annotation is omitted, the deployment will attempt to use Connectivity Link and show errors.
Currently, there is no way to prevent an LLMInferenceService from being exposed outside the cluster. All models deployed via LLMInferenceService will be accessible through the Gateway. Use authentication (enable-auth: 'true') and network policies to control access.
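If you want to at least limit in-cluster access, a NetworkPolicy can restrict traffic to the model pods so that only the Gateway's namespace can reach them. The sketch below is an illustration only: the pod selector label is an assumption you must adapt to what `oc get pods --show-labels` reports for your deployment.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-model-ingress-from-gateway
  namespace: demo-llm
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: my-model   # assumed label; verify on your pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Allow traffic only from the namespace hosting the Gateway
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: openshift-ingress
```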
Warning: KServe and OpenShift AI use "magic" annotations and labels that change deployment behavior. These are often undocumented and can cause unexpected results if set incorrectly.
Known Magic Annotations:
| Annotation | Effect |
|---|---|
| `security.opendatahub.io/enable-auth` | Enables/disables Connectivity Link authentication |
| `opendatahub.io/hardware-profile-name` | Links to a hardware profile for resource defaults |
| `opendatahub.io/hardware-profile-namespace` | Namespace of the hardware profile |
| `opendatahub.io/model-type` | Model type classification (e.g., generative) |
| `k8s.v1.cni.cncf.io/networks` | Attaches secondary networks (e.g., RoCE) |
Known Magic Labels:
| Label | Effect |
|---|---|
| `opendatahub.io/dashboard` | Makes the model visible in RHOAI Dashboard |
| `opendatahub.io/genai-asset` | Marks as a GenAI asset |
When troubleshooting unexpected behavior, check for these annotations/labels - they may be modifying defaults in non-obvious ways.
Configure labels for RHOAI Dashboard visibility:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-model
annotations:
opendatahub.io/connections: my-model
opendatahub.io/hardware-profile-name: nvidia-gpu-serving
opendatahub.io/hardware-profile-namespace: redhat-ods-applications
opendatahub.io/model-type: generative
openshift.io/display-name: My Custom Model
labels:
opendatahub.io/dashboard: "true"

Next steps:
- Automated Deployment for GitOps patterns
- Running Benchmarks to validate performance
- Performance Debugging for optimization