Advanced Deployment Guide

This guide covers advanced deployment scenarios including bare metal installations, MetalLB configuration, custom Gateway setups, and prefill/decode disaggregation.

All artifacts referenced in this guide are included in the playbook's gitops/ directory.

Bare Metal Deployments

Why MetalLB is Required

LLM-D requires the Gateway API, whose implementations typically expose Gateways through Service objects of type: LoadBalancer. Cloud environments provision external IPs for these Services automatically, but bare metal clusters require MetalLB to provide the same functionality.

Note: While it's possible to use type: ClusterIP with manual exposure methods, this is not recommended and would require a Support Exception.

Coming Soon: OpenShift Route capability for bare metal users is planned for RHOAI 3.2 (RHOAIENG-41558).

Installing MetalLB Operator

# Install MetalLB operator
oc apply -k gitops/operators/metallb-operator

# Wait for operator to be ready
oc wait --for=condition=ready pod -l control-plane=controller-manager -n metallb-system --timeout=300s

Configure MetalLB Instance

# Apply base MetalLB configuration
oc apply -k gitops/instance/metallb-operator/base

The base configuration creates:

apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system

Configure IP Address Pool

Create an IP pool with addresses available on your network:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: llm-d-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250  # Adjust for your network

Configure L2 Advertisement

For simple L2 (layer 2) networks:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: llm-d-l2-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - llm-d-pool

Configure BGP Advertisement (Advanced)

For environments using BGP routing:

apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: llm-d-bgp-peer
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64501
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: llm-d-bgp-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - llm-d-pool

Verify MetalLB Configuration

# Check MetalLB pods
oc get pods -n metallb-system

# Check IP pools
oc get ipaddresspool -n metallb-system

# Verify Gateway gets external IP
oc get svc -n openshift-ingress | grep openshift-ai-inference
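
Once the Gateway's Service reports an external IP from the pool, a quick reachability check confirms MetalLB is actually answering for it. A minimal sketch, assuming the generated Service carries the standard gateway.networking.k8s.io/gateway-name label (verify the selector against the oc get svc output above):

# Capture the external IP MetalLB assigned to the Gateway's Service
EXTERNAL_IP=$(oc get svc -n openshift-ingress \
  -l gateway.networking.k8s.io/gateway-name=openshift-ai-inference \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')

# Any HTTP response (even a 404) proves the address is being advertised
curl -sv "http://${EXTERNAL_IP}/" -o /dev/null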

Custom Gateway Configuration

HTTPRoute Hijacking Threat Model

When a Gateway is configured with allowedRoutes.namespaces.from: All, any namespace can create HTTPRoutes that attach to the Gateway. This creates a security risk:

How the Attack Works:

  1. LLM-D automatically creates HTTPRoutes using the /<namespace>/<service> prefix convention
  2. An attacker in a different namespace can create their own HTTPRoute with the same prefix
  3. The HTTPRoute created first generally takes precedence for routing
  4. Alternatively, more specific prefixes take precedence even if created later: an HTTPRoute with /<namespace>/<service>/v1 will override /<namespace>/<service>

Example Attack:

  • Legitimate service: demo-llm/my-model creates HTTPRoute with prefix /demo-llm/my-model
  • Attacker in evil-ns creates HTTPRoute with prefix /demo-llm/my-model/v1/chat/completions
  • All chat completion requests now route to the attacker's endpoint
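
To make this concrete, the hijacking route can be as simple as the following sketch (the attacker's names are hypothetical):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: hijack-completions  # hypothetical attacker-created route
  namespace: evil-ns
spec:
  parentRefs:
    - name: openshift-ai-inference
      namespace: openshift-ingress
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /demo-llm/my-model/v1/chat/completions  # more specific, so it wins
      backendRefs:
        - name: attacker-svc  # hypothetical attacker backend
          port: 8080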

Mitigations:

  1. Limit AllowedRoutes (recommended) - Restrict which namespaces can use the Gateway
  2. Per-Namespace Gateways - Create separate Gateway instances for each tenant

Basic Gateway with Namespace Restrictions

Restrict which namespaces can use the Gateway to prevent HTTPRoute hijacking:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: openshift-ai-inference
spec:
  controllerName: openshift.io/gateway-controller/v1
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: openshift-ai-inference
  namespace: openshift-ingress
spec:
  gatewayClassName: openshift-ai-inference
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchExpressions:
              - key: kubernetes.io/metadata.name
                operator: In
                values:
                  - openshift-ingress
                  - redhat-ods-applications
                  - demo-llm  # Add your namespaces here

HTTPS Gateway with TLS

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  labels:
    istio.io/rev: openshift-gateway
  name: openshift-ai-inference
  namespace: openshift-ingress
spec:
  gatewayClassName: openshift-ai-inference
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: inference-gateway.apps.example.com
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchExpressions:
              - key: kubernetes.io/metadata.name
                operator: In
                values:
                  - openshift-ingress
                  - redhat-ods-applications
                  - demo-llm
      tls:
        mode: Terminate
        certificateRefs:
          - group: ''
            kind: Secret
            name: gateway-tls-secret
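
The certificateRefs entry expects a TLS Secret in the Gateway's namespace. One way to create it (the certificate and key paths are placeholders for your own files):

oc create secret tls gateway-tls-secret \
  -n openshift-ingress \
  --cert=path/to/tls.crt \
  --key=path/to/tls.key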

Per-Namespace Gateway (HTTPRoute Hijacking Mitigation)

For multi-tenant environments, create separate Gateways per namespace:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: tenant-a-gateway
  namespace: tenant-a
spec:
  gatewayClassName: openshift-ai-inference
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Same

Reference this Gateway in the LLMInferenceService:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model
  namespace: tenant-a
spec:
  router:
    gateway:
      ref:
        - name: tenant-a-gateway
          namespace: tenant-a

Prefill/Decode Disaggregation

Prefill/Decode (P/D) disaggregation separates the compute-intensive prefill phase from the memory-bandwidth-bound decode phase for improved performance.

Requirements

  • High-speed networking: InfiniBand or RoCE recommended
  • Multiple GPUs: Separate pools for prefill and decode
  • RHOAI 2.25+: P/D support included

Basic P/D Configuration

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model-pd
  namespace: demo-llm
spec:
  replicas: 2  # Decode replicas
  model:
    uri: oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3-1-8b
    name: meta-llama/Llama-3.1-8B-Instruct
  router:
    gateway: {}
    route: {}
    scheduler: {}
  # Main template becomes "Decode" instances
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        resources:
          limits:
            nvidia.com/gpu: '1'
          requests:
            nvidia.com/gpu: '1'
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
  # Prefill instances
  prefill:
    replicas: 2
    template:
      containers:
        - name: main
          env:
            - name: VLLM_ADDITIONAL_ARGS
              value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
            - name: VLLM_NIXL_SIDE_CHANNEL_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
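
After applying the manifest, prefill and decode instances run as separate pods; a quick check that both pools are up:

# Expect 2 decode and 2 prefill pods, named after the service
oc get pods -n demo-llm | grep my-model-pd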

P/D with InfiniBand/RoCE

For optimal KV cache transfer performance:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model-pd-ib
  namespace: demo-llm
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-p2  # Your RoCE network attachment
spec:
  template:
    containers:
      - name: main
        env:
          - name: KSERVE_INFER_ROCE
            value: "true"
          - name: UCX_PROTO_INFO
            value: "y"  # Enable debug logging
          - name: VLLM_ADDITIONAL_ARGS
            value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        resources:
          limits:
            nvidia.com/gpu: '1'
            rdma/roce_gdr: 1
          requests:
            nvidia.com/gpu: '1'
            rdma/roce_gdr: 1
  prefill:
    template:
      containers:
        - name: main
          env:
            - name: KSERVE_INFER_ROCE
              value: "true"
            - name: UCX_PROTO_INFO
              value: "y"
            - name: VLLM_ADDITIONAL_ARGS
              value: "--disable-uvicorn-access-log --kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}' --block-size 128"
            - name: VLLM_NIXL_SIDE_CHANNEL_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          resources:
            limits:
              nvidia.com/gpu: '1'
              rdma/roce_gdr: 1
            requests:
              nvidia.com/gpu: '1'
              rdma/roce_gdr: 1

Warning: Without InfiniBand/RoCE, KV cache transfer falls back to TCP, resulting in significantly degraded performance.
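
Before deploying, it is worth confirming that your GPU nodes actually advertise the RDMA resource requested above (the rdma/roce_gdr name comes from your RDMA device plugin configuration and may differ on your cluster):

# List allocatable RDMA devices per node; an empty value means the TCP fallback
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.rdma/roce_gdr}{"\n"}{end}'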

Custom EndpointPicker for P/D

For RHOAI 2.25/3.0, you must manually configure the EndpointPicker for P/D:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model-pd
spec:
  router:
    scheduler:
      endpointPickerConfig: |
        apiVersion: inference.networking.x-k8s.io/v1alpha1
        kind: EndpointPickerConfig
        plugins:
        - type: pd-profile-handler
          config:
            threshold: 500  # Requests below this use decode-only
        - type: prefill-filter
        - type: decode-filter
        - type: prefix-cache-scorer
        - type: load-aware-scorer
        - type: max-score-picker
        schedulingProfiles:
        - name: prefill
          plugins:
          - pluginRef: prefill-filter
          - pluginRef: load-aware-scorer
            weight: 1.0
          - pluginRef: max-score-picker
        - name: decode
          plugins:
          - pluginRef: decode-filter
          - pluginRef: prefix-cache-scorer
            weight: 2.0
          - pluginRef: load-aware-scorer
            weight: 1.0
          - pluginRef: max-score-picker

EndpointPicker Plugin Reference

The EndpointPicker controls how the LLM-D scheduler routes requests to vLLM instances. Understanding these plugins is essential for optimizing routing behavior.

Default Configuration (RHOAI 2.25/3.0)

If no endpointPickerConfig is provided, the default is:

apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: prefix-cache-scorer
- type: load-aware-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2.0
  - pluginRef: load-aware-scorer
    weight: 1.0
  - pluginRef: max-score-picker

Note: In RHOAI 2.25/3.0, this default is static and not appropriate for P/D disaggregation. Future releases will dynamically configure based on spec.prefill presence.

Plugin Types

Plugins follow a three-phase scheduling flow: Filter → Score → Pick
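
As a mental model (not the actual scheduler implementation), each profile's plugins compose roughly like this:

# Illustrative sketch only; names mirror the configuration examples above
def pick_endpoint(endpoints, filters, scorers, picker):
    # Filter: drop endpoints that don't meet the profile's requirements
    candidates = [e for e in endpoints if all(f(e) for f in filters)]
    # Score: combine each scorer's output using its configured weight
    scored = {e: sum(w * score(e) for score, w in scorers) for e in candidates}
    # Pick: final selection; e.g. a max-score-picker would be
    #   picker = lambda scored: max(scored, key=scored.get)
    return picker(scored)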

Handlers

Handlers determine which scheduling profile to use.

| Plugin | Description | Use Case |
| --- | --- | --- |
| single-profile-handler | Uses a single profile named default for all requests | Standard deployments without P/D |
| pd-profile-handler | Selects the prefill or decode profile based on the request | P/D disaggregation; supports a threshold config for small requests |
| prefill-header-handler | Sets the prefill profile based on a request header | Advanced P/D routing |

Filters

Filters exclude endpoints that don't meet requirements.

| Plugin | Description | Use Case |
| --- | --- | --- |
| prefill-filter | Only allows prefill-capable endpoints | P/D disaggregation prefill profile |
| decode-filter | Only allows decode-capable endpoints | P/D disaggregation decode profile |
| by-label-selector | Filters pods using Kubernetes labels | Custom endpoint selection |

Scorers

Scorers rank eligible endpoints. Higher scores are preferred.

| Plugin | Description | Use Case |
| --- | --- | --- |
| prefix-cache-scorer | Scores based on prompt prefix cache presence | Multi-turn conversations, RAG |
| precise-prefix-cache-scorer | Real-time KV-cache state tracking (more accurate) | High-throughput with strict SLOs |
| load-aware-scorer | Scores based on current load metrics | Even load distribution |
| kv-cache-utilization-scorer | Scores based on available KV cache capacity | Long-context workloads |
| queue-scorer | Scores based on queue depth/wait time | Latency-sensitive workloads |
| active-request-scorer | Scores based on active request count | Simple load balancing |
| session-affinity-scorer | Scores based on session history | Stateful conversations |
| no-hit-lru-scorer | LRU scoring for cache misses | Even cache distribution |
| lora-affinity-scorer | Scores based on loaded LoRA adapters | Multi-adapter deployments |

Pickers

Pickers make the final endpoint selection.

| Plugin | Description | Use Case |
| --- | --- | --- |
| max-score-picker | Selects the highest-scoring endpoint | Deterministic "best wins" |
| random-picker | Random selection from the eligible set | Testing, baseline comparison |
| weighted-random-picker | Random selection weighted by scores | Softer optimization |

Known Issues

kv-cache-utilization-scorer Bug (RHOAI 3.0/3.2)

The kv-cache-utilization-scorer plugin requires a workaround due to incorrect default metric configuration (RHOAIENG-41868):

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
spec:
  router:
    scheduler:
      template:
        containers:
          - name: scheduler
            args:
              - --kv-cache-usage-percentage-metric
              - vllm:kv_cache_usage_perc

precise-prefix-cache-scorer in Disconnected Environments

The precise-prefix-cache-scorer requires the scheduler to pull the tokenizer from HuggingFace, which may not work in disconnected environments.

Example: Intelligent Routing with KV Cache Awareness

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model
spec:
  router:
    scheduler:
      template:
        containers:
          - name: scheduler
            args:
              - --kv-cache-usage-percentage-metric
              - vllm:kv_cache_usage_perc
      endpointPickerConfig: |
        apiVersion: inference.networking.x-k8s.io/v1alpha1
        kind: EndpointPickerConfig
        plugins:
        - type: single-profile-handler
        - type: prefix-cache-scorer
        - type: load-aware-scorer
        - type: kv-cache-utilization-scorer
        - type: max-score-picker
        schedulingProfiles:
        - name: default
          plugins:
          - pluginRef: prefix-cache-scorer
            weight: 2.0
          - pluginRef: load-aware-scorer
            weight: 1.0
          - pluginRef: kv-cache-utilization-scorer
            weight: 1.5
          - pluginRef: max-score-picker


High-Speed Networking (RoCE)

For P/D disaggregation and multi-node deployments, high-speed networking is critical for KV cache transfer performance.

Reference: For detailed RoCE configuration on OpenShift, see the (PSAP) Guide to RoCE on OCP for llm-d.

Advanced vLLM Configuration

Template List Merge vs Replace Behavior

Critical: When customizing spec.template, some fields are additive (merged with defaults) while others completely replace the defaults. Misunderstanding this can break your deployment.

The spec.template section overrides values from the kserve-config-llm-template LLMInferenceServiceConfig in the redhat-ods-applications namespace.

Additive (Merged) Fields:

  • env - Environment variables are merged; your vars are added to defaults
  • volumes and volumeMounts - Added to existing mounts

Replacement Fields:

  • args - Completely replaces the default entrypoint arguments
  • command - Completely replaces the default command

Why VLLM_ADDITIONAL_ARGS Exists:

Because args is a replacement field, you cannot simply add arguments to the vLLM command line via spec.template.containers[].args; doing so would replace all default arguments and break startup.

Instead, use the VLLM_ADDITIONAL_ARGS environment variable, which is read by the entrypoint script and appended to the default arguments:

# ✅ CORRECT - Use env var
spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--disable-uvicorn-access-log --max-model-len=32768"

# ❌ WRONG - This replaces ALL default args
spec:
  template:
    containers:
      - name: main
        args:
          - "--max-model-len=32768"  # Breaks startup!

Custom Probes for Large Models

Large models may require extended startup times:

spec:
  template:
    containers:
      - name: main
        livenessProbe:
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 30
          failureThreshold: 5
        startupProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTPS
          initialDelaySeconds: 15
          timeoutSeconds: 10
          periodSeconds: 10
          failureThreshold: 60  # Allow up to 10 minutes for startup

Setting Max Model Length

spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--disable-uvicorn-access-log --max-model-len=32768"

Warning: When applying additional args through the RHOAI Dashboard UI, they will be incorrectly added to the args section instead of VLLM_ADDITIONAL_ARGS, which will break the model server. Always use YAML manifests directly for custom vLLM arguments.

Replica Scaling

LLM-D does not currently support autoscaling vLLM replicas. Manual replica counts are specified via spec.replicas and spec.prefill.replicas.

Coming Soon: Autoscaling support is planned for RHOAI 3.4.
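
Until then, set the counts statically in the manifest:

spec:
  replicas: 4      # decode (or standard) replicas
  prefill:
    replicas: 2    # prefill replicas, P/D deployments only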

Multi-GPU with Tensor Parallelism

spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--disable-uvicorn-access-log --tensor-parallel-size=4"
        resources:
          limits:
            nvidia.com/gpu: '4'
          requests:
            nvidia.com/gpu: '4'

Authentication Configuration (RHOAI 3.0+)

Enable Authentication

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model
  annotations:
    security.opendatahub.io/enable-auth: 'true'

Test with Authentication

# Get token
TOKEN=$(oc whoami --show-token)

# Make authenticated request
curl -s http://${INFERENCE_URL}/demo-llm/my-model/v1/models \
  -H "Authorization: Bearer ${TOKEN}" | jq

Disable Authentication (Development Only)

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model
  annotations:
    security.opendatahub.io/enable-auth: 'false'

Warning: Authentication is broken in RHOAI 3.0 (RHOAIENG-39326) and should be resolved in 3.2. If Connectivity Link is not installed, you must explicitly set enable-auth: 'false'. If the annotation is omitted, it will attempt to use Connectivity Link and show errors.

External Exposure Limitation

Currently, there is no way to prevent an LLMInferenceService from being exposed outside the cluster. All models deployed via LLMInferenceService will be accessible through the Gateway. Use authentication (enable-auth: 'true') and network policies to control access.
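
For example, a NetworkPolicy can block direct pod-to-pod ingress so traffic only arrives via the Gateway. A sketch, assuming the Gateway runs in openshift-ingress (adjust the namespace label to your placement):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-only
  namespace: demo-llm
spec:
  podSelector: {}  # all pods in the namespace; narrow with labels if needed
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: openshift-ingress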

Magic Annotations and Labels

Warning: KServe and OpenShift AI use "magic" annotations and labels that change deployment behavior. These are often undocumented and can cause unexpected results if set incorrectly.

Known Magic Annotations:

| Annotation | Effect |
| --- | --- |
| security.opendatahub.io/enable-auth | Enables/disables Connectivity Link authentication |
| opendatahub.io/hardware-profile-name | Links to a hardware profile for resource defaults |
| opendatahub.io/hardware-profile-namespace | Namespace of the hardware profile |
| opendatahub.io/model-type | Model type classification (e.g., generative) |
| k8s.v1.cni.cncf.io/networks | Attaches secondary networks (e.g., RoCE) |

Known Magic Labels:

| Label | Effect |
| --- | --- |
| opendatahub.io/dashboard | Makes the model visible in the RHOAI Dashboard |
| opendatahub.io/genai-asset | Marks the model as a GenAI asset |

When troubleshooting unexpected behavior, check for these annotations and labels; they may be modifying defaults in non-obvious ways.
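
A quick way to audit what is set on a deployed service (the resource name is assumed from the LLMInferenceService kind):

# Dump annotations and labels for review
oc get llminferenceservice my-model -n demo-llm \
  -o jsonpath='{.metadata.annotations}{"\n"}{.metadata.labels}{"\n"}'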

Dashboard Display Labels

Configure labels for RHOAI Dashboard visibility:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-model
  annotations:
    opendatahub.io/connections: my-model
    opendatahub.io/hardware-profile-name: nvidia-gpu-serving
    opendatahub.io/hardware-profile-namespace: redhat-ods-applications
    opendatahub.io/model-type: generative
    openshift.io/display-name: My Custom Model
  labels:
    opendatahub.io/dashboard: "true"

Next Steps