Demos KServe LLMInferenceService + KEDA autoscaling on 4× L40S, driving the decode Deployment off the vLLM waiting-queue metric surfaced through OpenShift's in-cluster Thanos.
Requires RHOAI 3.4+ — earlier versions ship a KServe without
kserve/kserve#4996, which means the LLMInferenceService controller will
fight the HPA for ownership of `spec.replicas` and revert every scale-up
KEDA makes.
Default model: Granite 3.1 8B Instruct (W4A16 quantized) shipped as an
OCI modelcar from registry.redhat.io.
```
llm-d-keda/
├── gateway/
│   └── gateway.yaml           # GatewayClass + Gateway in openshift-ingress
├── llm-inferenceservice/
│   └── model.yaml             # Namespace + LLMInferenceService (v1alpha2, oci:// modelcar)
├── image-prepull/
│   └── daemonset.yaml         # Pre-pull modelcar + vLLM images onto GPU nodes
├── keda/
│   ├── 00-rbac.yaml           # SA + cluster-monitoring-view + `view` in demo-llm
│   ├── 10-trigger-auth.yaml   # KEDA TriggerAuthentication
│   ├── 20-scaledobject.yaml   # ScaledObject — Prometheus trigger on vLLM queue depth
│   └── kustomization.yaml
└── load/
    └── hey.sh                 # load generator (reads LIS .status.url)
```
- OpenShift 4.19+ (ships the `openshift.io/gateway-controller/v1` controller the Gateway API uses).
- NFD Operator and NVIDIA GPU Operator, 4× L40S nodes Ready. Verify with `oc get nodes -l nvidia.com/gpu.present=true` — expect 4.
- RHOAI 3.4+. The default v2 DSC (`datasciencecluster.opendatahub.io/v2`) with `kserve: Managed` is fine as-is. Confirm the KServe CRD advertises v1alpha2 as the storage version (the fix from PR #4996 shipped with v1alpha2):

  ```sh
  oc get crd llminferenceservices.serving.kserve.io -o json | \
    jq '.spec.versions[] | select(.storage==true) | .name'
  # → "v1alpha2"
  ```

  Then check the DSCI's ServiceMesh — LLMInferenceService brings its own Gateway, and a second Istio installed by ODH will collide:

  ```sh
  oc get dsci default-dsci -o jsonpath='{.spec.serviceMesh.managementState}{"\n"}'
  # Expect Unmanaged or Removed. If Managed:
  oc patch dsci default-dsci --type=merge \
    -p '{"spec":{"serviceMesh":{"managementState":"Unmanaged"}}}'
  ```
- Custom Metrics Autoscaler (KEDA) operator installed with a cluster-wide `KedaController`. See *Install the Custom Metrics Autoscaler*.
- User Workload Monitoring enabled, and the UWM stack actually rolled out — this is a common trap:

  ```sh
  cat <<'EOF' | oc apply -f -
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      enableUserWorkload: true
  EOF
  oc -n openshift-user-workload-monitoring rollout status sts/prometheus-user-workload
  ```

  Without this, vLLM metrics never reach Thanos and KEDA won't scale.
CMA is Red Hat's distribution of KEDA. Two gotchas from its install docs:

- The `KedaController` must be named exactly `keda` and live in the same namespace as the operator. Other names are ignored.
- Only one `KedaController` per cluster.
Easiest path is the OperatorHub UI. CLI equivalent:
```sh
oc create namespace openshift-keda

cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-keda
  namespace: openshift-keda
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  channel: stable
  name: openshift-custom-metrics-autoscaler-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

oc get csv -n openshift-keda -w   # wait for Succeeded
```
```sh
cat <<'EOF' | oc apply -f -
apiVersion: keda.sh/v1alpha1
kind: KedaController
metadata:
  name: keda
  namespace: openshift-keda
spec:
  watchNamespace: ""   # "" = cluster-wide; required so it picks up demo-llm
  operator:
    logLevel: info
  metricsServer:
    logLevel: "0"
EOF

oc get pods -n openshift-keda
# expect custom-metrics-autoscaler-operator, keda-operator,
# keda-metrics-apiserver, keda-admission — all 1/1 Running
```

Narrowing `watchNamespace` would stop KEDA from seeing the ScaledObject in `demo-llm`.
Skipping this step just means the first decode pod on each GPU node waits on a multi-GB image pull. Pre-pulling pays that cost once up front, so subsequent KEDA-driven scale-outs are near-instant.
```sh
oc new-project demo-llm   # if not yet created
oc apply -f image-prepull/daemonset.yaml
oc -n demo-llm rollout status ds/gpu-image-prepull --timeout=15m
```

When the rollout reports READY equal to the number of GPU nodes
(expect 4), the modelcar + vLLM images are cached on each one. The
DaemonSet's tiny pause container holds references so the kubelet's
image GC won't evict them.
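If you need to adapt the pre-pull to another cluster, the general shape of such a DaemonSet is sketched below. The repo's `image-prepull/daemonset.yaml` is authoritative; the image references and no-op commands here are placeholders (a `FROM scratch` modelcar ships no shell, so its pull container may need a command that actually exists inside the image):

```yaml
# Sketch only — the repo's image-prepull/daemonset.yaml is authoritative.
# <modelcar-image> / <vllm-image> are placeholders; see the tag-sync note below.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-image-prepull
  namespace: demo-llm
spec:
  selector:
    matchLabels:
      app: gpu-image-prepull
  template:
    metadata:
      labels:
        app: gpu-image-prepull
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"    # GPU nodes only
      initContainers:
      - name: pull-modelcar
        image: <modelcar-image>           # model.yaml URI minus oci://
        command: ["/bin/true"]            # adjust if the image has no shell/binaries
      - name: pull-vllm
        image: <vllm-image>               # the exact digest the LIS spawns
        command: ["/bin/true"]
      containers:
      - name: pause                       # keeps image refs alive past kubelet image GC
        image: registry.k8s.io/pause:3.9
```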
Keep tags in sync. image-prepull/daemonset.yaml must reference the
exact same modelcar URI as llm-inferenceservice/model.yaml (minus
oci://) and the exact same vLLM image digest the LIS spawns. Find the
latter after the LIS is deployed with:
```sh
oc get deploy model-kserve -n demo-llm \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```

and update the DaemonSet if it doesn't match.
```sh
oc apply -f gateway/gateway.yaml
oc get gatewayclass openshift-default
oc get gateway -n openshift-ingress openshift-ai-inference
```

Wait until the Gateway reports PROGRAMMED=True. The OCP Ingress Operator
provisions a backing LoadBalancer / Route for it automatically.
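For orientation, `gateway/gateway.yaml` amounts to roughly the following. This is a sketch, not the repo file — the listener and `allowedRoutes` details are assumptions; only the names `openshift-default` and `openshift-ai-inference` come from the commands above:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: openshift-ai-inference
  namespace: openshift-ingress
spec:
  gatewayClassName: openshift-default   # reconciled by the OCP Ingress Operator
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All   # let the LIS attach routes from demo-llm
```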
The repo defaults to the Granite 3.1 8B Instruct (W4A16) quantized
modelcar. Swap in a different model in llm-inferenceservice/model.yaml
if desired (and keep image-prepull/daemonset.yaml in sync), then:
```sh
oc apply -f llm-inferenceservice/model.yaml
```

Wait for the LIS to reconcile (~2-4 minutes for first-time loading of the quantized 8B model):
```sh
oc get llminferenceservice -n demo-llm model -o yaml | yq '.status.conditions'
# Expect Ready=True, MainWorkloadReady=True, SchedulerWorkloadReady=True
```

Expect one restart per decode pod. There's a one-shot race between the modelcar sidecar (which sets up `/mnt/models` as a symlink into its own rootfs) and the vLLM main container. vLLM can start first, find `/mnt/models` unresolvable, error out once, then the kubelet restarts it. Pods stabilize at `restartCount: 1`. This repeats for every new decode pod that KEDA brings up. Noisy, not fatal.
Smoke-test via the external URL published on the LIS status:
```sh
URL=$(oc get llminferenceservice -n demo-llm model -o jsonpath='{.status.url}')
curl -s ${URL}/v1/models | jq
```

Confirm the served model id matches `granite-3.1-8b-instruct`
(the `name:` value in the LIS spec). If it differs, update `load/hey.sh`
to send the right id.
The LIS creates a decode Deployment and a router/scheduler Deployment. KEDA must target the decode one (the Deployment that requests a GPU):
```sh
oc get deploy -n demo-llm \
  -l app.kubernetes.io/component=llminferenceservice-workload,app.kubernetes.io/name=model
```

The default name is `model-kserve` — already wired into
`keda/20-scaledobject.yaml`. If yours differs, update
`scaleTargetRef.name` before Step 4.
The ...-router-scheduler Deployment is the EPP — do not point KEDA
at it.
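For reference, `keda/20-scaledobject.yaml` ties this decode Deployment to the queue-depth metric. The sketch below reconstructs the key fields from values stated elsewhere in this README (target name, threshold 5, max 4 replicas, 5-minute cooldown, `unsafeSsl`); the `serverAddress` and the auth reference name are assumptions — check the repo file:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-d-decode
  namespace: demo-llm
spec:
  scaleTargetRef:
    name: model-kserve            # the decode Deployment found above
  minReplicaCount: 1
  maxReplicaCount: 4              # one pod per L40S
  cooldownPeriod: 300             # the 5-minute scale-down window
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc:9091   # assumed
      query: sum(kserve_vllm:num_requests_waiting{namespace="demo-llm"})
      threshold: "5"
      unsafeSsl: "true"           # demo shortcut; see the notes at the end
    authenticationRef:
      name: keda-trigger-auth     # name assumed; from keda/10-trigger-auth.yaml
```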
The LIS controller installs a PodMonitor on the decode pods. With UWM
on, the metric lands in Thanos after ~60s:
```sh
TOKEN=$(oc whoami -t)
oc -n openshift-monitoring exec -c thanos-query deploy/thanos-querier -- \
  curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=sum(kserve_vllm:num_requests_waiting{namespace="demo-llm"})' \
  https://localhost:9091/api/v1/query | jq
```

Expect a scalar back (0 when idle). If `"result": []`:

```sh
oc get podmonitor,servicemonitor -n demo-llm      # PodMonitor must exist
oc get pods -n openshift-user-workload-monitoring # UWM prometheus must be Running
```

Why the `kserve_` prefix? The LIS-installed PodMonitor applies a `metricRelabelings` rule that prefixes every scraped metric name with `kserve_`. So a pod emits `vllm:num_requests_waiting` but the series in Thanos is `kserve_vllm:num_requests_waiting`. The ScaledObject uses the prefixed name.

Why not EPP flow-control metrics? The upstream Gateway API Inference Extension EPP shipped with this KServe doesn't support the `flowControl` feature gate, so `inference_extension_flow_control_queue_size` isn't emitted. `kserve_vllm:num_requests_waiting` gives us the same scale-out signal from the vLLM pods directly.
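The prefixing relabel looks roughly like this inside the PodMonitor's scrape endpoint (a sketch of the standard Prometheus relabel pattern, not the controller's exact manifest):

```yaml
# Under spec.podMetricsEndpoints[].metricRelabelings in the PodMonitor:
metricRelabelings:
- sourceLabels: [__name__]
  regex: (.+)                 # match every scraped series name
  targetLabel: __name__
  replacement: kserve_${1}    # vllm:num_requests_waiting → kserve_vllm:num_requests_waiting
```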
```sh
oc apply -k keda/
oc get scaledobject,hpa -n demo-llm
```

Expect a ScaledObject/`llm-d-decode` with READY=True, ACTIVE=False when
idle, and an HPA/`keda-hpa-llm-d-decode` with REPLICAS=1 and external
metric `0/5 (avg)`. Under load, ACTIVE flips to True and REPLICAS
climbs toward `maxReplicaCount`.
Terminal A:

```sh
watch -n 2 'oc get pods,hpa,scaledobject -n demo-llm'
```

Terminal B:

```sh
NAMESPACE=demo-llm MODEL=granite-3.1-8b-instruct ./load/hey.sh
```

Expected: waiting-queue metric climbs → HPA bumps replicas → Deployment scales 1 → 2 → 3 → 4 (one pod per L40S) → load stops → 5-minute cooldown window → scale back to 1.

The HPA's TARGETS column reports the per-pod average with a millicore-style
suffix: e.g. `40500m/5` means the average queue depth is 40.5 against a
target of 5.
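The `m` suffix is Kubernetes quantity notation (thousandths), not milliseconds. A quick sketch of the conversion, assuming a value copied from the TARGETS column:

```shell
# Convert an HPA millicore-style metric value (e.g. "40500m") to a decimal.
raw="40500m"
awk -v v="${raw%m}" 'BEGIN { printf "%.1f\n", v / 1000 }'
# → 40.5
```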
- Leave `spec.replicas` unset on the LIS. Per kserve/kserve#4996 the LIS controller preserves externally-managed replicas only when the user doesn't declare them. Setting any value (even `1`) flips the controller back to overwrite mode and resurrects the HPA-vs-LIS race. This is why RHOAI 3.4 is the minimum — the PR didn't ship earlier.
- Modelcar vs HF pull. `spec.model.uri: oci://...` uses a container image whose filesystem holds the model artifacts, pulled once per node. Compare to `hf://...`, which redownloads per pod unless you wire up a shared PVC. Build your own modelcar with a Dockerfile like:

  ```dockerfile
  FROM registry.access.redhat.com/ubi9-minimal:latest AS copier
  RUN microdnf install -y python3-pip && pip3 install huggingface_hub
  RUN huggingface-cli download <model-id> --local-dir /models

  FROM scratch
  COPY --from=copier /models /models
  ```

  Or use a prebuilt one from https://github.com/rh-aiservices-bu/modelcar-catalog.
- Modelcar/vLLM startup race. vLLM can briefly start before the modelcar sidecar populates the shared-PID-namespace symlink at `/mnt/models`. First boot exits 1, the kubelet restarts it, and the second start is clean. `restartCount: 1` is normal. For a production fix, file an upstream KServe issue — the sidecar should use Kubernetes 1.28+ native init-sidecar ordering so the main container waits.
- GPU capacity has to match `maxReplicaCount`. If the cluster's GPU MachineSet has fewer Ready nodes than `maxReplicaCount: 4`, the pod(s) that can't schedule will sit Pending. Check with:

  ```sh
  oc get nodes -l nvidia.com/gpu.present=true
  ```

  If this cluster has a `MachineAutoscaler` on the GPU pool, the Cluster Autoscaler will provision additional nodes on demand (slow — node-boot timescale). For a demo where you want all 4 nodes permanently, pin the MachineSet and delete the MachineAutoscaler:

  ```sh
  oc delete machineautoscaler <gpu-ms-name> -n openshift-machine-api
  oc scale machineset <gpu-ms-name> -n openshift-machine-api --replicas=4
  ```
- Metric labels. The PromQL `kserve_vllm:num_requests_waiting{namespace="demo-llm"}` filters by namespace. If you rename the namespace, update `keda/20-scaledobject.yaml` to match.
- TLS to Thanos. `unsafeSsl: "true"` in the trigger keeps the demo short. For production, swap to a CA-bundle Secret (OCP auto-injects the service CA into the `openshift-service-ca.crt` ConfigMap — copy it into a Secret) plus `parameter: ca` on the `TriggerAuthentication`.
- Scale to zero. Set `minReplicaCount: 0`. Because this EPP lacks a flow-control buffering layer, the first request after cold-start hits the Gateway with no pods — expect connection resets until one becomes Ready. If that's a showstopper, keep `minReplicaCount: 1`.
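The production TLS variant from the notes can be sketched as follows. The Secret and TriggerAuthentication names are assumptions — match them to your `keda/10-trigger-auth.yaml` and the ScaledObject's `authenticationRef`:

```yaml
# First copy the auto-injected service CA into a Secret KEDA can reference:
#   oc -n demo-llm extract configmap/openshift-service-ca.crt --keys=service-ca.crt --to=.
#   oc -n demo-llm create secret generic thanos-ca --from-file=ca=service-ca.crt
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth   # name assumed
  namespace: demo-llm
spec:
  secretTargetRef:
  - parameter: ca           # replaces unsafeSsl: "true" in the Prometheus trigger
    name: thanos-ca
    key: ca
```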