rh-aiservices-bu/keda-llm-d-autoscaling

llm-d autoscaling with KEDA on OpenShift AI (4× L40S)

Demonstrates KServe LLMInferenceService + KEDA autoscaling on 4× L40S GPUs, driving the decode Deployment off the vLLM waiting-queue metric surfaced through OpenShift's in-cluster Thanos.

Requires RHOAI 3.4+ — earlier versions ship a KServe without kserve/kserve#4996, which means the LLMInferenceService controller will fight the HPA for ownership of spec.replicas and KEDA will never keep a scale-up decision.

Default model: Granite 3.1 8B Instruct (W4A16 quantized) shipped as an OCI modelcar from registry.redhat.io.

llm-d-keda/
├── gateway/
│   └── gateway.yaml                     # GatewayClass + Gateway in openshift-ingress
├── llm-inferenceservice/
│   └── model.yaml                       # Namespace + LLMInferenceService (v1alpha2, oci:// modelcar)
├── image-prepull/
│   └── daemonset.yaml                   # Pre-pull modelcar + vLLM images onto GPU nodes
├── keda/
│   ├── 00-rbac.yaml                     # SA + cluster-monitoring-view + `view` in demo-llm
│   ├── 10-trigger-auth.yaml             # KEDA TriggerAuthentication
│   ├── 20-scaledobject.yaml             # ScaledObject — Prometheus trigger on vLLM queue depth
│   └── kustomization.yaml
└── load/hey.sh                          # load generator (reads LIS .status.url)

Prerequisites

  • OpenShift 4.19+ (ships the openshift.io/gateway-controller/v1 controller the Gateway API uses).
  • NFD Operator and NVIDIA GPU Operator, 4× L40S nodes Ready. Verify with oc get nodes -l nvidia.com/gpu.present=true — expect 4.
  • RHOAI 3.4+. The default v2 DSC (datasciencecluster.opendatahub.io/v2) with kserve: Managed is fine as-is. Confirm the KServe CRD advertises v1alpha2 as the storage version (the fix from PR #4996 shipped with v1alpha2):
    oc get crd llminferenceservices.serving.kserve.io -o json | \
      jq '.spec.versions[] | select(.storage==true) | .name'
    # → "v1alpha2"
    Then check the DSCI's ServiceMesh — LLMInferenceService brings its own Gateway, and a second Istio installed by ODH will collide:
    oc get dsci default-dsci -o jsonpath='{.spec.serviceMesh.managementState}{"\n"}'
    # Expect Unmanaged or Removed. If Managed:
    oc patch dsci default-dsci --type=merge \
      -p '{"spec":{"serviceMesh":{"managementState":"Unmanaged"}}}'
  • Custom Metrics Autoscaler (KEDA) operator installed with a cluster-wide KedaController. See Install the Custom Metrics Autoscaler.
  • User Workload Monitoring enabled, and the UWM stack actually rolled out — this is a common trap:
    cat <<'EOF' | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true
    EOF
    oc -n openshift-user-workload-monitoring rollout status sts/prometheus-user-workload
    Without this, vLLM metrics never reach Thanos and KEDA won't scale.

Install the Custom Metrics Autoscaler (KEDA)

CMA is Red Hat's distribution of KEDA. Two gotchas from its install docs:

  • The KedaController must be named exactly keda and live in the same namespace as the operator. Other names are ignored.
  • Only one KedaController per cluster.

Easiest path is the OperatorHub UI. CLI equivalent:

oc create namespace openshift-keda

cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-keda
  namespace: openshift-keda
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  channel: stable
  name: openshift-custom-metrics-autoscaler-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

oc get csv -n openshift-keda -w           # wait for Succeeded

cat <<'EOF' | oc apply -f -
apiVersion: keda.sh/v1alpha1
kind: KedaController
metadata:
  name: keda
  namespace: openshift-keda
spec:
  watchNamespace: ""        # "" = cluster-wide; required so it picks up demo-llm
  operator:
    logLevel: info
  metricsServer:
    logLevel: "0"
EOF

oc get pods -n openshift-keda             # expect custom-metrics-autoscaler-operator,
                                          # keda-operator, keda-metrics-apiserver,
                                          # keda-admission — all 1/1 Running

Narrowing watchNamespace would stop KEDA from seeing the ScaledObject in demo-llm.

Step 0 — Pre-pull model + runtime images on GPU nodes (recommended)

Skipping this just means the first decode pod on each GPU node waits on a multi-GB image pull. Pre-pulling amortizes that once, so KEDA-driven scale-out is near-instant after.

oc new-project demo-llm                       # if not yet created
oc apply -f image-prepull/daemonset.yaml
oc -n demo-llm rollout status ds/gpu-image-prepull --timeout=15m

When the rollout reports READY equal to the number of GPU nodes (expect 4), the modelcar + vLLM images are cached on each one. The DaemonSet's tiny pause container holds references so the kubelet's image GC won't evict them.

Keep tags in sync. image-prepull/daemonset.yaml must reference the exact same modelcar URI as llm-inferenceservice/model.yaml (minus oci://) and the exact same vLLM image digest the LIS spawns. Find the latter after the LIS is deployed with:

oc get deploy model-kserve -n demo-llm \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

and update the DaemonSet if it doesn't match.
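That drift check can be scripted. A sketch, assuming the DaemonSet is named gpu-image-prepull and the relevant container is first in each pod spec (adjust the jsonpath indexes to your actual manifests):

```shell
DS_IMG=$(oc -n demo-llm get ds gpu-image-prepull \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
LIS_IMG=$(oc -n demo-llm get deploy model-kserve \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
if [ "$DS_IMG" = "$LIS_IMG" ]; then
  echo "in sync: $DS_IMG"
else
  echo "MISMATCH: DaemonSet pre-pulls $DS_IMG but the LIS runs $LIS_IMG"
fi
```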

Step 1 — Deploy the Gateway

oc apply -f gateway/gateway.yaml
oc get gatewayclass openshift-default
oc get gateway -n openshift-ingress openshift-ai-inference

Wait until the Gateway reports PROGRAMMED=True. The OCP Ingress Operator provisions a backing LoadBalancer / Route for it automatically.

Step 2 — Deploy the LLMInferenceService

The repo defaults to the Granite 3.1 8B Instruct (W4A16) quantized modelcar. Swap in a different model in llm-inferenceservice/model.yaml if desired (and keep image-prepull/daemonset.yaml in sync), then:

oc apply -f llm-inferenceservice/model.yaml

Wait for the LIS to reconcile (~2-4 minutes for first-time loading of the quantized 8B model):

oc get llminferenceservice -n demo-llm model -o yaml | yq '.status.conditions'
# Expect Ready=True, MainWorkloadReady=True, SchedulerWorkloadReady=True

Expect one restart per decode pod. There's a one-shot race between the modelcar sidecar (which sets up /mnt/models as a symlink into its own rootfs) and the vLLM main container. vLLM can start first, find /mnt/models unresolvable, error out once, then the kubelet restarts it. Pods stabilize at restartCount: 1. This repeats for every new decode pod that KEDA brings up. Noisy, not fatal.

Smoke-test via the external URL published on the LIS status:

URL=$(oc get llminferenceservice -n demo-llm model -o jsonpath='{.status.url}')
curl -s ${URL}/v1/models | jq

Confirm the served model id matches granite-3.1-8b-instruct (the name: value in the LIS spec). If it differs, update load/hey.sh to send the right id.
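That id check can also be scripted. A sketch, assuming the standard OpenAI-style /v1/models response shape (`.data[0].id`):

```shell
URL=$(oc get llminferenceservice -n demo-llm model -o jsonpath='{.status.url}')
SERVED=$(curl -s ${URL}/v1/models | jq -r '.data[0].id')
if [ "$SERVED" != "granite-3.1-8b-instruct" ]; then
  echo "served id is ${SERVED}; update MODEL in load/hey.sh to match"
fi
```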

Step 3 — Verify the workload & metrics

3a. Confirm the decode Deployment name matches the ScaledObject

The LIS creates a decode Deployment and a router/scheduler Deployment. KEDA must target the decode one (the Deployment that requests a GPU):

oc get deploy -n demo-llm \
  -l app.kubernetes.io/component=llminferenceservice-workload,app.kubernetes.io/name=model

The default name is model-kserve — already wired into keda/20-scaledobject.yaml. If yours differs, update scaleTargetRef.name before Step 4.

The ...-router-scheduler Deployment is the EPP — do not point KEDA at it.

3b. Confirm vLLM metrics reach Thanos

The LIS controller installs a PodMonitor on the decode pods. With UWM on, the metric lands in Thanos after ~60s:

TOKEN=$(oc whoami -t)

oc -n openshift-monitoring exec -c thanos-query deploy/thanos-querier -- \
  curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=sum(kserve_vllm:num_requests_waiting{namespace="demo-llm"})' \
  https://localhost:9091/api/v1/query | jq

Expect a scalar back (0 when idle). If "result": []:

oc get podmonitor,servicemonitor -n demo-llm        # PodMonitor must exist
oc get pods -n openshift-user-workload-monitoring   # UWM prometheus must be Running

Why kserve_ prefix? The LIS-installed PodMonitor applies a metricRelabelings rule that prefixes every scraped metric name with kserve_. So a pod emits vllm:num_requests_waiting but the series in Thanos is kserve_vllm:num_requests_waiting. The ScaledObject uses the prefixed name.
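The relabeling rule likely resembles the following sketch (not copied from the controller; the regex/replacement pair here is an assumption, shown only to illustrate the mechanism):

```yaml
# Inside the LIS-installed PodMonitor spec (sketch):
podMetricsEndpoints:
- port: metrics
  metricRelabelings:
  - sourceLabels: [__name__]
    regex: (.+)
    targetLabel: __name__
    replacement: kserve_$1   # vllm:num_requests_waiting -> kserve_vllm:num_requests_waiting
```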

Why not EPP flow-control metrics? The upstream Gateway API Inference Extension EPP shipped with this KServe doesn't support the flowControl feature gate, so inference_extension_flow_control_queue_size isn't emitted. kserve_vllm:num_requests_waiting gives us the same scale-out signal from the vLLM pods directly.

Step 4 — Apply KEDA

oc apply -k keda/
oc get scaledobject,hpa -n demo-llm

Expect a ScaledObject/llm-d-decode with READY=True, ACTIVE=False when idle, and an HPA/keda-hpa-llm-d-decode with REPLICAS=1 and external metric 0/5 (avg). Under load, ACTIVE flips to True and REPLICAS climbs toward maxReplicaCount.
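For orientation, the shape of keda/20-scaledobject.yaml is roughly the following. The serverAddress, authenticationRef name, and cooldownPeriod are assumptions inferred from this README, so trust the file in the repo over this sketch:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-d-decode
  namespace: demo-llm
spec:
  scaleTargetRef:
    name: model-kserve            # the decode Deployment from Step 3a
  minReplicaCount: 1
  maxReplicaCount: 4              # one pod per L40S
  cooldownPeriod: 300             # the 5-minute scale-down window in Step 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc:9092
      query: sum(kserve_vllm:num_requests_waiting{namespace="demo-llm"})
      threshold: "5"
      unsafeSsl: "true"
    authenticationRef:
      name: keda-trigger-auth     # from keda/10-trigger-auth.yaml (name assumed)
```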

Step 5 — Drive load and watch it scale

Terminal A:

watch -n 2 'oc get pods,hpa,scaledobject -n demo-llm'

Terminal B:

NAMESPACE=demo-llm MODEL=granite-3.1-8b-instruct ./load/hey.sh

Expected: waiting-queue metric climbs → HPA bumps replicas → Deployment scales 1 → 2 → 3 → 4 (one pod per L40S) → load stops → 5-minute cooldown window → scale back to 1.

The HPA's TARGETS column reports per-pod average with a millicore-style suffix: e.g. 40500m/5 means average queue depth is 40.5 against a target of 5.
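The arithmetic behind that column can be sketched in plain shell (the function names here are illustrative, not part of any tooling; the formula is the standard HPA rule desired = ceil(currentReplicas × currentValue / target), then clamped to min/maxReplicaCount):

```shell
millis_to_value() {               # "40500m" -> 40.5
  awk -v m="${1%m}" 'BEGIN { printf "%g\n", m / 1000 }'
}
desired_replicas() {              # current avg target -> desired (unclamped)
  awk -v c="$1" -v v="$2" -v t="$3" \
    'BEGIN { d = c * v / t; if (d > int(d)) d = int(d) + 1; print d }'
}
millis_to_value 40500m            # -> 40.5
desired_replicas 1 40.5 5         # -> 9, then clamped to maxReplicaCount: 4
```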

Notes / gotchas

  • Leave spec.replicas unset on the LIS. Per kserve/kserve#4996 the LIS controller preserves externally-managed replicas only when the user doesn't declare them. Setting any value (even 1) flips the controller back to overwrite mode and resurrects the HPA-vs-LIS race. This is why RHOAI 3.4 is the minimum — the PR didn't ship earlier.
  • Modelcar vs HF pull. spec.model.uri: oci://... uses a container image whose filesystem holds the model artifacts, pulled once per node. Compare to hf://... which redownloads per pod unless you wire up a shared PVC. Build your own modelcar with a Dockerfile like:
    FROM registry.access.redhat.com/ubi9-minimal:latest AS copier
    RUN microdnf install -y python3-pip && pip3 install huggingface_hub
    RUN huggingface-cli download <model-id> --local-dir /models
    FROM scratch
    COPY --from=copier /models /models
    Or use a prebuilt one from https://github.com/rh-aiservices-bu/modelcar-catalog.
  • Modelcar/vLLM startup race. vLLM can briefly start before the modelcar sidecar populates the shared-PID-namespace symlink at /mnt/models. First-boot exits 1, kubelet restarts, second start is clean. restartCount: 1 is normal. For a production fix, file an upstream KServe issue — the sidecar should use Kubernetes 1.28+ native init-sidecar ordering so the main container waits.
  • GPU capacity has to match maxReplicaCount. If the cluster's GPU MachineSet has fewer Ready nodes than maxReplicaCount: 4, the pod(s) that can't schedule will sit Pending. Check with:
    oc get nodes -l nvidia.com/gpu.present=true
    If this cluster has a MachineAutoscaler on the GPU pool, the Cluster Autoscaler will provision additional nodes on demand (slow — node-boot timescale). For a demo where you want all 4 nodes permanently, pin the MachineSet and delete the MachineAutoscaler:
    oc delete machineautoscaler <gpu-ms-name> -n openshift-machine-api
    oc scale machineset <gpu-ms-name> -n openshift-machine-api --replicas=4
  • Metric labels. The PromQL kserve_vllm:num_requests_waiting{namespace="demo-llm"} filters by namespace. If you rename the namespace, update keda/20-scaledobject.yaml to match.
  • TLS to Thanos. unsafeSsl: "true" in the trigger keeps the demo short. Swap to a CA-bundle Secret (OCP auto-injects the service CA into openshift-service-ca.crt ConfigMap — copy into a Secret) + parameter: ca on the TriggerAuthentication for production.
  • Scale to zero. Set minReplicaCount: 0. Because this EPP lacks a flow-control buffering layer, the first request after cold-start hits the Gateway with no pods — expect connection resets until one becomes Ready. If that's a showstopper, keep minReplicaCount: 1.
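For the TLS bullet above, a hedged sketch of the production path: the Secret name and TriggerAuthentication name below are placeholders, and any bearer-token parameters already defined in keda/10-trigger-auth.yaml should be kept alongside the ca entry.

```shell
# Copy the auto-injected service CA into a Secret KEDA can reference.
oc -n demo-llm extract configmap/openshift-service-ca.crt --keys=service-ca.crt --to=/tmp
oc -n demo-llm create secret generic thanos-ca --from-file=ca.crt=/tmp/service-ca.crt

cat <<'EOF' | oc apply -f -
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth     # match the name the ScaledObject references
  namespace: demo-llm
spec:
  secretTargetRef:
  - parameter: ca             # replaces unsafeSsl: "true" in the trigger
    name: thanos-ca
    key: ca.crt
EOF
```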
