
Configure the vllm deployment with best practices for startup #550

Merged

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 20, 2025

Model servers are production-facing services and should be configured in Kubernetes not to drop traffic or cause request errors during graceful rollout or predictable interruption (such as anticipated node maintenance by an admin).

Configure the default gpu_deployment.yaml example to accomplish hitless updates and graceful shutdown, and document why these values were chosen so that others can derive best practices for model servers behind InferencePools. We should also update our documentation to describe these best practices once this configuration has been more widely validated.

This PR:

  • Tunes liveness and readiness checks based on the vLLM implementation and best practices
  • Adds a startupProbe (see Improve vLLM upstream health checks to only pass when models are servable #558) that blocks regular liveness and readiness from being run until the model server has actually loaded the model, which prevents requests from failing when new pods are created
  • Adds a preStop hook with a sleep that is tuned to the longest propagation period we expect before any supported gateway takes us out of rotation
  • Tunes the terminationGracePeriodSeconds to match the expected preStop plus maximum drain latency plus safety buffer
  • Annotates all the fields with a description of why these values were chosen, so others can learn from them when new model servers are added.
  • Sets enableServiceLinks: false, which prevents a crash loop if the user deploys a Service named vllm in their namespace (the injected VLLM_PORT service-link environment variable conflicts with vLLM's own configuration). A consolidated sketch of these fields follows this list.
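
Taken together, the tuned fields look roughly like the sketch below. The probe timings, the grace-period value, and the model-check command are illustrative assumptions; only the 30s preStop sleep is stated explicitly later in this thread.

spec:
  template:
    spec:
      # Avoid service-link env vars (e.g. VLLM_PORT) that break vLLM startup.
      enableServiceLinks: false
      # preStop sleep + maximum drain latency + safety buffer (illustrative).
      terminationGracePeriodSeconds: 120
      containers:
      - name: vllm
        startupProbe:
          # Block liveness/readiness until the model is actually servable.
          exec:
            command: [/bin/bash, -c, 'curl -s http://localhost:8000/v1/models | grep -q <base-model>']
          periodSeconds: 1
          failureThreshold: 600   # allow up to ~10 minutes for model load
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
          failureThreshold: 1
        lifecycle:
          preStop:
            # Wait out the longest gateway propagation period before SIGTERM.
            sleep:
              seconds: 30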

Some of the documentation here should be translated to the vLLM upstream charts and documentation; these numbers are not arbitrary and can be widely applied.

This change leverages a 1.30 beta feature (preStop.sleep), but 1.30 is a year old and I think it's a reasonable floor for examples of best practices.
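
For clusters older than 1.30 (or with the PodLifecycleSleepAction feature gate disabled), a sketch of an equivalent fallback, assuming a sleep binary exists in the image:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sleep", "30"]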

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 20, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 20, 2025

netlify bot commented Mar 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: af61544
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67e1ad3a7894b90007824716
😎 Deploy Preview: https://deploy-preview-550--gateway-api-inference-extension.netlify.app

@smarterclayton smarterclayton force-pushed the failure_thresholds branch 4 times, most recently from 0bdaaf8 to 310ffad Compare March 20, 2025 19:26
exec:
# Verify that our core model is loaded before we consider startup successful
command:
- /bin/bash
Collaborator

Just curious, I've never used this before; making sure I understand:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#execaction-v1-core suggests this runs in the container root. So this wouldn't work OOB on something like an alpine based image... but that's fine because the kubelet is still the probe manager, so even if this just instafailed, in 10 min it would fail & crashloop and an admin would very likely come investigate. Is that about right?

Contributor Author

It runs in the container's namespaces, so it's as if it were run inside the container.

If it were in a "distroless" / "single binary" container, the exec action would fail because bash would not be available in the container image. In general, single binaries are expected to embed any of this function in their code, so you'd generally be calling an http endpoint, or a subcommand / flag on the binary itself (think mytool check health that encodes this logic). An advantage of exec actions is nothing is bound to the network, a disadvantage is that they are more expensive because setting up the namespaces takes a few tens of ms.

If someone rolled out a new version of vllm image that dropped everything except the single binary, the deployment would stop because the new pod wouldn't start (would crashloop).

Contributor

Why not use httpGet?

Contributor Author

Because if the response doesn't contain hf we shouldn't pass the probe (the outcome we want is: "don't consider a model server that is built around a base model ready until the base model is ready"). That requires inspecting the response body, which httpGet cannot do.
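
A sketch of such a body-inspecting check (the model identifier and port are hypothetical placeholders; the merged manifest's actual command may differ):

startupProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # Only pass once the base model is listed as servable.
    - curl -s http://localhost:8000/v1/models | grep -q '<base-model-id>'
  periodSeconds: 1
  failureThreshold: 600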

@smarterclayton smarterclayton force-pushed the failure_thresholds branch 3 times, most recently from 653c54d to 05d7a27 Compare March 20, 2025 20:28
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@smarterclayton smarterclayton changed the title WIP - Configure the vllm deployment with best practices for startup Configure the vllm deployment with best practices for startup Mar 21, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
@smarterclayton
Copy link
Contributor Author

/hold

still testing the timeout

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2025
@smarterclayton
Copy link
Contributor Author

/hold cancel

The 30s timeout was safe on GKE; I suspect EG/kgateway will be able to handle that (fewer layers to program?).

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2025
@smarterclayton smarterclayton force-pushed the failure_thresholds branch 2 times, most recently from ae983c4 to 08e2175 Compare March 21, 2025 20:20
We want to recommend best practices for deployments of model servers
under an InferencePool. Use the need to gracefully drain without
client visible errors during rollout ("hitless" updates) to
annotate the yaml with strong opinions on best practices.

This configuration was experimentally verified with the GKE Inference
Gateway, whose propagation delay should be longer than that of other gateways.
@liu-cong
Copy link
Contributor

/hold for others to take a look. Feel free to unhold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@liu-cong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2025
@kfswain
Copy link
Collaborator

kfswain commented Mar 24, 2025

/lgtm

@kfswain
Copy link
Collaborator

kfswain commented Mar 24, 2025

/unhold

Thanks for tuning this!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
#
# If this value is updated, be sure to update terminationGracePeriodSeconds.
#
seconds: 30
Member

I am thinking this is not needed.

  1. vLLM naturally uses uvicorn to run the API server, which supports denying new requests while waiting for received requests to finish. There is a graceful timeout, timeout_graceful_shutdown, but vLLM does not seem to configure it; I'm not sure whether that means no timeout or an immediate shutdown.
  2. The gateway can remove the vLLM instance being deleted by checking the deletionTimestamp. This is what Istio does now.

Contributor Author

vLLM in my testing by default continues to process existing requests after SIGTERM while denying new requests, which is one prereq for graceful shutdown.

The other prereq: load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config. In clouds or on-premises, that can take tens of seconds to minutes. Even with an HA configuration and very fast propagation (pod terminated -> endpoint updated -> gateway controller reacts -> distributed via xDS -> all replicas react) it could still be 5-10s. This interval needs to be set to the longest time period any supported gateway in front of the inference pool needs to react to propagation of endpoint data, which is 30s for GKE Gateway right now.
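
Spelled out as arithmetic (the drain and buffer values are illustrative assumptions, not measurements from this PR):

# terminationGracePeriodSeconds =
#     preStop sleep (longest gateway propagation: 30s for GKE Gateway)
#   + maximum in-flight request drain (e.g. 60s)
#   + safety buffer (e.g. 30s)
terminationGracePeriodSeconds: 120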

Contributor Author

I set it to 30s because I expected most direct Envoy implementations to respond faster, but they will still not be able to reliably propagate in <5s (with large endpoint slices it might take the endpoint controller several seconds or more to react to the pod's deletion timestamp being set).

Member

The other prereq: Load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config.

Yes, this is totally correct. Optionally, the gateway could handle this as a fallback.

@ahg-g
Copy link
Contributor

ahg-g commented Mar 25, 2025

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 731f244 into kubernetes-sigs:main Mar 25, 2025
8 checks passed