
Configure the vllm deployment with best practices for startup #550

Merged

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 20, 2025

Model servers are production-facing services and should be configured in Kubernetes not to drop traffic or cause request errors during graceful rollout or predictable interruption (such as anticipated node maintenance by an admin).

Configure the default gpu_deployment.yaml example to accomplish hitless updates and graceful shutdown, and document why these values were chosen so that others can derive best practices for model servers behind InferencePools. We should also update our documentation to describe these best practices once this configuration has been more widely validated.

This PR:

  • Tunes liveness and readiness checks based on the vLLM implementation and best practices
  • Adds a startupProbe (see Improve vLLM upstream health checks to only pass when models are servable #558) that blocks regular liveness and readiness from being run until the model server has actually loaded the model, which prevents requests from failing when new pods are created
  • Adds a preStop hook with a sleep that is tuned to the longest propagation period we expect before any supported gateway takes us out of rotation
  • Tunes the terminationGracePeriodSeconds to match the expected preStop plus maximum drain latency plus safety buffer
  • Annotates all the fields with a description of why these values were chosen, so others can learn from them when new model servers are added.
  • Sets enableServiceLinks: false, which prevents a crash loop if the user deploys a Service named vllm in their namespace (the injected VLLM_PORT service-link environment variable conflicts with vLLM's own configuration). A consolidated sketch of these fields follows this list.
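
Taken together, the tuned fields look roughly like the sketch below. The probe timings, the grace-period value, and the model-check command are illustrative assumptions; only the 30s preStop sleep is stated explicitly later in this thread.

spec:
  template:
    spec:
      # Avoid service-link env vars (e.g. VLLM_PORT) that break vLLM startup.
      enableServiceLinks: false
      # preStop sleep + maximum drain latency + safety buffer (illustrative).
      terminationGracePeriodSeconds: 120
      containers:
      - name: vllm
        startupProbe:
          # Block liveness/readiness until the model is actually servable.
          exec:
            command: [/bin/bash, -c, 'curl -s http://localhost:8000/v1/models | grep -q <base-model>']
          periodSeconds: 1
          failureThreshold: 600   # allow up to ~10 minutes for model load
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
          failureThreshold: 1
        lifecycle:
          preStop:
            # Wait out the longest gateway propagation period before SIGTERM.
            sleep:
              seconds: 30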

Some of the documentation here should be translated to the vLLM upstream charts and documentation; these numbers are not arbitrary and can be widely applied.

This change leverages a 1.30 beta feature (preStop.sleep), but 1.30 is a year old and I think it's a reasonable floor for examples of best practices.
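
For clusters older than 1.30 (or with the PodLifecycleSleepAction feature gate disabled), a sketch of an equivalent fallback, assuming a sleep binary exists in the image:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sleep", "30"]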

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 20, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 20, 2025

netlify bot commented Mar 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: af61544
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67e1ad3a7894b90007824716
😎 Deploy Preview: https://deploy-preview-550--gateway-api-inference-extension.netlify.app

@smarterclayton smarterclayton force-pushed the failure_thresholds branch 4 times, most recently from 0bdaaf8 to 310ffad Compare March 20, 2025 19:26
exec:
# Verify that our core model is loaded before we consider startup successful
command:
- /bin/bash
Collaborator

Just curious, I've never used this before; making sure I understand:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#execaction-v1-core suggests this runs in the container root. So this wouldn't work OOB on something like an alpine based image... but that's fine because the kubelet is still the probe manager, so even if this just instafailed, in 10 min it would fail & crashloop and an admin would very likely come investigate. Is that about right?

Contributor Author

It runs in the container's namespaces, so it's as if it were run inside the container.

If it were in a "distroless" / "single binary" container, the exec action would fail because bash would not be available in the container image. In general, single binaries are expected to embed any of this function in their code, so you'd generally be calling an http endpoint, or a subcommand / flag on the binary itself (think mytool check health that encodes this logic). An advantage of exec actions is nothing is bound to the network, a disadvantage is that they are more expensive because setting up the namespaces takes a few tens of ms.

If someone rolled out a new version of vllm image that dropped everything except the single binary, the deployment would stop because the new pod wouldn't start (would crashloop).

Contributor

Why not use httpGet?

Contributor Author

Because if the response doesn't contain hf we shouldn't pass the probe (the outcome we want is: "don't consider a model server that is built around a base model ready until the base model is ready"). That requires inspecting the response body, which httpGet cannot do.
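
A sketch of such a body-inspecting check (the model identifier and port are hypothetical placeholders; the merged manifest's actual command may differ):

startupProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # Only pass once the base model is listed as servable.
    - curl -s http://localhost:8000/v1/models | grep -q '<base-model-id>'
  periodSeconds: 1
  failureThreshold: 600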

@smarterclayton smarterclayton force-pushed the failure_thresholds branch 3 times, most recently from 653c54d to 05d7a27 Compare March 20, 2025 20:28
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@smarterclayton smarterclayton changed the title WIP - Configure the vllm deployment with best practices for startup Configure the vllm deployment with best practices for startup Mar 21, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
@smarterclayton
Copy link
Contributor Author

/hold

still testing the timeout

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2025
@smarterclayton
Copy link
Contributor Author

/hold cancel

The 30s timeout was safe on GKE; I suspect EG/kgateway will be able to handle that (fewer layers to program?).

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2025
@smarterclayton smarterclayton force-pushed the failure_thresholds branch 2 times, most recently from ae983c4 to 08e2175 Compare March 21, 2025 20:20
We want to recommend best practices for deployments of model servers
under an InferencePool. Use the need to gracefully drain without
client visible errors during rollout ("hitless" updates) to
annotate the yaml with strong opinions on best practices.

This configuration was experimentally verified with the GKE Inference
Gateway, whose propagation delay should be longer than that of other gateways.
@liu-cong
Copy link
Contributor

/hold for others to take a look. Feel free to unhold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@liu-cong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2025
@kfswain
Copy link
Collaborator

kfswain commented Mar 24, 2025

/lgtm

@kfswain
Copy link
Collaborator

kfswain commented Mar 24, 2025

/unhold

Thanks for tuning this!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
#
# If this value is updated, be sure to update terminationGracePeriodSeconds.
#
seconds: 30
Member

I am thinking this is not needed.

  1. vLLM naturally uses uvicorn to run the API server, which supports denying new requests while waiting for received requests to finish. There is a graceful timeout, timeout_graceful_shutdown, but vLLM does not seem to configure it; I'm not sure whether that means no timeout or an immediate shutdown.
  2. The gateway can remove the vLLM instance being deleted by checking the deletionTimestamp. This is what Istio does now.

Contributor Author

vLLM in my testing by default continues to process existing requests after SIGTERM while denying new requests, which is one prereq for graceful shutdown.

The other prereq: load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config. In clouds or on-premises, that can take tens of seconds to minutes. Even with an HA configuration and very fast propagation (pod terminated -> endpoint updated -> gateway controller reacts -> distributed via xDS -> all replicas react) it could still be 5-10s. This interval needs to be set to the longest time period any supported gateway in front of the inference pool needs to react to propagation of endpoint data, which is 30s for GKE Gateway right now.
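
Spelled out as arithmetic (the drain and buffer values are illustrative assumptions, not measurements from this PR):

# terminationGracePeriodSeconds =
#     preStop sleep (longest gateway propagation: 30s for GKE Gateway)
#   + maximum in-flight request drain (e.g. 60s)
#   + safety buffer (e.g. 30s)
terminationGracePeriodSeconds: 120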

Contributor Author

I set it to 30s because I expected most direct Envoy implementations to respond faster, but they will still not be able to reliably propagate in <5s (with large endpoint slices it might take the endpoint controller several seconds or more to react to the pod's deletion timestamp being set).

Member

The other prereq: Load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config.

Yes, this is totally correct. Optionally, the gateway could handle this as a fallback.

@ahg-g
Copy link
Contributor

ahg-g commented Mar 25, 2025

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 731f244 into kubernetes-sigs:main Mar 25, 2025
8 checks passed