Configure the vllm deployment with best practices for startup #550
Conversation
Force-pushed 0bdaaf8 → 310ffad
exec:
  # Verify that our core model is loaded before we consider startup successful
  command:
    - /bin/bash
Just curious, I've never used this before; making sure I understand:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#execaction-v1-core suggests this runs in the container root. So this wouldn't work OOB on something like an alpine-based image... but that's fine because the kubelet is still the probe manager, so even if this just instafailed, in 10 min it would fail and crashloop and an admin would very likely come investigate. Is that about right?
It runs in the container's namespaces - so it's as if it were run in the container.

If it were in a "distroless" / "single binary" container, the exec action would fail because bash would not be available in the container image. In general, single binaries are expected to embed any of this function in their code, so you'd generally be calling an http endpoint, or a subcommand / flag on the binary itself (think `mytool check health`) that encodes this logic. An advantage of exec actions is that nothing is bound to the network; a disadvantage is that they are more expensive, because setting up the namespaces takes a few tens of ms.

If someone rolled out a new version of the vllm image that dropped everything except the single binary, the deployment would stop because the new pod wouldn't start (it would crashloop).
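As an illustration of that subcommand pattern (`mytool` is a placeholder, not a real binary): for a single-binary image, the probe would exec the binary's own health subcommand directly, with no shell in between.

```yaml
startupProbe:
  exec:
    # The kubelet runs this directly inside the container's namespaces;
    # no bash or other shell needs to exist in the image.
    command: ["/mytool", "check", "health"]
```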
Why not use httpGet?
Because if the response doesn't contain hf we shouldn't pass startup (the outcome we want is "don't consider a model server that is built around a base model ready until the base model is ready"). An httpGet probe can only check the status code, not the response body.
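For reference, a minimal sketch of such an exec-based startup probe. The port (8000), the model id, and the thresholds are illustrative assumptions, not values taken from this PR:

```yaml
startupProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Only pass startup once the served model list includes the base model.
        set -euo pipefail
        curl -sf http://localhost:8000/v1/models | grep -q '"my-base-model"'
  periodSeconds: 1
  failureThreshold: 600  # ~10 min of model-load time before the pod crashloops
```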
Force-pushed 653c54d → 05d7a27
Force-pushed 05d7a27 → 3d85415
Force-pushed 3d85415 → 833f942
/hold
Still testing the timeout.
Force-pushed 833f942 → 01b2f86
/hold cancel
The 30s timeout was safe on GKE; I suspect EG/kgateway will be able to handle that (fewer layers to program?).
Force-pushed ae983c4 → 08e2175
We want to recommend best practices for deployments of model servers under an InferencePool. This uses the need to gracefully drain without client-visible errors during rollout ("hitless" updates) to annotate the yaml with strong opinions on best practices. This configuration was experimentally verified against the GKE Inference Gateway, whose propagation delay should be longer than that of other gateways.
Force-pushed 08e2175 → af61544
/hold
For others to take a look. Feel free to unhold.
/lgtm

/lgtm
/unhold
Thanks for tuning this!
#
# If this value is updated, be sure to update terminationGracePeriodSeconds.
#
seconds: 30
I am thinking this is not needed.

- vLLM naturally uses uvicorn to run the API server, which supports denying new requests while waiting for already-received requests to finish; uvicorn has a graceful timeout (`timeout_graceful_shutdown`) for this. But it seems vLLM does not configure it, and I am not sure whether that means no timeout or immediate shutdown.
- The gateway can remove the vLLM instance being deleted by checking the deletionTimestamp. This is what Istio does now.
vLLM in my testing by default continues to process existing requests after SIGTERM while denying new requests, which is one prereq for graceful shutdown.

The other prereq: load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config. In clouds or on-premise, that can be tens of seconds to minutes. Even with an HA configuration and very fast propagation (pod terminated -> endpoint updated -> gateway controller reacts -> distributed via xDS -> all replicas react), it could still be 5-10s. This interval needs to be set to the longest time period that any supported gateway in front of the inference pool needs to react to propagation of endpoint data, which is 30s for GKE Gateway right now.
I set it to 30s because I expected most direct Envoy implementations to respond faster, but they will still not be able to reliably propagate in under 5s (with large endpoint slices it might take the endpoint controller several seconds or more to react to the pod deletionTimestamp being set).
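A minimal sketch of how these pieces might fit together in the pod spec; the 120s drain budget on top of the 30s propagation window is an illustrative assumption:

```yaml
spec:
  containers:
    - name: vllm
      lifecycle:
        preStop:
          # Keep serving while endpoint removal propagates to every gateway
          # replica (30s was the worst case measured for GKE Gateway).
          sleep:
            seconds: 30
  # Must cover the preStop sleep plus however long in-flight requests may
  # take to drain after vLLM receives SIGTERM.
  terminationGracePeriodSeconds: 150
```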
> The other prereq: load balancers in front of terminating pods still send requests until the termination propagates to them and they can roll out that config.

Yes, this is totally correct. An optional approach would be to make the gateway handle the fallback.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
Model servers are production-facing services, and should be configured in Kubernetes to not drop traffic or cause request errors during graceful rollout or predictable interruption (such as anticipated node maintenance by an admin).

Configure the default gpu_deployment.yaml example to accomplish hitless updates and graceful shutdown, and document why these values are chosen, so that others can determine best practices for model servers behind InferencePools. We should also update our documentation to describe these best practices once this configuration has been more widely validated.
This PR:

- Sets `enableServiceLinks: false`, which prevents a crash loop if the user deploys a `vllm` service in their namespace.

Some of the documentation here should be translated to vLLM upstream charts and documentation - these numbers are not arbitrary and can be widely applied.
This change leverages a Kubernetes 1.30 beta feature (`preStop.sleep`), but 1.30 is a year old and I think it's a reasonable floor for examples of best practices.
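For context, a sketch of the `enableServiceLinks` change. The crash-loop mechanism described in the comment is an assumption based on standard Kubernetes service links (a Service named `vllm` injects env vars such as `VLLM_PORT=tcp://...`, which vLLM can misparse as its own configuration):

```yaml
spec:
  template:
    spec:
      # Stop Kubernetes from injecting docker-link-style env vars for every
      # Service in the namespace; a "vllm" Service would otherwise set
      # VLLM_PORT=tcp://<ip>:<port> and crash-loop the server.
      enableServiceLinks: false
```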