Garbage collect obsolete vLLM instances #428

waltforme wants to merge 3 commits into llm-d-incubation:main
Conversation
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
…unbinding
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Pull request overview
This PR adds controller-side garbage collection for vLLM instances that become obsolete after an InferenceServerConfig (ISC) change, covering both sleeping instances (cleanup after ISC update) and awake instances (delete on unbind rather than sleeping). It also extends the E2E suite to validate these behaviors.
Changes:
- Add async cleanup of obsolete sleeping launcher instances triggered by ISC spec updates.
- On unbind, detect obsolete awake launcher instances and delete them instead of sleeping.
- Extend E2E test-cases to validate obsolete-instance deletion scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/e2e/test-cases.sh | Adds helper + new E2E cases validating deletion of obsolete sleeping/awake instances. |
| pkg/controller/dual-pods/inference-server.go | Adds “delete obsolete instance on unbind” logic; threads context.Context into HTTP GET helpers. |
| pkg/controller/dual-pods/controller.go | Triggers obsolete sleeping-instance cleanup on ISC updates; adds internal state deletion helper. |
```go
resp, err := http.Post(sleepURL, "", nil)
if err != nil {
	return fmt.Errorf("failed to put provider %q to sleep, POST %s got error: %w", serverDat.ProvidingPodName, sleepURL, err)
}
```
ensureUnbound calls http.Post but doesn’t close resp.Body. This can leak connections/file descriptors over time in a long-running controller. Ensure resp.Body.Close() is called (typically via defer) after a successful POST, even when only checking StatusCode.
Suggested change:

```go
}
defer func() {
	_, _ = io.Copy(io.Discard, resp.Body)
	_ = resp.Body.Close()
}()
```
```shell
get_launcher_total_instances() {
    local launcher_pod="$1"
    kubectl exec -n "$NS" "$launcher_pod" -- python3 -c 'import json,urllib.request; print(json.load(urllib.request.urlopen("http://127.0.0.1:8001/v2/vllm/instances"))["total_instances"])'
}
```
get_launcher_total_instances uses a Python urlopen() with no timeout. Since it’s invoked inside expect loops, a hung HTTP call (or a stuck kubectl exec) can stall the entire E2E run indefinitely. Consider adding an explicit timeout (e.g., `urlopen(..., timeout=...)`) and/or wrapping kubectl exec with `timeout` / `--request-timeout`.
Suggested change:

```shell
kubectl --request-timeout=15s exec -n "$NS" "$launcher_pod" -- python3 -c 'import json,urllib.request; print(json.load(urllib.request.urlopen("http://127.0.0.1:8001/v2/vllm/instances", timeout=10))["total_instances"])'
```
```shell
# Mutate isc in a hash-relevant way so its sleeping instance becomes obsolete.
kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge -p='{"spec":{"modelServerConfig":{"options":"--model HuggingFaceTB/SmolLM2-360M-Instruct --served-model-name obsolete-after-update"}}}'

expect '[ "$(get_launcher_total_instances "$launcher1")" == "$((old_total_instances - 1))" ]'
```
This test mutates the primary InferenceServerConfig ($isc) but never restores it. Even if the current script doesn’t reuse $isc, leaving shared test objects mutated makes the E2E script less self-contained and can complicate later cases/debugging. Consider capturing the original options (as done for $isc2) and restoring them after the assertion.
Suggested change:

```shell
original_isc_options="$(kubectl get inferenceserverconfig "$isc" -n "$NS" -o jsonpath='{.spec.modelServerConfig.options}')"
restore_primary_isc_options() {
    kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge --patch "$(cat <<EOF
spec:
  modelServerConfig:
    options: >-
      $original_isc_options
EOF
)"
}
trap 'restore_primary_isc_options' EXIT
# Mutate isc in a hash-relevant way so its sleeping instance becomes obsolete.
kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge -p='{"spec":{"modelServerConfig":{"options":"--model HuggingFaceTB/SmolLM2-360M-Instruct --served-model-name obsolete-after-update"}}}'
expect '[ "$(get_launcher_total_instances "$launcher1")" == "$((old_total_instances - 1))" ]'
restore_primary_isc_options
trap - EXIT
```
```go
ctl.mutex.Lock()
defer ctl.mutex.Unlock()
nodeDat := ctl.nodeNameToData[nodeName]
if nodeDat == nil {
	return
}
launcherDat := nodeDat.Launchers[launcherPodName]
if launcherDat == nil {
	return
}
delete(launcherDat.Instances, instanceID)
```
deleteInstanceFromLauncherData takes ctl.mutex and then reads nodeDat.Launchers / mutates launcherDat.Instances, but those maps are also accessed elsewhere without ctl.mutex (they’re intended to be accessed only from the nodeItem.process call hierarchy). Locking ctl.mutex here doesn’t make those maps safe and can still race with concurrent node processing, especially now that cleanup can run async. Align on one synchronization strategy (e.g., only touch Launchers/Instances inside nodeItem processing, or guard them with a shared mutex used by all code paths).
Suggested change:

```go
// Intentionally do not touch nodeDat.Launchers / launcherDat.Instances here.
// Those maps are maintained by the node-processing path and are also accessed
// without ctl.mutex elsewhere, so mutating them from async cleanup would race.
//
// The launcher instance has already been deleted from the cluster; controller
// state will be refreshed by the regular reconciliation flow.
_ = nodeName
_ = launcherPodName
_ = instanceID
```
```go
	return false
}
instState, err := lClient.GetInstanceState(ctx, serverDat.InstanceID)
if err != nil {
```
maybeDeleteObsoleteInstance returns false for any GetInstanceState error. If the instance was already deleted (e.g., by concurrent cleanup) and the launcher returns 404, ensureUnbound will fall through to the normal sleep path and likely fail the /sleep POST (connection refused), causing retries and noisy errors. Treat an instance-not-found response as “already gone”: update internal launcher state accordingly and skip the sleep call.
Suggested change:

```go
if err != nil {
	if IsInstanceNotFoundError(err) {
		nodeName, _ := getProviderNodeName(providingPod)
		ctl.deleteInstanceFromLauncherData(nodeName, providingPod.Name, serverDat.InstanceID)
		logger.V(2).Info("Instance already absent during unbinding; removed stale launcher state",
			"instanceID", serverDat.InstanceID)
		return true
	}
```
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Good. Q4 and A2 are stated colloquially in terms of one relevant edge case, but the proper framing of the question and answer is state-based. What, if anything, should be done about a vLLM instance that was created by a non-current revision of the relevant InferenceServerConfig? Answer: deletion, if the vLLM instance is sleeping.
```diff
- // There are two types of item in the controller's work queue.
+ // There are three types of item in the controller's work queue.
```
This comment about queuing was already outdated before this PR: it mentions neither cmItem, unboundLauncherItem, nor nodeItem and its nested structure.
This PR carries out 'A2 to Q4' in #201 (comment).
This PR does more than that: it also covers corresponding instances that are not sleeping when an ISC changes; they are garbage collected at unbinding time.