
Garbage collect obsolete vLLM instances#428

Open
waltforme wants to merge 3 commits intollm-d-incubation:mainfrom
waltforme:gc-insts

Conversation

@waltforme
Collaborator

@waltforme waltforme commented Apr 15, 2026

This PR carries out 'A2 to Q4' in #201 (comment).

This PR does more than that: it also covers corresponding instances that are not sleeping when an ISC changes --- they are garbage collected at unbinding time.

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
…unbinding

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Copilot AI review requested due to automatic review settings April 15, 2026 04:22
@waltforme waltforme marked this pull request as draft April 15, 2026 04:23
Contributor

Copilot AI left a comment


Pull request overview

This PR adds controller-side garbage collection for vLLM instances that become obsolete after an InferenceServerConfig (ISC) change, covering both sleeping instances (cleanup after ISC update) and awake instances (delete on unbind rather than sleeping). It also extends the E2E suite to validate these behaviors.

Changes:

  • Add async cleanup of obsolete sleeping launcher instances triggered by ISC spec updates.
  • On unbind, detect obsolete awake launcher instances and delete them instead of sleeping.
  • Extend E2E test-cases to validate obsolete-instance deletion scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
test/e2e/test-cases.sh Adds helper + new E2E cases validating deletion of obsolete sleeping/awake instances.
pkg/controller/dual-pods/inference-server.go Adds “delete obsolete instance on unbind” logic; threads context.Context into HTTP GET helpers.
pkg/controller/dual-pods/controller.go Triggers obsolete sleeping-instance cleanup on ISC updates; adds internal state deletion helper.

Comment thread pkg/controller/dual-pods/inference-server.go
resp, err := http.Post(sleepURL, "", nil)
if err != nil {
return fmt.Errorf("failed to put provider %q to sleep, POST %s got error: %w", serverDat.ProvidingPodName, sleepURL, err)
}

Copilot AI Apr 15, 2026


ensureUnbound calls http.Post but doesn’t close resp.Body. This can leak connections/file descriptors over time in a long-running controller. Ensure resp.Body.Close() is called (typically via defer) after a successful POST, even when only checking StatusCode.

Suggested change
}
defer func() {
	_, _ = io.Copy(io.Discard, resp.Body)
	_ = resp.Body.Close()
}()

Comment thread test/e2e/test-cases.sh

get_launcher_total_instances() {
	local launcher_pod="$1"
	kubectl exec -n "$NS" "$launcher_pod" -- python3 -c 'import json,urllib.request; print(json.load(urllib.request.urlopen("http://127.0.0.1:8001/v2/vllm/instances"))["total_instances"])'
}

Copilot AI Apr 15, 2026


get_launcher_total_instances uses Python's urlopen() with no timeout. Since it's invoked inside expect loops, a hung HTTP call (or a stuck kubectl exec) can stall the entire E2E run indefinitely. Consider adding an explicit timeout (e.g., urlopen(..., timeout=...)) and/or wrapping kubectl exec with timeout / --request-timeout.

Suggested change
kubectl exec -n "$NS" "$launcher_pod" -- python3 -c 'import json,urllib.request; print(json.load(urllib.request.urlopen("http://127.0.0.1:8001/v2/vllm/instances"))["total_instances"])'
kubectl --request-timeout=15s exec -n "$NS" "$launcher_pod" -- python3 -c 'import json,urllib.request; print(json.load(urllib.request.urlopen("http://127.0.0.1:8001/v2/vllm/instances", timeout=10))["total_instances"])'

Comment thread test/e2e/test-cases.sh
Comment on lines +478 to +483

# Mutate isc in a hash-relevant way so its sleeping instance becomes obsolete.
kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge -p='{"spec":{"modelServerConfig":{"options":"--model HuggingFaceTB/SmolLM2-360M-Instruct --served-model-name obsolete-after-update"}}}'

expect '[ "$(get_launcher_total_instances "$launcher1")" == "$((old_total_instances - 1))" ]'


Copilot AI Apr 15, 2026


This test mutates the primary InferenceServerConfig ($isc) but never restores it. Even if the current script doesn't reuse $isc, leaving shared test objects mutated makes the E2E script less self-contained and can complicate later cases/debugging. Consider capturing the original options (as done for $isc2) and restoring them after the assertion.

Suggested change
original_isc_options="$(kubectl get inferenceserverconfig "$isc" -n "$NS" -o jsonpath='{.spec.modelServerConfig.options}')"
restore_primary_isc_options() {
	kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge --patch "$(cat <<EOF
spec:
  modelServerConfig:
    options: >-
      $original_isc_options
EOF
)"
}
trap 'restore_primary_isc_options' EXIT
# Mutate isc in a hash-relevant way so its sleeping instance becomes obsolete.
kubectl patch inferenceserverconfig "$isc" -n "$NS" --type=merge -p='{"spec":{"modelServerConfig":{"options":"--model HuggingFaceTB/SmolLM2-360M-Instruct --served-model-name obsolete-after-update"}}}'
expect '[ "$(get_launcher_total_instances "$launcher1")" == "$((old_total_instances - 1))" ]'
restore_primary_isc_options
trap - EXIT

Comment thread pkg/controller/dual-pods/controller.go Outdated
Comment on lines +802 to +812
ctl.mutex.Lock()
defer ctl.mutex.Unlock()
nodeDat := ctl.nodeNameToData[nodeName]
if nodeDat == nil {
	return
}
launcherDat := nodeDat.Launchers[launcherPodName]
if launcherDat == nil {
	return
}
delete(launcherDat.Instances, instanceID)

Copilot AI Apr 15, 2026


deleteInstanceFromLauncherData takes ctl.mutex and then reads nodeDat.Launchers / mutates launcherDat.Instances, but those maps are also accessed elsewhere without ctl.mutex (they’re intended to be accessed only from the nodeItem.process call hierarchy). Locking ctl.mutex here doesn’t make those maps safe and can still race with concurrent node processing, especially now that cleanup can run async. Align on one synchronization strategy (e.g., only touch Launchers/Instances inside nodeItem processing, or guard them with a shared mutex used by all code paths).

Suggested change
// Intentionally do not touch nodeDat.Launchers / launcherDat.Instances here.
// Those maps are maintained by the node-processing path and are also accessed
// without ctl.mutex elsewhere, so mutating them from async cleanup would race.
//
// The launcher instance has already been deleted from the cluster; controller
// state will be refreshed by the regular reconciliation flow.
_ = nodeName
_ = launcherPodName
_ = instanceID

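The "shared mutex used by all code paths" option mentioned above can be sketched as follows. The types and names (registry, nodeData, launcherData) are illustrative stand-ins, not the controller's actual structures; the point is that every reader and writer of the nested maps takes the same lock, so the node-processing path and async cleanup cannot race.

```go
package main

import (
	"fmt"
	"sync"
)

type launcherData struct {
	Instances map[string]struct{}
}

type nodeData struct {
	Launchers map[string]*launcherData
}

// registry guards the whole node -> launcher -> instance map tree with one
// mutex that every access path must take.
type registry struct {
	mu    sync.Mutex
	nodes map[string]*nodeData
}

func newRegistry() *registry {
	return &registry{nodes: map[string]*nodeData{}}
}

// addInstance models the node-processing path; it holds r.mu for every map access.
func (r *registry) addInstance(node, launcher, id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	nd, ok := r.nodes[node]
	if !ok {
		nd = &nodeData{Launchers: map[string]*launcherData{}}
		r.nodes[node] = nd
	}
	ld, ok := nd.Launchers[launcher]
	if !ok {
		ld = &launcherData{Instances: map[string]struct{}{}}
		nd.Launchers[launcher] = ld
	}
	ld.Instances[id] = struct{}{}
}

// deleteInstance models the async-cleanup path; it takes the same lock,
// so it cannot race with addInstance. Missing node/launcher is a no-op.
func (r *registry) deleteInstance(node, launcher, id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	nd, ok := r.nodes[node]
	if !ok {
		return
	}
	ld, ok := nd.Launchers[launcher]
	if !ok {
		return
	}
	delete(ld.Instances, id)
}

// countInstances is a locked reader used to observe the state.
func (r *registry) countInstances(node, launcher string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	nd, ok := r.nodes[node]
	if !ok {
		return 0
	}
	ld, ok := nd.Launchers[launcher]
	if !ok {
		return 0
	}
	return len(ld.Instances)
}

func main() {
	r := newRegistry()
	r.addInstance("node1", "launcher1", "inst-a")
	r.addInstance("node1", "launcher1", "inst-b")
	r.deleteInstance("node1", "launcher1", "inst-a")
	fmt.Println("instances remaining:", r.countInstances("node1", "launcher1"))
}
```

The alternative the comment names, confining all map access to the nodeItem processing path, avoids the lock entirely but requires the async cleanup to enqueue work instead of mutating state directly.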
return false
}
instState, err := lClient.GetInstanceState(ctx, serverDat.InstanceID)
if err != nil {

Copilot AI Apr 15, 2026


maybeDeleteObsoleteInstance returns false for any GetInstanceState error. If the instance was already deleted (e.g., by concurrent cleanup) and the launcher returns 404, ensureUnbound will fall through to the normal sleep path and likely fail the /sleep POST (connection refused), causing retries and noisy errors. Treat an instance-not-found response as “already gone”: update internal launcher state accordingly and skip the sleep call.

Suggested change
if err != nil {
	if IsInstanceNotFoundError(err) {
		nodeName, _ := getProviderNodeName(providingPod)
		ctl.deleteInstanceFromLauncherData(nodeName, providingPod.Name, serverDat.InstanceID)
		logger.V(2).Info("Instance already absent during unbinding; removed stale launcher state",
			"instanceID", serverDat.InstanceID)
		return true
	}

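The "treat not-found as already gone" idiom above is commonly built on a sentinel error plus errors.Is. A minimal sketch, with ErrInstanceNotFound and getState as hypothetical stand-ins for the launcher client's real API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInstanceNotFound is a sentinel the (hypothetical) launcher client
// would return when an instance no longer exists, e.g. on an HTTP 404.
var ErrInstanceNotFound = errors.New("instance not found")

// getState simulates the launcher client: a deleted instance yields an
// error wrapping the sentinel via %w, so callers can match it.
func getState(id string, known map[string]string) (string, error) {
	st, ok := known[id]
	if !ok {
		return "", fmt.Errorf("GET state for %q: %w", id, ErrInstanceNotFound)
	}
	return st, nil
}

// alreadyGone reports whether the unbind path can skip the sleep call
// because the instance is already absent.
func alreadyGone(err error) bool {
	return errors.Is(err, ErrInstanceNotFound)
}

func main() {
	known := map[string]string{"inst-1": "awake"}
	_, err := getState("inst-2", known)
	fmt.Println(alreadyGone(err)) // prints: true
}
```

Wrapping with %w rather than embedding the message in a plain string is what makes errors.Is work here; any other GetInstanceState failure (timeout, 5xx) still falls through to the normal retry path.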
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme waltforme marked this pull request as ready for review April 16, 2026 18:12
@MikeSpreitzer
Collaborator

This PR carries out 'A2 to Q4' in #201 (comment).

This PR does more than that: it also covers corresponding instances that are not sleeping when an ISC changes --- they are garbage collected at unbinding time.

Good. Q4 and A2 are stated colloquially in terms of one relevant edge, but the proper framing of the question and answer is state-based. What, if anything, should be done about a vllm instance that was created by a non-current revision of the relevant InferenceServerConfig? Answer: deletion, if the vllm instance is sleeping.

Comment on lines -78 to +79
// There are two types of item in the controller's work queue.
// There are three types of item in the controller's work queue.
Collaborator


This commentary about queuing was already outdated before this PR: it mentions neither cmItem, unboundLauncherItem, nor nodeItem and its nested structure.


Labels

None yet

Projects

None yet


3 participants