Add runbook for VirtLauncherPodsStuckFailed alert #330
Merged: sradco merged 1 commit into kubevirt:main from sradco:add_runbook_for_VirtLauncherPodsStuckFailed on Dec 15, 2025
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| # VirtLauncherPodsStuckFailed | ||
|
|
||
| ## Meaning | ||
|
|
||
| This alert fires when a large number of virt-launcher Pods remain in the `Failed` phase. | ||
|
|
||
| Condition: `cluster:kubevirt_virt_launcher_failed:count >= 200` for 10 minutes. | ||
| Virt-launcher Pods host VM workloads, and mass failures can indicate migration loops, | ||
| image/network/storage issues, or control-plane regressions. | ||
|
|
||
| ## Impact | ||
|
|
||
| VMs and the cluster control plane may be affected: | ||
|
|
||
| - API server and etcd pressure (large object lists/watches, increased latency) | ||
| - Controller and scheduler slowdown (reconciliation over huge Pod sets) | ||
| - Monitoring cardinality spikes (kube-state-metrics and Prometheus load) | ||
| - Operational churn (re-creation loops, CNI and storage attach/detach) | ||
| - Triage noise and SLO risk (timeouts on list operations, noisy dashboards) | ||
|
|
||
| ## Diagnosis | ||
|
|
||
| 1. Confirm scope and distribution: | ||
|
|
||
| ```promql | ||
| cluster:kubevirt_virt_launcher_failed:count | ||
| ``` | ||
| ```promql | ||
| count by (namespace) (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1) | ||
| ``` | ||
| ```promql | ||
| count by (node) ( | ||
| (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1) | ||
| * on(namespace, pod) group_left(node) kube_pod_info{pod=~"virt-launcher-.*", node!=""} | ||
| ) | ||
| ``` | ||
| ```promql | ||
| topk(5, count by (reason) (kube_pod_container_status_last_terminated_reason{pod=~"virt-launcher-.*"} == 1)) | ||
| ``` | ||
|
|
||
| 2. Sample failed Pods and events: | ||
|
|
||
| ```bash | ||
| # List a few failed virt-launcher pods cluster-wide | ||
| kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | head -n 20 | ||
| ``` | ||
| ```bash | ||
| # Inspect events for a representative pod (image, CNI, storage, or other useful errors) | ||
| kubectl -n <namespace> describe pod <virt-launcher-pod> | sed -n '/Events/,$p' | ||
| ``` | ||
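To see how the sampled failures are spread across tenants, the pod listing can be tallied per namespace. The following is a minimal sketch, not part of the runbook: the function name `count_by_namespace` is hypothetical, and it assumes input in the `kubectl get pods -A --no-headers` layout, where the first column is the namespace.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the runbook).
# Reads `kubectl get pods -A --no-headers` style lines on stdin, where the
# first column is the namespace, and prints "<count> <namespace>" with the
# largest counts first.
count_by_namespace() {
  awk 'NF { counts[$1]++ } END { for (ns in counts) print counts[ns], ns }' \
    | sort -k1,1nr -k2,2
}

# Example usage against a live cluster (left commented out in this sketch):
# kubectl get pods -A -l kubevirt.io=virt-launcher \
#   --field-selector=status.phase=Failed --no-headers | count_by_namespace
```

A single namespace dominating the output points toward a tenant-specific cause, while an even spread suggests a cluster-wide issue.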
|
|
||
| 3. Check for migration storms: | ||
|
|
||
| ```bash | ||
| kubectl get vmim -A | ||
| ``` | ||
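On a busy cluster the raw migration list can be long, so a per-phase tally may be easier to read. This is a sketch under stated assumptions: `count_by_phase` is a hypothetical helper, and it expects one `<namespace> <phase>` pair per line, which the commented `custom-columns` invocation is assumed to produce from each migration's `.status.phase`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the runbook).
# Reads "<namespace> <phase>" lines on stdin and prints "<phase> <count>",
# sorted by phase name.
count_by_phase() {
  awk 'NF >= 2 { counts[$2]++ } END { for (p in counts) print p, counts[p] }' \
    | sort
}

# Example usage against a live cluster (left commented out in this sketch):
# kubectl get vmim -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,PHASE:.status.phase | count_by_phase
```

Many migrations in non-terminal phases alongside a growing failed count would be consistent with a migration loop.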
|
|
||
| 4. Check control plane and component logs (look for spikes/errors): | ||
|
|
||
| ```bash | ||
| NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')" | ||
| ``` | ||
| ```bash | ||
| kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-controller --tail=200 | ||
| ``` | ||
| ```bash | ||
| kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-handler --tail=200 | ||
| ``` | ||
|
|
||
| 5. Infrastructure checks (common causes): | ||
|
|
||
| - Image pulls: registry reachability/credentials; ImagePullBackOff events | ||
| - Network: CNI errors/timeouts in Pod events and node logs | ||
| - Storage: volume attach/mount errors in Pod events and CSI logs | ||
|
|
||
| ## Mitigation | ||
|
|
||
| 1. Reduce blast radius: | ||
|
|
||
| - Migration loop: cancel in-flight migrations (scope as needed) | ||
|
|
||
| ```bash | ||
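| # Deletes every VirtualMachineInstanceMigration cluster-wide; use -n <namespace> instead of -A to narrow the scope | ||
| kubectl delete vmim --all -A | ||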
| ``` | ||
|
|
||
| - Coordinate with noisy tenants; pause offending workloads if necessary. | ||
|
|
||
| 2. Clean up Failed Pods (relieves API/etcd and monitoring): | ||
|
|
||
| ```bash | ||
| # Pod names are namespace-scoped, so delete by selector across all namespaces | ||
| kubectl delete pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed | ||
| ``` | ||
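Because Pod names are only unique within a namespace, a cautious cleanup keeps each namespace and name paired. The sketch below prints delete commands for review instead of running them. `print_delete_commands` is a hypothetical helper, and the `<namespace> <pod>` input layout is an assumption matched by the commented `custom-columns` invocation.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the runbook).
# Reads "<namespace> <pod>" lines on stdin and prints one namespaced
# `kubectl delete` command per Pod. Nothing is executed; the output is meant
# to be reviewed first.
print_delete_commands() {
  awk 'NF == 2 { printf "kubectl -n %s delete pod %s\n", $1, $2 }'
}

# Example usage against a live cluster (left commented out in this sketch):
# kubectl get pods -A -l kubevirt.io=virt-launcher \
#   --field-selector=status.phase=Failed --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#   | print_delete_commands
```

After review, the printed lines can be saved to a file and run in batches, which keeps the deletion rate under operator control.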
|
|
||
| 3. Resolve root cause: | ||
|
|
||
| - Image issues: fix registry access, credentials, or tags; re-run affected workloads. | ||
| - Network/CNI: fix CNI/data-plane errors; confirm new Pods start cleanly. | ||
| - Storage: resolve attach/mount failures; verify PVC/VolumeSnapshot health. | ||
| - KubeVirt regression: roll forward or back to a known-good version and retry. | ||
|
|
||
| 4. Validate resolution (alert clears): | ||
|
|
||
| ```promql | ||
| cluster:kubevirt_virt_launcher_failed:count | ||
| ``` | ||
|
|
||
| Ensure the failed count drops and stays below the threshold, | ||
| and that new virt-launcher Pods start successfully and VMIs are healthy. | ||
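The threshold comparison can be scripted for a post-incident check. This is a minimal sketch: `below_threshold` is a hypothetical helper, it assumes the count has already been fetched as a whole number, and it defaults to the threshold of 200 from the alert condition. It checks only a single reading, not the 10-minute duration the alert requires.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the runbook).
# Usage: below_threshold <failed_count> [threshold]
# Succeeds (exit 0) when the count is strictly below the threshold, mirroring
# the alert condition `>= 200`. Expects integer arguments.
below_threshold() {
  local count="$1"
  local threshold="${2:-200}"
  [ "$count" -lt "$threshold" ]
}

# Example usage with a manually obtained reading:
# below_threshold 150 && echo "below threshold" || echo "still at or above threshold"
```

Repeating the check over at least 10 minutes gives more confidence that the count is staying down rather than dipping briefly.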
|
|
||
| <!--DS: If you cannot resolve the issue, log in to the | ||
| link:https://access.redhat.com[Customer Portal] and open a support case, | ||
| attaching the artifacts gathered during the diagnosis procedure.--> | ||
| <!--USstart--> | ||
| If you cannot resolve the issue, see the following resources: | ||
|
|
||
| - [OKD Help](https://okd.io/docs/community/help/) | ||
| - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) | ||
| <!--USend--> | ||