
VirtLauncherPodsStuckFailed

Meaning

This alert fires when a large number of virt-launcher Pods remain in the Failed phase.

Condition: cluster:kubevirt_virt_launcher_failed:count >= 200 for 10 minutes.

Virt-launcher Pods host VM workloads; mass failures can indicate migration loops, image/network/storage issues, or control-plane regressions.

Impact

VMs and the cluster control plane may be affected:

  • API server and etcd pressure (large object lists/watches, increased latency)
  • Controller and scheduler slowdown (reconciliation over huge Pod sets)
  • Monitoring cardinality spikes (kube-state-metrics and Prometheus load)
  • Operational churn (re-creation loops, CNI and storage attach/detach)
  • Triage noise and SLO risk (timeouts on list operations, noisy dashboards)

Diagnosis

  1. Confirm scope and distribution:
# Total Failed virt-launcher Pods (the alert's recording rule)
cluster:kubevirt_virt_launcher_failed:count
# Failed Pods broken down by namespace
count by (namespace) (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
# Failed Pods broken down by node
count by (node) (
  (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
  * on(pod) group_left(node) kube_pod_info{pod=~"virt-launcher-.*", node!=""}
)
# Top termination reasons among the failed containers
topk(5, count by (reason) (kube_pod_container_status_last_terminated_reason{pod=~"virt-launcher-.*"} == 1))
  2. Sample failed Pods and events:
# List a few failed virt-launcher pods cluster-wide
kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | head -n 20
# Inspect events for a representative pod (image/CNI/storage/useful errors)
kubectl -n <namespace> describe pod <virt-launcher-pod> | sed -n '/Events/,$p'
  3. Check for migration storms:
kubectl get vmim -A
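A long raw vmim listing is hard to read during an incident; breaking the VirtualMachineInstanceMigration objects down by phase makes a storm obvious. A minimal sketch, assuming .status.phase is populated on VMIM objects (verify with -o yaml on your cluster):

```shell
#!/usr/bin/env bash
# Count in-flight migrations by phase. A large, sustained Running/Scheduling
# count alongside many Failed virt-launcher Pods suggests a migration storm.
kubectl get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
  | sort | uniq -c | sort -rn
```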
  4. Control plane and component logs (look for spikes/errors):
NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-controller --tail=200
kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-handler --tail=200
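Eyeballing a 200-line tail can miss a spike. A rough error count over a larger tail is quicker to compare across components; the grep pattern below is an assumption, not an exhaustive list of KubeVirt error strings, so tune it to your log format:

```shell
#!/usr/bin/env bash
# Rough error-rate check on KubeVirt control-plane logs.
# The 'error|failed' pattern is illustrative; adjust for your environment.
NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
for comp in virt-controller virt-handler; do
  echo "== $comp =="
  kubectl -n "$NAMESPACE" logs -l kubevirt.io="$comp" --tail=2000 \
    | grep -ciE 'error|failed' || true
done
```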
  5. Infrastructure checks (common causes):
  • Image pulls: registry reachability/credentials; ImagePullBackOff events
  • Network: CNI errors/timeouts in Pod events and node logs
  • Storage: volume attach/mount errors in Pod events and CSI logs
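The three checks above share one starting point: Warning events. A single scan can surface all of them at once; the event-reason patterns below are illustrative and may need extending for your CNI/CSI stack:

```shell
#!/usr/bin/env bash
# Scan recent Warning events for the image/network/storage signatures above.
# Reasons listed in the grep are common examples, not an exhaustive set.
kubectl get events -A --field-selector=type=Warning --sort-by=.lastTimestamp \
  | grep -E 'ImagePullBackOff|ErrImagePull|FailedMount|FailedAttachVolume|FailedCreatePodSandBox|NetworkNotReady' \
  | tail -n 40
```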

Mitigation

  1. Reduce blast radius:
  • Migration loop: cancel in-flight migrations (scope as needed)
kubectl delete vmim -A
  • Coordinate with noisy tenants; pause offending workloads if necessary.
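When the loop is confined to one tenant, "scope as needed" can mean cancelling only that tenant's migrations instead of the cluster-wide delete above. A sketch, with &lt;namespace&gt; as a placeholder:

```shell
#!/usr/bin/env bash
# Cancel in-flight migrations in a single namespace only.
# Deleting a VMIM aborts that migration; the VMI itself is untouched.
NS="<namespace>"   # placeholder: the offending tenant's namespace
kubectl -n "$NS" delete vmim --all
```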
  2. Clean up Failed Pods (relieves API/etcd and monitoring):
# Delete per namespace: -o name omits the namespace, so a plain xargs delete would target the wrong namespace
kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' | while read -r ns pod; do kubectl -n "$ns" delete pod "$pod" --wait=false; done
  3. Resolve root cause:
  • Image issues: fix registry access, credentials, or tags; re-run affected workloads.
  • Network/CNI: fix CNI/data-plane errors; confirm new Pods start cleanly.
  • Storage: resolve attach/mount failures; verify PVC/VolumeSnapshot health.
  • KubeVirt regression: roll forward/back to a known-good version and re-try.
  4. Validate resolution (alert clears):
cluster:kubevirt_virt_launcher_failed:count

Ensure the failed count drops and stays below the threshold, that new virt-launcher Pods start successfully, and that VMIs are healthy.
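The same validation can be done from kubectl while waiting for the alert to clear. A minimal sketch, assuming .status.phase is populated on VMIs:

```shell
#!/usr/bin/env bash
# Remaining Failed virt-launcher Pods: should trend to zero.
kubectl get pods -A -l kubevirt.io=virt-launcher \
  --field-selector=status.phase=Failed --no-headers | wc -l
# VMI health overview by phase: healthy clusters are dominated by Running.
kubectl get vmi -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
  | sort | uniq -c
```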

If you cannot resolve the issue, see the following resources: