
VirtLauncherPodsStuckFailed

Meaning

This alert fires when a large number of virt-launcher Pods remain in the Failed phase.

Condition: cluster:kubevirt_virt_launcher_failed:count >= 200 for 10 minutes.

Virt-launcher Pods host VM workloads; mass failures can indicate migration loops, image/network/storage issues, or control-plane regressions.

Impact

VMs and the cluster control plane may be affected:

  • API server and etcd pressure (large object lists/watches, increased latency)
  • Controller and scheduler slowdown (reconciliation over huge Pod sets)
  • Monitoring cardinality spikes (kube-state-metrics and Prometheus load)
  • Operational churn (re-creation loops, CNI and storage attach/detach)
  • Triage noise and SLO risk (timeouts on list operations, noisy dashboards)

Diagnosis

  1. Confirm scope and distribution:
# Total Failed virt-launcher Pods (the alert's recording rule)
cluster:kubevirt_virt_launcher_failed:count
# Failed Pods broken down by namespace
count by (namespace) (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
# Failed Pods broken down by node
count by (node) (
  (kube_pod_status_phase{phase="Failed", pod=~"virt-launcher-.*"} == 1)
  * on(pod) group_left(node) kube_pod_info{pod=~"virt-launcher-.*", node!=""}
)
# Top termination reasons among the failed containers
topk(5, count by (reason) (kube_pod_container_status_last_terminated_reason{pod=~"virt-launcher-.*"} == 1))
  2. Sample failed Pods and events:
# List a few failed virt-launcher pods cluster-wide
kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed --no-headers | head -n 20
# Inspect events for a representative pod (image/CNI/storage/useful errors)
kubectl -n <namespace> describe pod <virt-launcher-pod> | sed -n '/Events/,$p'
  3. Check for migration storms:
kubectl get vmim -A
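A long raw vmim listing is hard to read during an incident; breaking the VirtualMachineInstanceMigration objects down by phase makes a storm obvious. A minimal sketch, assuming .status.phase is populated on VMIM objects (verify with -o yaml on your cluster):

```shell
#!/usr/bin/env bash
# Count in-flight migrations by phase. A large, sustained Running/Scheduling
# count alongside many Failed virt-launcher Pods suggests a migration storm.
kubectl get vmim -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
  | sort | uniq -c | sort -rn
```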
  4. Control plane and component logs (look for spikes/errors):
NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-controller --tail=200
kubectl -n "$NAMESPACE" logs -l kubevirt.io=virt-handler --tail=200
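Eyeballing a 200-line tail can miss a spike. A rough error count over a larger tail is quicker to compare across components; the grep pattern below is an assumption, not an exhaustive list of KubeVirt error strings, so tune it to your log format:

```shell
#!/usr/bin/env bash
# Rough error-rate check on KubeVirt control-plane logs.
# The 'error|failed' pattern is illustrative; adjust for your environment.
NAMESPACE="$(kubectl get kubevirt -A -o jsonpath='{.items[0].metadata.namespace}')"
for comp in virt-controller virt-handler; do
  echo "== $comp =="
  kubectl -n "$NAMESPACE" logs -l kubevirt.io="$comp" --tail=2000 \
    | grep -ciE 'error|failed' || true
done
```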
  5. Infrastructure checks (common causes):
  • Image pulls: registry reachability/credentials; ImagePullBackOff events
  • Network: CNI errors/timeouts in Pod events and node logs
  • Storage: volume attach/mount errors in Pod events and CSI logs
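The three checks above share one starting point: Warning events. A single scan can surface all of them at once; the event-reason patterns below are illustrative and may need extending for your CNI/CSI stack:

```shell
#!/usr/bin/env bash
# Scan recent Warning events for the image/network/storage signatures above.
# Reasons listed in the grep are common examples, not an exhaustive set.
kubectl get events -A --field-selector=type=Warning --sort-by=.lastTimestamp \
  | grep -E 'ImagePullBackOff|ErrImagePull|FailedMount|FailedAttachVolume|FailedCreatePodSandBox|NetworkNotReady' \
  | tail -n 40
```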

Mitigation

  1. Reduce blast radius:
  • Migration loop: cancel in-flight migrations (scope as needed)
kubectl delete vmim -A
  • Coordinate with noisy tenants; pause offending workloads if necessary.
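When the loop is confined to one tenant, "scope as needed" can mean cancelling only that tenant's migrations instead of the cluster-wide delete above. A sketch, with &lt;namespace&gt; as a placeholder:

```shell
#!/usr/bin/env bash
# Cancel in-flight migrations in a single namespace only.
# Deleting a VMIM aborts that migration; the VMI itself is untouched.
NS="<namespace>"   # placeholder: the offending tenant's namespace
kubectl -n "$NS" delete vmim --all
```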
  2. Clean up Failed Pods (relieves API/etcd and monitoring):
# Delete per namespace: -o name omits the namespace, so a plain xargs delete would target the wrong namespace
kubectl get pods -A -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}' | while read -r ns pod; do kubectl -n "$ns" delete pod "$pod" --wait=false; done
  3. Resolve root cause:
  • Image issues: fix registry access, credentials, or tags; re-run affected workloads.
  • Network/CNI: fix CNI/data-plane errors; confirm new Pods start cleanly.
  • Storage: resolve attach/mount failures; verify PVC/VolumeSnapshot health.
  • KubeVirt regression: roll forward/back to a known-good version and re-try.
  4. Validate resolution (alert clears):
cluster:kubevirt_virt_launcher_failed:count

Ensure the failed count drops and stays below the threshold, that new virt-launcher Pods start successfully, and that VMIs are healthy.
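The same validation can be done from kubectl while waiting for the alert to clear. A minimal sketch, assuming .status.phase is populated on VMIs:

```shell
#!/usr/bin/env bash
# Remaining Failed virt-launcher Pods: should trend to zero.
kubectl get pods -A -l kubevirt.io=virt-launcher \
  --field-selector=status.phase=Failed --no-headers | wc -l
# VMI health overview by phase: healthy clusters are dominated by Running.
kubectl get vmi -A -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
  | sort | uniq -c
```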

If you cannot resolve the issue, see the following resources: