[Kubernetes] Give late-binding PVCs more provision time; don't misreport WaitForFirstConsumer as a binding failure#9961
Conversation
…ort WaitForFirstConsumer as a binding failure test_volumes_on_kubernetes flakes on the kind CI host. The test launches a pod with 4 ReadWriteOnce PVCs on kind's `standard` storage class (rancher.io/local-path, volumeBindingMode: WaitForFirstConsumer). Such a PVC stays Pending by design until its consumer pod is scheduled, after which the provisioner creates and binds the volume. The provision_timeout for ReadWriteOnce PVCs was only 10s (the 180-240s bump in _calculate_provision_timeout applied solely to READ_WRITE_MANY). On a loaded host the WaitForFirstConsumer binding + scheduling for several PVCs exceeds 10s, so _wait_for_pods_to_schedule times out and _raise_pod_scheduling_errors mislabels it as a "PVC binding issue detected ... WaitForFirstConsumer" error, failing the launch. Measured on the actual CI VM (idle): pod scheduled in ~6.1s, 4 PVCs bound in ~5.1s — 60% of the 10s budget before any contention. The failing run hit exactly the 10s gate (Launching 04:23:35.554 -> raise 04:23:46.088); the passing retry provisioned in ~6.0s. Two fixes: - Extend provision_timeout to 30s for any PVC-backed launch (not just RWX), since late-binding RWO PVCs also need more than the 10s default before failover. - In _get_pvc_binding_status, treat a PVC that is Pending solely on a WaitForFirstConsumer event (no Warning/ProvisioningFailed) as normal late binding in progress rather than a binding failure. Genuine failures still emit Warning/ProvisioningFailed events and are reported. This also removes the contradiction with the existing WaitForFirstConsumer benign classification in instance.py. Failed build: https://buildkite.com/skypilot-1/smoke-tests/builds/11456 Tested: - pytest tests/unit_tests/kubernetes/test_provision.py -k pvc (all pass, incl. new test_get_pvc_binding_status_wait_for_first_consumer and the existing genuine-failure test) - pytest tests/unit_tests/kubernetes/test_provision_timeout.py (new) - format.sh (yapf/isort/mypy clean) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the Kubernetes cloud provider implementation to better handle late-binding storage classes (e.g., WaitForFirstConsumer). Specifically, it increases the base provision timeout to 30 seconds for ReadWriteOnce (RWO) PVCs to prevent healthy-but-slow provisions from being misread as failures. It also updates the PVC binding status check to ignore benign WaitForFirstConsumer events unless actual warning or provisioning failure events are present. Finally, it adds comprehensive unit tests to verify these timeout and binding status behaviors. I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/smoke-test -k test_volumes_on_kubernetes --kubernetes |
DanielZhangQD
left a comment
There was a problem hiding this comment.
LGTM! Thanks! @zpoint
Summary
Fixes a flake in
test_volumes_on_kubernetes(nightly k8s smoke tests) and the misleading error behind it.The test launches a pod with 4 ReadWriteOnce PVCs on kind's
standardstorage class (rancher.io/local-path,volumeBindingMode: WaitForFirstConsumer). Such a PVC staysPendingby design until its consumer pod is scheduled, after which the provisioner creates and binds the volume.Two problems caused the flake:
provision_timeoutwas only 10s for ReadWriteOnce PVCs. The 180–240s bump in_calculate_provision_timeoutapplied only toREAD_WRITE_MANY. On a loaded CI host,WaitForFirstConsumerbinding + scheduling for several PVCs exceeds 10s, so_wait_for_pods_to_scheduletimes out → failover → launch fails._get_pvc_binding_statusflagged anyPendingPVC — including one waiting purely onWaitForFirstConsumer— asPVC binding issue detected …, which is normal late binding in progress, not a failure. (This also contradicted the existing benignWaitForFirstConsumerclassification ininstance.py.)Evidence (measured on the actual CI VM)
04:53:31.5 → 04:53:37.5)04:23:35.554 → 04:23:46.088)The failing run consumed ~60% of the 10s budget on an idle box; nightly contention across 5 kind clusters on one host tips it over. The same commit passed on retry with no code change → timing flake.
Changes
sky/clouds/kubernetes.py: extendprovision_timeoutto 30s for any PVC-backed launch (not just RWX). 30s ≈ 5× the idle baseline and 3× the default, while still failing fast for genuinely unschedulable pods.sky/provision/kubernetes/instance.py: in_get_pvc_binding_status, treat a PVC that isPendingsolely on aWaitForFirstConsumerevent (noWarning/ProvisioningFailed) as late binding in progress, not a binding failure. Genuine failures still emitWarning/ProvisioningFailedand are reported as before.Test plan
pytest tests/unit_tests/kubernetes/test_provision.py -k pvc— all pass, incl. newtest_get_pvc_binding_status_wait_for_first_consumerand the existing genuine-failure test (unchanged behavior).pytest tests/unit_tests/kubernetes/test_provision_timeout.py— new, covers RWO/RWX/mixed/non-PVC/queueing cases.format.shclean (yapf/isort/mypy/pylint).standardSC isWaitForFirstConsumer/local-path and measured the 6s provisioning latency.Failed build: https://buildkite.com/skypilot-1/smoke-tests/builds/11456
🤖 Generated with Claude Code