Skip to content

[Kubernetes] Give late-binding PVCs more provision time; don't misreport WaitForFirstConsumer as a binding failure#9961

Open
zpoint wants to merge 1 commit into
skypilot-org:masterfrom
zpoint:fix/k8s-pvc-provision-timeout-waitforfirstconsumer
Open

[Kubernetes] Give late-binding PVCs more provision time; don't misreport WaitForFirstConsumer as a binding failure#9961
zpoint wants to merge 1 commit into
skypilot-org:masterfrom
zpoint:fix/k8s-pvc-provision-timeout-waitforfirstconsumer

Conversation

@zpoint

@zpoint zpoint commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes a flake in test_volumes_on_kubernetes (nightly k8s smoke tests) and the misleading error behind it.

The test launches a pod with 4 ReadWriteOnce PVCs on kind's standard storage class (rancher.io/local-path, volumeBindingMode: WaitForFirstConsumer). Such a PVC stays Pending by design until its consumer pod is scheduled, after which the provisioner creates and binds the volume.

Two problems caused the flake:

  1. provision_timeout was only 10s for ReadWriteOnce PVCs. The 180–240s bump in _calculate_provision_timeout applied only to READ_WRITE_MANY. On a loaded CI host, WaitForFirstConsumer binding + scheduling for several PVCs exceeds 10s, so _wait_for_pods_to_schedule times out → failover → launch fails.
  2. The timeout was mislabeled as a PVC binding failure. _get_pvc_binding_status flagged any Pending PVC — including one waiting purely on WaitForFirstConsumer — as PVC binding issue detected …, which is normal late binding in progress, not a failure. (This also contradicted the existing benign WaitForFirstConsumer classification in instance.py.)

Evidence (measured on the actual CI VM)

Scenario Launch time Outcome
Idle host (live measurement) pod scheduled 6.1s / 4 PVCs bound 5.1s
Idle host (passing retry) ~6.0s (04:53:31.5 → 04:53:37.5) ✅ under 10s
Under nightly load (failing run) >10.5s (04:23:35.554 → 04:23:46.088) ❌ hit the 10s gate

The failing run consumed ~60% of the 10s budget on an idle box; nightly contention across 5 kind clusters on one host tips it over. The same commit passed on retry with no code change → timing flake.

Changes

  • sky/clouds/kubernetes.py: extend provision_timeout to 30s for any PVC-backed launch (not just RWX). 30s ≈ 5× the idle baseline and 3× the default, while still failing fast for genuinely unschedulable pods.
  • sky/provision/kubernetes/instance.py: in _get_pvc_binding_status, treat a PVC that is Pending solely on a WaitForFirstConsumer event (no Warning/ProvisioningFailed) as late binding in progress, not a binding failure. Genuine failures still emit Warning/ProvisioningFailed and are reported as before.

Test plan

  • pytest tests/unit_tests/kubernetes/test_provision.py -k pvc — all pass, incl. new test_get_pvc_binding_status_wait_for_first_consumer and the existing genuine-failure test (unchanged behavior).
  • pytest tests/unit_tests/kubernetes/test_provision_timeout.py — new, covers RWO/RWX/mixed/non-PVC/queueing cases.
  • format.sh clean (yapf/isort/mypy/pylint).
  • Manual verification on the failing CI host: confirmed standard SC is WaitForFirstConsumer/local-path and measured the 6s provisioning latency.

Failed build: https://buildkite.com/skypilot-1/smoke-tests/builds/11456

🤖 Generated with Claude Code

…ort WaitForFirstConsumer as a binding failure

test_volumes_on_kubernetes flakes on the kind CI host. The test launches a
pod with 4 ReadWriteOnce PVCs on kind's `standard` storage class
(rancher.io/local-path, volumeBindingMode: WaitForFirstConsumer). Such a
PVC stays Pending by design until its consumer pod is scheduled, after
which the provisioner creates and binds the volume.

The provision_timeout for ReadWriteOnce PVCs was only 10s (the 180-240s
bump in _calculate_provision_timeout applied solely to READ_WRITE_MANY).
On a loaded host the WaitForFirstConsumer binding + scheduling for several
PVCs exceeds 10s, so _wait_for_pods_to_schedule times out and
_raise_pod_scheduling_errors mislabels it as a "PVC binding issue detected
... WaitForFirstConsumer" error, failing the launch.

Measured on the actual CI VM (idle): pod scheduled in ~6.1s, 4 PVCs bound
in ~5.1s — 60% of the 10s budget before any contention. The failing run hit
exactly the 10s gate (Launching 04:23:35.554 -> raise 04:23:46.088); the
passing retry provisioned in ~6.0s.

Two fixes:
- Extend provision_timeout to 30s for any PVC-backed launch (not just RWX),
  since late-binding RWO PVCs also need more than the 10s default before
  failover.
- In _get_pvc_binding_status, treat a PVC that is Pending solely on a
  WaitForFirstConsumer event (no Warning/ProvisioningFailed) as normal
  late binding in progress rather than a binding failure. Genuine failures
  still emit Warning/ProvisioningFailed events and are reported. This also
  removes the contradiction with the existing WaitForFirstConsumer benign
  classification in instance.py.

Failed build: https://buildkite.com/skypilot-1/smoke-tests/builds/11456

Tested:
- pytest tests/unit_tests/kubernetes/test_provision.py -k pvc (all pass,
  incl. new test_get_pvc_binding_status_wait_for_first_consumer and the
  existing genuine-failure test)
- pytest tests/unit_tests/kubernetes/test_provision_timeout.py (new)
- format.sh (yapf/isort/mypy clean)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Kubernetes cloud provider implementation to better handle late-binding storage classes (e.g., WaitForFirstConsumer). Specifically, it increases the base provision timeout to 30 seconds for ReadWriteOnce (RWO) PVCs to prevent healthy-but-slow provisions from being misread as failures. It also updates the PVC binding status check to ignore benign WaitForFirstConsumer events unless actual warning or provisioning failure events are present. Finally, it adds comprehensive unit tests to verify these timeout and binding status behaviors. I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@zpoint

zpoint commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

/smoke-test -k test_volumes_on_kubernetes --kubernetes

@DanielZhangQD DanielZhangQD left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks! @zpoint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants