[Kubernetes] Respect configured pod tolerations in node-health display by rohansonecha · Pull Request #9750 · skypilot-org/skypilot

rohansonecha · 2026-05-28T22:56:39Z

Summary

Today the dashboard's Infra page, sky show-gpus, and any caller of get_kubernetes_node_info() flag a node as not-ready (and report 0 free GPUs) whenever it carries any taint that isn't already in the silently-filtered set (cordon, not-ready, PreferNoSchedule, GPU/TPU resource keys, role prefixes). This is wrong for the common case where the user has tolerated a custom taint via kubernetes.pod_config.spec.tolerations: jobs schedule fine, but the dashboard still says the node is broken.

This PR layers user-toleration awareness on top of the existing taint extraction:

New get_configured_tolerations(context) reads kubernetes.pod_config.spec.tolerations via get_effective_region_config, mirroring the existing get_allowed_nodes_config. Returns None when nothing is configured so downstream call sites are byte-identical to today.
New taint_is_tolerated(taint, tolerations) implements standard Kubernetes toleration semantics (Equal vs. Exists; empty effect matches all; empty key + Exists is a wildcard).
V1Node.get_taints(...) gains one optional tolerations= kwarg. When passed, each retained taint dict gets a 'tolerated': bool key. Existing exclude_* filtering is unchanged.
get_kubernetes_node_info() and _get_total_accelerators() thread the configured tolerations through and compute node_is_tainted = any(not t.tolerated for t in node_taints). The separate is_ready/is_cordoned gating is intentionally untouched — a cordoned or not-ready node still reports 0 free GPUs and keeps its dedicated status badge.
Infra page narrows the existing gray taint chip to un-tolerated taints; tolerated-only nodes fall into the existing green "Healthy" rendering (no new chip type).

Test plan

23 new unit tests in tests/unit_tests/kubernetes/test_kubernetes_utils.py:
- test_taint_is_tolerated - 14-row parametrized table covering Equal/Exists, empty key + Exists wildcard, empty effect matches all, value/effect mismatch, multiple-toleration any-match, and defensive handling of malformed entries.
- test_get_configured_tolerations_* - None / configured / explicit-empty / malformed-garbage paths.
- test_v1node_get_taints_tolerations_* - tolerations=None returns byte-identical dicts to today; tolerations=[] marks every retained taint tolerated=False; matching tolerations mark the right entries True; exclude_* filtering still applies regardless of tolerations.
Full tests/unit_tests/kubernetes/test_kubernetes_utils.py passes (165 tests).
Dashboard builds clean with npm run build in sky/dashboard.
Manual e2e on a live Kubernetes cluster (taint a node + configure a matching toleration + confirm sky show-gpus and Infra page reflect the node as Healthy) - to do before merge.

Also runnable via /quicktest-core and /smoke-test --kubernetes on this PR.

devin-ai-integration

Devin Review found 2 potential issues.

🐛 2 issues in files not directly in the diff

🐛 CLI `_count_not_ready_gpus` not updated to respect `tolerated` flag, inflating "not ready" GPU counts (`sky/client/cli/command.py:4233`)

The PR updated the taint-checking logic in kubernetes_catalog.py:220 and utils.py:4165 to use any(not t.get('tolerated', False) for t in node_taints), but the CLI at sky/client/cli/command.py:4233 still uses len(node_taints) > 0. After this PR, get_kubernetes_node_info() stores taint dicts with 'tolerated': True in KubernetesNodeInfo.taints for tolerated taints. The CLI's _count_not_ready_gpus counts these nodes as "not ready" even though the backend treats them as schedulable. This causes sky show-gpus --cloud kubernetes to show spurious "(N not ready)" annotations and reduced effective capacity for users with configured pod tolerations.

⚠️ CLI per-node status display shows tolerated taints as unhealthy indicators (`sky/client/cli/command.py:4474-4490`)

The per-node table in the CLI (sky/client/cli/command.py:4474-4490) shows all taints in the status column without filtering out tolerated ones. A node whose only taints are tolerated would be displayed with taint status (e.g., NoSchedule Taint [workload_pool]) instead of Healthy. This is inconsistent with both the dashboard (which filters to untoleratedTaints at sky/dashboard/src/components/infra.jsx:888-889) and the backend logic. Users see a node flagged as unhealthy in the CLI while the same node shows as healthy in the dashboard.

View 3 additional findings in Devin Review.

Today the dashboard's Infra page, `sky show-gpus`, and any caller of `get_kubernetes_node_info()` treats *any* node taint (after the existing PreferNoSchedule / cordon / not-ready / handled-key / role-prefix filtering) as unschedulable. That's wrong for the common case where a user has configured `kubernetes.pod_config.spec.tolerations` to tolerate a custom taint on their nodes: jobs schedule fine, but the dashboard shows the nodes as not-ready and `show-gpus` reports 0 available GPUs. Add a thin layer that respects those configured tolerations when deciding whether a remaining taint makes a node unschedulable for the user's workloads: - New `get_configured_tolerations(context)` reads `kubernetes.pod_config.spec.tolerations` via the existing `get_effective_region_config` plumbing (so per-context overrides Just Work). Returns `None` when nothing is configured, which makes downstream call sites byte-identical to today. - New `taint_is_tolerated(taint, tolerations)` implements standard K8s toleration semantics (Equal vs. Exists; empty effect matches all; empty key + Exists is a wildcard). - `V1Node.get_taints()` gains one optional `tolerations=` kwarg. When passed (a list, possibly empty), each retained taint dict gets a `'tolerated': bool` key. Existing `exclude_*` filtering is unchanged. Omitted / None preserves today's exact wire shape. - `get_kubernetes_node_info()` and `_get_total_accelerators()` thread the configured tolerations through and recompute `node_is_tainted` as `any(not t.tolerated for t in node_taints)`. The separate `is_ready` / `is_cordoned` gating is untouched. - Infra page dashboard: narrow the existing gray taint chip to un-tolerated taints; tolerated-only nodes fall into the existing green "Healthy" rendering. No new chip type. Tested: - Added 23 unit tests in `tests/unit_tests/kubernetes/test_kubernetes_utils.py` covering `taint_is_tolerated` (14 K8s-semantics rows), `get_configured_tolerations` (None / configured / empty / malformed), and `V1Node.get_taints(tolerations=...)` (no-tolerations is byte-identical; empty list marks all tolerated=False; matching toleration marks tolerated=True; exclude_* filtering still applies). - Full kubernetes_utils unit-test file passes (165 tests). - Dashboard builds clean with `npm run build` in `sky/dashboard`. - Manual e2e verification on a live Kubernetes cluster (taint a node + configure a matching toleration + confirm `sky show-gpus` and Infra page reflect the node as Healthy) is planned before merge.

rohansonecha · 2026-05-29T00:16:43Z

Addressed review comments in 85b0912265:

gemini @ sky/provision/kubernetes/utils.py:1625 — YAML int/bool coercion: taint_is_tolerated now wraps every taint/toleration key, value, effect, and operator in str() before comparing. Added two unit tests (test_taint_is_tolerated_coerces_yaml_int_value_to_str, ..._bool_..._str) that fail without the coercion.
devin @ sky/provision/kubernetes/utils.py:1585 — get_configured_tolerations doesn't replicate resolve_effective_pod_config merge semantics: Rewrote to fetch ('pod_config',) (the whole dict) instead of ('pod_config', 'spec', 'tolerations') (the leaf list). This triggers the K8s-specific dict merge (merge_k8s_configs) so per-context toleration lists are appended onto the global list rather than clobbering it — matching what an actual pod gets at scheduling time. Added two regression tests:
- test_get_configured_tolerations_fetches_pod_config_dict — asserts we call get_effective_region_config with keys=('pod_config',), not the leaf path.
- test_get_configured_tolerations_extracts_from_pod_config_dict — asserts we correctly extract spec.tolerations out of the merged dict.

Hoist the un-tolerated-taint computation above the `if taints:` guard since an empty/None taints list already produces an empty list comprehension, then drop the inner `if untolerated_taints:` since the outer one already gates rendering. Equivalent behavior, one less level of nesting. Addresses review comment skypilot-org#9750 (discussion_r3321369713).

The previous `str(x or '')` idiom collapses Python-falsy values to '' — `str(0 or '')` → `''`, `str(False or '')` → `''`. A YAML-parsed `value: 0` toleration would therefore fail to match a K8s taint whose value is the string `'0'`, falsely marking the node un-tolerated. Switch to `'' if v is None else str(v)` (via a local helper) which distinguishes 'absent' from 'set to a falsy value'. The Equal operator default keeps its `or 'Equal'` since '' is not a valid operator. Added regression tests for `value: 0` (int) and `value: false` (bool), plus an inverse-collapse case (value-less taint, '0' toleration must NOT match). Addresses review comment skypilot-org#9750 (discussion_r3321364550).

Applied 5 cleanups from a /simplify pass: 1. Extracted `has_untolerated_taint(taints)` helper in `sky/provision/kubernetes/utils.py`. The `any(not t.get('tolerated', False) for t in taints)` predicate was copy-pasted in 3 Python sites (catalog `_get_total_accelerators`, utils `get_kubernetes_node_info`, CLI `_count_not_ready_gpus`) — collapsed to a single call. Added a parametrized unit test. 2. `get_configured_tolerations` now delegates to the existing `resolve_effective_pod_config(...)` instead of reimplementing the same `get_effective_region_config(keys=('pod_config',))` + dict lookup chain. Same K8s-specific dict-merge for per-context tolerations; fewer lines of duplication; relies on the canonical reader. 3. Moved `_str_or_empty` from a closure inside `taint_is_tolerated` to module scope. Avoids constructing a function object on every matcher call. 4. Parametrized the 5 `*_coerces_*` unit tests into one `test_taint_is_tolerated_str_coercion` with 5 ids (yaml-int / yaml-bool / yaml-int-zero / yaml-bool-false / inverse-collapse). Same regression coverage, ~40 LOC less. 5. Extracted `isNodeNotReadyForGpus(nodeData)` helper in `sky/dashboard/src/data/connectors/infra.jsx`. The isReady/isCordoned/isTainted/isNodeNotReady block was duplicated across two near-identical loops — both call sites now share one helper, eliminating drift risk. 178 unit tests pass; dashboard build clean.

kevinmingtarja

LGTM once comments are addressed!

kevinmingtarja · 2026-05-29T06:46:30Z

/quicktest-core --kubernetes

@kevinmingtarja

…ntext Two fixes from @kevinmingtarja: 1. `get_configured_tolerations` now passes `cloud=clouds.SSH()` to `resolve_effective_pod_config` for `ssh-<pool>` contexts. Previously it always passed `cloud=None`, which made `resolve_effective_pod_config` read the `kubernetes.*` config namespace and skip the `ssh-` prefix-stripping. For an SSH-pool context that meant SSH-scoped tolerations would be missed and a global `kubernetes.pod_config` toleration would leak onto SSH node health. Auto-detect via `context.startswith('ssh-')` to keep the helper's signature unchanged. 2. `get_kubernetes_node_info` direct-path branch now resolves `context=None` to `get_current_kube_config_context_name()` before reading tolerations. `get_kubernetes_nodes(context=None)` internally resolves to the current kube context, but `get_configured_tolerations(None)` would have skipped `context_configs` overrides — so a user with a per-context toleration on the current context would not see those nodes honored. The plugin branch above already resolves the same way. Added two regression tests: `*_ssh_context_uses_ssh_namespace` asserts the underlying `get_effective_region_config` is called with `(cloud='ssh', region='cluster1')` for an `ssh-cluster1` context. Addresses skypilot-org#9750 (discussion_r3322597523 + r3322597524).

rohansonecha self-assigned this May 28, 2026

This comment was marked as resolved.

Sign in to view

devin-ai-integration Bot reviewed May 28, 2026

View reviewed changes

rohansonecha force-pushed the sky-4496-tolerated-taints branch from 6542c05 to 8fe9e3b Compare May 28, 2026 23:04

This comment was marked as resolved.

Sign in to view

rohansonecha force-pushed the sky-4496-tolerated-taints branch from 8fe9e3b to 52c4697 Compare May 28, 2026 23:53

rohansonecha force-pushed the sky-4496-tolerated-taints branch from 52c4697 to 85b0912 Compare May 29, 2026 00:14

rohansonecha requested a review from kevinmingtarja May 29, 2026 00:16

This comment was marked as resolved.

Sign in to view

rohansonecha commented May 29, 2026

View reviewed changes

Comment thread sky/client/cli/command.py Outdated

rohansonecha added 3 commits May 29, 2026 00:24

kevinmingtarja approved these changes May 29, 2026

View reviewed changes

Comment thread sky/provision/kubernetes/utils.py

Comment thread sky/provision/kubernetes/utils.py

rohansonecha merged commit 476a60b into skypilot-org:master May 29, 2026
23 checks passed

rohansonecha mentioned this pull request May 29, 2026

[Kubernetes] Coerce Python bool tolerations to lowercase for taint match #9756

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Kubernetes] Respect configured pod tolerations in node-health display#9750

[Kubernetes] Respect configured pod tolerations in node-health display#9750
rohansonecha merged 5 commits into
skypilot-org:masterfrom
rohansonecha:sky-4496-tolerated-taints

rohansonecha commented May 28, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

This comment was marked as resolved.

Uh oh!

rohansonecha commented May 29, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

kevinmingtarja left a comment

Uh oh!

Uh oh!

Uh oh!

kevinmingtarja commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rohansonecha commented May 28, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

🐛 CLI _count_not_ready_gpus not updated to respect tolerated flag, inflating "not ready" GPU counts (sky/client/cli/command.py:4233)

⚠️ CLI per-node status display shows tolerated taints as unhealthy indicators (sky/client/cli/command.py:4474-4490)

Uh oh!

This comment was marked as resolved.

Uh oh!

rohansonecha commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

kevinmingtarja left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kevinmingtarja commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rohansonecha commented May 28, 2026 •

edited by devin-ai-integration Bot

Loading

🐛 CLI `_count_not_ready_gpus` not updated to respect `tolerated` flag, inflating "not ready" GPU counts (`sky/client/cli/command.py:4233`)

⚠️ CLI per-node status display shows tolerated taints as unhealthy indicators (`sky/client/cli/command.py:4474-4490`)

rohansonecha commented May 29, 2026 •

edited

Loading