Skip to content

[Kubernetes] Respect configured pod tolerations in node-health display#9750

Merged
rohansonecha merged 5 commits into
skypilot-org:masterfrom
rohansonecha:sky-4496-tolerated-taints
May 29, 2026
Merged

[Kubernetes] Respect configured pod tolerations in node-health display#9750
rohansonecha merged 5 commits into
skypilot-org:masterfrom
rohansonecha:sky-4496-tolerated-taints

Conversation

@rohansonecha
Copy link
Copy Markdown
Collaborator

@rohansonecha rohansonecha commented May 28, 2026

Summary

Today the dashboard's Infra page, sky show-gpus, and any caller of get_kubernetes_node_info() flag a node as not-ready (and report 0 free GPUs) whenever it carries any taint that isn't already in the silently-filtered set (cordon, not-ready, PreferNoSchedule, GPU/TPU resource keys, role prefixes). This is wrong for the common case where the user has tolerated a custom taint via kubernetes.pod_config.spec.tolerations: jobs schedule fine, but the dashboard still says the node is broken.

This PR layers user-toleration awareness on top of the existing taint extraction:

  • New get_configured_tolerations(context) reads kubernetes.pod_config.spec.tolerations via get_effective_region_config, mirroring the existing get_allowed_nodes_config. Returns None when nothing is configured so downstream call sites are byte-identical to today.
  • New taint_is_tolerated(taint, tolerations) implements standard Kubernetes toleration semantics (Equal vs. Exists; empty effect matches all; empty key + Exists is a wildcard).
  • V1Node.get_taints(...) gains one optional tolerations= kwarg. When passed, each retained taint dict gets a 'tolerated': bool key. Existing exclude_* filtering is unchanged.
  • get_kubernetes_node_info() and _get_total_accelerators() thread the configured tolerations through and compute node_is_tainted = any(not t.tolerated for t in node_taints). The separate is_ready/is_cordoned gating is intentionally untouched — a cordoned or not-ready node still reports 0 free GPUs and keeps its dedicated status badge.
  • Infra page narrows the existing gray taint chip to un-tolerated taints; tolerated-only nodes fall into the existing green "Healthy" rendering (no new chip type).

Test plan

  • 23 new unit tests in tests/unit_tests/kubernetes/test_kubernetes_utils.py:
    • test_taint_is_tolerated - 14-row parametrized table covering Equal/Exists, empty key + Exists wildcard, empty effect matches all, value/effect mismatch, multiple-toleration any-match, and defensive handling of malformed entries.
    • test_get_configured_tolerations_* - None / configured / explicit-empty / malformed-garbage paths.
    • test_v1node_get_taints_tolerations_* - tolerations=None returns byte-identical dicts to today; tolerations=[] marks every retained taint tolerated=False; matching tolerations mark the right entries True; exclude_* filtering still applies regardless of tolerations.
  • Full tests/unit_tests/kubernetes/test_kubernetes_utils.py passes (165 tests).
  • Dashboard builds clean with npm run build in sky/dashboard.
  • Manual e2e on a live Kubernetes cluster (taint a node + configure a matching toleration + confirm sky show-gpus and Infra page reflect the node as Healthy) - to do before merge.

Also runnable via /quicktest-core and /smoke-test --kubernetes on this PR.


Open in Devin Review

@rohansonecha rohansonecha self-assigned this May 28, 2026
gemini-code-assist[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

🐛 2 issues in files not directly in the diff

🐛 CLI _count_not_ready_gpus not updated to respect tolerated flag, inflating "not ready" GPU counts (sky/client/cli/command.py:4233)

The PR updated the taint-checking logic in kubernetes_catalog.py:220 and utils.py:4165 to use any(not t.get('tolerated', False) for t in node_taints), but the CLI at sky/client/cli/command.py:4233 still uses len(node_taints) > 0. After this PR, get_kubernetes_node_info() stores taint dicts with 'tolerated': True in KubernetesNodeInfo.taints for tolerated taints. The CLI's _count_not_ready_gpus counts these nodes as "not ready" even though the backend treats them as schedulable. This causes sky show-gpus --cloud kubernetes to show spurious "(N not ready)" annotations and reduced effective capacity for users with configured pod tolerations.


⚠️ CLI per-node status display shows tolerated taints as unhealthy indicators (sky/client/cli/command.py:4474-4490)

The per-node table in the CLI (sky/client/cli/command.py:4474-4490) shows all taints in the status column without filtering out tolerated ones. A node whose only taints are tolerated would be displayed with taint status (e.g., NoSchedule Taint [workload_pool]) instead of Healthy. This is inconsistent with both the dashboard (which filters to untoleratedTaints at sky/dashboard/src/components/infra.jsx:888-889) and the backend logic. Users see a node flagged as unhealthy in the CLI while the same node shows as healthy in the dashboard.

View 3 additional findings in Devin Review.

Open in Devin Review

@rohansonecha rohansonecha force-pushed the sky-4496-tolerated-taints branch from 6542c05 to 8fe9e3b Compare May 28, 2026 23:04
devin-ai-integration[bot]

This comment was marked as resolved.

@rohansonecha rohansonecha force-pushed the sky-4496-tolerated-taints branch from 8fe9e3b to 52c4697 Compare May 28, 2026 23:53
Today the dashboard's Infra page, `sky show-gpus`, and any caller of
`get_kubernetes_node_info()` treats *any* node taint (after the existing
PreferNoSchedule / cordon / not-ready / handled-key / role-prefix
filtering) as unschedulable. That's wrong for the common case where a
user has configured `kubernetes.pod_config.spec.tolerations` to tolerate
a custom taint on their nodes: jobs schedule fine, but the dashboard
shows the nodes as not-ready and `show-gpus` reports 0 available GPUs.

Add a thin layer that respects those configured tolerations when
deciding whether a remaining taint makes a node unschedulable for the
user's workloads:

- New `get_configured_tolerations(context)` reads
  `kubernetes.pod_config.spec.tolerations` via the existing
  `get_effective_region_config` plumbing (so per-context overrides Just
  Work). Returns `None` when nothing is configured, which makes
  downstream call sites byte-identical to today.
- New `taint_is_tolerated(taint, tolerations)` implements standard K8s
  toleration semantics (Equal vs. Exists; empty effect matches all;
  empty key + Exists is a wildcard).
- `V1Node.get_taints()` gains one optional `tolerations=` kwarg. When
  passed (a list, possibly empty), each retained taint dict gets a
  `'tolerated': bool` key. Existing `exclude_*` filtering is unchanged.
  Omitted / None preserves today's exact wire shape.
- `get_kubernetes_node_info()` and `_get_total_accelerators()` thread
  the configured tolerations through and recompute `node_is_tainted` as
  `any(not t.tolerated for t in node_taints)`. The separate `is_ready`
  / `is_cordoned` gating is untouched.
- Infra page dashboard: narrow the existing gray taint chip to
  un-tolerated taints; tolerated-only nodes fall into the existing
  green "Healthy" rendering. No new chip type.

Tested:
- Added 23 unit tests in
  `tests/unit_tests/kubernetes/test_kubernetes_utils.py` covering
  `taint_is_tolerated` (14 K8s-semantics rows), `get_configured_tolerations`
  (None / configured / empty / malformed), and
  `V1Node.get_taints(tolerations=...)` (no-tolerations is byte-identical;
  empty list marks all tolerated=False; matching toleration marks
  tolerated=True; exclude_* filtering still applies).
- Full kubernetes_utils unit-test file passes (165 tests).
- Dashboard builds clean with `npm run build` in `sky/dashboard`.
- Manual e2e verification on a live Kubernetes cluster (taint a node +
  configure a matching toleration + confirm `sky show-gpus` and Infra
  page reflect the node as Healthy) is planned before merge.
@rohansonecha rohansonecha force-pushed the sky-4496-tolerated-taints branch from 52c4697 to 85b0912 Compare May 29, 2026 00:14
@rohansonecha
Copy link
Copy Markdown
Collaborator Author

rohansonecha commented May 29, 2026

Addressed review comments in 85b0912265:

  1. gemini @ sky/provision/kubernetes/utils.py:1625 — YAML int/bool coercion: taint_is_tolerated now wraps every taint/toleration key, value, effect, and operator in str() before comparing. Added two unit tests (test_taint_is_tolerated_coerces_yaml_int_value_to_str, ..._bool_..._str) that fail without the coercion.

  2. devin @ sky/provision/kubernetes/utils.py:1585get_configured_tolerations doesn't replicate resolve_effective_pod_config merge semantics: Rewrote to fetch ('pod_config',) (the whole dict) instead of ('pod_config', 'spec', 'tolerations') (the leaf list). This triggers the K8s-specific dict merge (merge_k8s_configs) so per-context toleration lists are appended onto the global list rather than clobbering it — matching what an actual pod gets at scheduling time. Added two regression tests:

    • test_get_configured_tolerations_fetches_pod_config_dict — asserts we call get_effective_region_config with keys=('pod_config',), not the leaf path.
    • test_get_configured_tolerations_extracts_from_pod_config_dict — asserts we correctly extract spec.tolerations out of the merged dict.

devin-ai-integration[bot]

This comment was marked as resolved.

Comment thread sky/client/cli/command.py Outdated
Hoist the un-tolerated-taint computation above the `if taints:` guard
since an empty/None taints list already produces an empty list
comprehension, then drop the inner `if untolerated_taints:` since the
outer one already gates rendering. Equivalent behavior, one less level
of nesting.

Addresses review comment skypilot-org#9750 (discussion_r3321369713).
The previous `str(x or '')` idiom collapses Python-falsy values to
'' — `str(0 or '')` → `''`, `str(False or '')` → `''`. A YAML-parsed
`value: 0` toleration would therefore fail to match a K8s taint whose
value is the string `'0'`, falsely marking the node un-tolerated.

Switch to `'' if v is None else str(v)` (via a local helper) which
distinguishes 'absent' from 'set to a falsy value'. The Equal operator
default keeps its `or 'Equal'` since '' is not a valid operator.

Added regression tests for `value: 0` (int) and `value: false` (bool),
plus an inverse-collapse case (value-less taint, '0' toleration must
NOT match).

Addresses review comment skypilot-org#9750 (discussion_r3321364550).
Applied 5 cleanups from a /simplify pass:

1. Extracted `has_untolerated_taint(taints)` helper in
   `sky/provision/kubernetes/utils.py`. The
   `any(not t.get('tolerated', False) for t in taints)` predicate was
   copy-pasted in 3 Python sites (catalog `_get_total_accelerators`,
   utils `get_kubernetes_node_info`, CLI `_count_not_ready_gpus`) —
   collapsed to a single call. Added a parametrized unit test.

2. `get_configured_tolerations` now delegates to the existing
   `resolve_effective_pod_config(...)` instead of reimplementing the
   same `get_effective_region_config(keys=('pod_config',))` + dict
   lookup chain. Same K8s-specific dict-merge for per-context
   tolerations; fewer lines of duplication; relies on the canonical
   reader.

3. Moved `_str_or_empty` from a closure inside `taint_is_tolerated`
   to module scope. Avoids constructing a function object on every
   matcher call.

4. Parametrized the 5 `*_coerces_*` unit tests into one
   `test_taint_is_tolerated_str_coercion` with 5 ids
   (yaml-int / yaml-bool / yaml-int-zero / yaml-bool-false /
   inverse-collapse). Same regression coverage, ~40 LOC less.

5. Extracted `isNodeNotReadyForGpus(nodeData)` helper in
   `sky/dashboard/src/data/connectors/infra.jsx`. The
   isReady/isCordoned/isTainted/isNodeNotReady block was duplicated
   across two near-identical loops — both call sites now share one
   helper, eliminating drift risk.

178 unit tests pass; dashboard build clean.
Copy link
Copy Markdown
Collaborator

@kevinmingtarja kevinmingtarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once comments are addressed!

Comment thread sky/provision/kubernetes/utils.py
Comment thread sky/provision/kubernetes/utils.py
@kevinmingtarja
Copy link
Copy Markdown
Collaborator

/quicktest-core --kubernetes

…ntext

Two fixes from @kevinmingtarja:

1. `get_configured_tolerations` now passes `cloud=clouds.SSH()` to
   `resolve_effective_pod_config` for `ssh-<pool>` contexts.
   Previously it always passed `cloud=None`, which made
   `resolve_effective_pod_config` read the `kubernetes.*` config
   namespace and skip the `ssh-` prefix-stripping. For an SSH-pool
   context that meant SSH-scoped tolerations would be missed and a
   global `kubernetes.pod_config` toleration would leak onto SSH node
   health. Auto-detect via `context.startswith('ssh-')` to keep the
   helper's signature unchanged.

2. `get_kubernetes_node_info` direct-path branch now resolves
   `context=None` to `get_current_kube_config_context_name()` before
   reading tolerations. `get_kubernetes_nodes(context=None)`
   internally resolves to the current kube context, but
   `get_configured_tolerations(None)` would have skipped
   `context_configs` overrides — so a user with a per-context
   toleration on the current context would not see those nodes
   honored. The plugin branch above already resolves the same way.

Added two regression tests: `*_ssh_context_uses_ssh_namespace`
asserts the underlying `get_effective_region_config` is called with
`(cloud='ssh', region='cluster1')` for an `ssh-cluster1` context.

Addresses skypilot-org#9750 (discussion_r3322597523 + r3322597524).
@rohansonecha rohansonecha merged commit 476a60b into skypilot-org:master May 29, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants