[Kubernetes] Respect configured pod tolerations in node-health display#9750
Conversation
There was a problem hiding this comment.
Devin Review found 2 potential issues.
🐛 2 issues in files not directly in the diff
🐛 CLI _count_not_ready_gpus not updated to respect tolerated flag, inflating "not ready" GPU counts (sky/client/cli/command.py:4233)
The PR updated the taint-checking logic in kubernetes_catalog.py:220 and utils.py:4165 to use any(not t.get('tolerated', False) for t in node_taints), but the CLI at sky/client/cli/command.py:4233 still uses len(node_taints) > 0. After this PR, get_kubernetes_node_info() stores taint dicts with 'tolerated': True in KubernetesNodeInfo.taints for tolerated taints. The CLI's _count_not_ready_gpus counts these nodes as "not ready" even though the backend treats them as schedulable. This causes sky show-gpus --cloud kubernetes to show spurious "(N not ready)" annotations and reduced effective capacity for users with configured pod tolerations.
⚠️ CLI per-node status display shows tolerated taints as unhealthy indicators (sky/client/cli/command.py:4474-4490)
The per-node table in the CLI (sky/client/cli/command.py:4474-4490) shows all taints in the status column without filtering out tolerated ones. A node whose only taints are tolerated would be displayed with taint status (e.g., NoSchedule Taint [workload_pool]) instead of Healthy. This is inconsistent with both the dashboard (which filters to untoleratedTaints at sky/dashboard/src/components/infra.jsx:888-889) and the backend logic. Users see a node flagged as unhealthy in the CLI while the same node shows as healthy in the dashboard.
View 3 additional findings in Devin Review.
6542c05 to
8fe9e3b
Compare
8fe9e3b to
52c4697
Compare
Today the dashboard's Infra page, `sky show-gpus`, and any caller of `get_kubernetes_node_info()` treats *any* node taint (after the existing PreferNoSchedule / cordon / not-ready / handled-key / role-prefix filtering) as unschedulable. That's wrong for the common case where a user has configured `kubernetes.pod_config.spec.tolerations` to tolerate a custom taint on their nodes: jobs schedule fine, but the dashboard shows the nodes as not-ready and `show-gpus` reports 0 available GPUs. Add a thin layer that respects those configured tolerations when deciding whether a remaining taint makes a node unschedulable for the user's workloads: - New `get_configured_tolerations(context)` reads `kubernetes.pod_config.spec.tolerations` via the existing `get_effective_region_config` plumbing (so per-context overrides Just Work). Returns `None` when nothing is configured, which makes downstream call sites byte-identical to today. - New `taint_is_tolerated(taint, tolerations)` implements standard K8s toleration semantics (Equal vs. Exists; empty effect matches all; empty key + Exists is a wildcard). - `V1Node.get_taints()` gains one optional `tolerations=` kwarg. When passed (a list, possibly empty), each retained taint dict gets a `'tolerated': bool` key. Existing `exclude_*` filtering is unchanged. Omitted / None preserves today's exact wire shape. - `get_kubernetes_node_info()` and `_get_total_accelerators()` thread the configured tolerations through and recompute `node_is_tainted` as `any(not t.tolerated for t in node_taints)`. The separate `is_ready` / `is_cordoned` gating is untouched. - Infra page dashboard: narrow the existing gray taint chip to un-tolerated taints; tolerated-only nodes fall into the existing green "Healthy" rendering. No new chip type. Tested: - Added 23 unit tests in `tests/unit_tests/kubernetes/test_kubernetes_utils.py` covering `taint_is_tolerated` (14 K8s-semantics rows), `get_configured_tolerations` (None / configured / empty / malformed), and `V1Node.get_taints(tolerations=...)` (no-tolerations is byte-identical; empty list marks all tolerated=False; matching toleration marks tolerated=True; exclude_* filtering still applies). - Full kubernetes_utils unit-test file passes (165 tests). - Dashboard builds clean with `npm run build` in `sky/dashboard`. - Manual e2e verification on a live Kubernetes cluster (taint a node + configure a matching toleration + confirm `sky show-gpus` and Infra page reflect the node as Healthy) is planned before merge.
52c4697 to
85b0912
Compare
|
Addressed review comments in
|
Hoist the un-tolerated-taint computation above the `if taints:` guard since an empty/None taints list already produces an empty list comprehension, then drop the inner `if untolerated_taints:` since the outer one already gates rendering. Equivalent behavior, one less level of nesting. Addresses review comment skypilot-org#9750 (discussion_r3321369713).
The previous `str(x or '')` idiom collapses Python-falsy values to '' — `str(0 or '')` → `''`, `str(False or '')` → `''`. A YAML-parsed `value: 0` toleration would therefore fail to match a K8s taint whose value is the string `'0'`, falsely marking the node un-tolerated. Switch to `'' if v is None else str(v)` (via a local helper) which distinguishes 'absent' from 'set to a falsy value'. The Equal operator default keeps its `or 'Equal'` since '' is not a valid operator. Added regression tests for `value: 0` (int) and `value: false` (bool), plus an inverse-collapse case (value-less taint, '0' toleration must NOT match). Addresses review comment skypilot-org#9750 (discussion_r3321364550).
Applied 5 cleanups from a /simplify pass:
1. Extracted `has_untolerated_taint(taints)` helper in
`sky/provision/kubernetes/utils.py`. The
`any(not t.get('tolerated', False) for t in taints)` predicate was
copy-pasted in 3 Python sites (catalog `_get_total_accelerators`,
utils `get_kubernetes_node_info`, CLI `_count_not_ready_gpus`) —
collapsed to a single call. Added a parametrized unit test.
2. `get_configured_tolerations` now delegates to the existing
`resolve_effective_pod_config(...)` instead of reimplementing the
same `get_effective_region_config(keys=('pod_config',))` + dict
lookup chain. Same K8s-specific dict-merge for per-context
tolerations; fewer lines of duplication; relies on the canonical
reader.
3. Moved `_str_or_empty` from a closure inside `taint_is_tolerated`
to module scope. Avoids constructing a function object on every
matcher call.
4. Parametrized the 5 `*_coerces_*` unit tests into one
`test_taint_is_tolerated_str_coercion` with 5 ids
(yaml-int / yaml-bool / yaml-int-zero / yaml-bool-false /
inverse-collapse). Same regression coverage, ~40 LOC less.
5. Extracted `isNodeNotReadyForGpus(nodeData)` helper in
`sky/dashboard/src/data/connectors/infra.jsx`. The
isReady/isCordoned/isTainted/isNodeNotReady block was duplicated
across two near-identical loops — both call sites now share one
helper, eliminating drift risk.
178 unit tests pass; dashboard build clean.
kevinmingtarja
left a comment
There was a problem hiding this comment.
LGTM once comments are addressed!
|
/quicktest-core --kubernetes |
…ntext Two fixes from @kevinmingtarja: 1. `get_configured_tolerations` now passes `cloud=clouds.SSH()` to `resolve_effective_pod_config` for `ssh-<pool>` contexts. Previously it always passed `cloud=None`, which made `resolve_effective_pod_config` read the `kubernetes.*` config namespace and skip the `ssh-` prefix-stripping. For an SSH-pool context that meant SSH-scoped tolerations would be missed and a global `kubernetes.pod_config` toleration would leak onto SSH node health. Auto-detect via `context.startswith('ssh-')` to keep the helper's signature unchanged. 2. `get_kubernetes_node_info` direct-path branch now resolves `context=None` to `get_current_kube_config_context_name()` before reading tolerations. `get_kubernetes_nodes(context=None)` internally resolves to the current kube context, but `get_configured_tolerations(None)` would have skipped `context_configs` overrides — so a user with a per-context toleration on the current context would not see those nodes honored. The plugin branch above already resolves the same way. Added two regression tests: `*_ssh_context_uses_ssh_namespace` asserts the underlying `get_effective_region_config` is called with `(cloud='ssh', region='cluster1')` for an `ssh-cluster1` context. Addresses skypilot-org#9750 (discussion_r3322597523 + r3322597524).
Summary
Today the dashboard's Infra page,
sky show-gpus, and any caller ofget_kubernetes_node_info()flag a node as not-ready (and report 0 free GPUs) whenever it carries any taint that isn't already in the silently-filtered set (cordon, not-ready,PreferNoSchedule, GPU/TPU resource keys, role prefixes). This is wrong for the common case where the user has tolerated a custom taint viakubernetes.pod_config.spec.tolerations: jobs schedule fine, but the dashboard still says the node is broken.This PR layers user-toleration awareness on top of the existing taint extraction:
get_configured_tolerations(context)readskubernetes.pod_config.spec.tolerationsviaget_effective_region_config, mirroring the existingget_allowed_nodes_config. ReturnsNonewhen nothing is configured so downstream call sites are byte-identical to today.taint_is_tolerated(taint, tolerations)implements standard Kubernetes toleration semantics (Equalvs.Exists; empty effect matches all; empty key +Existsis a wildcard).V1Node.get_taints(...)gains one optionaltolerations=kwarg. When passed, each retained taint dict gets a'tolerated': boolkey. Existingexclude_*filtering is unchanged.get_kubernetes_node_info()and_get_total_accelerators()thread the configured tolerations through and computenode_is_tainted = any(not t.tolerated for t in node_taints). The separateis_ready/is_cordonedgating is intentionally untouched — a cordoned or not-ready node still reports 0 free GPUs and keeps its dedicated status badge.Test plan
Also runnable via /quicktest-core and /smoke-test --kubernetes on this PR.