
Conversation

@moko-poi
Contributor

Fixes #2009 (DaemonSet pods circle in FAILED state due to rescheduling to node which is terminated)

Description

This PR fixes an issue where DaemonSet pods (such as aws-node and kube-proxy) enter a Failed/Pending cycle during node disruption (consolidation/termination).

Root Cause:
When Karpenter marks a node for disruption, it applies the karpenter.sh/disrupted:NoSchedule taint to prevent new pods from scheduling. However, DaemonSet pods were being evicted during this process. When the DaemonSet controller attempts to recreate these pods, the NoSchedule taint prevents them from being scheduled back to the node, causing them to remain in Pending state. This creates a continuous cycle of pod failures and rescheduling attempts until the node is fully terminated.
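To make the mechanics concrete, here is a minimal sketch: the taint matches the one named above, while the toleration list is a hypothetical DaemonSet pod spec, not code from this PR. A pod whose tolerations do not cover the disruption taint cannot pass the scheduler's taint check once it is recreated:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The taint Karpenter applies to a node it has marked for disruption.
	disrupted := corev1.Taint{
		Key:    "karpenter.sh/disrupted",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Hypothetical tolerations on a DaemonSet pod; none of them cover the
	// disruption taint.
	tolerations := []corev1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists},
	}

	tolerated := false
	for i := range tolerations {
		if tolerations[i].ToleratesTaint(&disrupted) {
			tolerated = true
		}
	}
	// Prints false: after eviction, the recreated pod cannot schedule back
	// onto the tainted node and stays Pending until the node is gone.
	fmt.Println(tolerated)
}
```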

Solution:
Modified the IsEvictable() and IsDrainable() functions in pkg/utils/pod/scheduling.go to explicitly exclude DaemonSet pods from eviction during node disruption. This approach aligns with kubectl drain behavior, where DaemonSet pods remain running until the node is actually terminated, allowing the DaemonSet controller to naturally manage pod recreation on other available nodes.
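A rough sketch of the shape of this change (signatures simplified; existingEvictionChecks and existingDrainChecks are placeholders for the pre-existing conditions in pkg/utils/pod/scheduling.go, which the PR leaves intact; only the DaemonSet exclusion is new):

```go
package pod

import (
	corev1 "k8s.io/api/core/v1"
)

// IsOwnedByDaemonSet reports whether the pod has an apps/v1 DaemonSet owner
// reference. Karpenter already ships a helper with this name; this body is
// illustrative.
func IsOwnedByDaemonSet(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.APIVersion == "apps/v1" && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

// IsEvictable now excludes DaemonSet pods up front.
func IsEvictable(pod *corev1.Pod) bool {
	return !IsOwnedByDaemonSet(pod) && existingEvictionChecks(pod)
}

// IsDrainable gets the same exclusion.
func IsDrainable(pod *corev1.Pod) bool {
	return !IsOwnedByDaemonSet(pod) && existingDrainChecks(pod)
}

// Placeholders for the pre-existing logic, not part of the change.
func existingEvictionChecks(pod *corev1.Pod) bool { return true }
func existingDrainChecks(pod *corev1.Pod) bool    { return true }
```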

Changes:

  • Added !IsOwnedByDaemonSet(pod) check to IsEvictable() function
  • Added !IsOwnedByDaemonSet(pod) check to IsDrainable() function
  • Added test case to verify DaemonSet pods are not evicted during disruption
  • Updated existing test case for DaemonSet pods with PDBs to reflect the new behavior

How was this change tested?

  1. Unit Tests: Added test cases in both the disruption and termination controllers (a minimal sketch of the core assertion follows this list):

    • should not evict daemonset pods during node disruption - Verifies DaemonSet pods remain running when disruption taint is applied
    • should consider candidates with only daemonset pods - Verifies nodes with only DaemonSet pods can be disrupted
    • Updated should consider candidates that have fully blocking PDBs on daemonset pods - Verifies PDBs don't block disruption when only DaemonSet pods are present
  2. Test Execution: All existing tests pass with these changes:

    make test FOCUS="daemonset"
    
    • 3 DaemonSet-related tests: PASSED
    • All pkg tests: PASSED with no failures
  3. Race Detection: Tests were executed with the -race flag to ensure no data races were introduced
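As a minimal illustration of the core assertion, written against the sketch above rather than the PR's actual Ginkgo suite (the pod name is hypothetical):

```go
package pod

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestDaemonSetPodIsNotEvictable(t *testing.T) {
	// A pod owned by an apps/v1 DaemonSet, like aws-node or kube-proxy.
	dsPod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "aws-node-x7k2p",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "apps/v1",
				Kind:       "DaemonSet",
				Name:       "aws-node",
			}},
		},
	}
	if IsEvictable(dsPod) {
		t.Fatalf("expected DaemonSet pod %q to be excluded from eviction", dsPod.Name)
	}
	if IsDrainable(dsPod) {
		t.Fatalf("expected DaemonSet pod %q to be excluded from drain", dsPod.Name)
	}
}
```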

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the do-not-merge/invalid-commit-message (PR should not merge because it has an invalid commit message), cncf-cla: yes (the PR's author has signed the CNCF CLA), and needs-ok-to-test (requires an org member to verify it is safe to test) labels on Dec 20, 2025
@k8s-ci-robot
Contributor

Hi @moko-poi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Dec 20, 2025
@moko-poi force-pushed the fix/issue-2009-daemonset-disruption branch from 61a2634 to 9038648 on December 20, 2025 at 12:50
@k8s-ci-robot removed the do-not-merge/invalid-commit-message label on Dec 20, 2025
@moko-poi force-pushed the fix/issue-2009-daemonset-disruption branch from 9038648 to 39d56e1 on December 21, 2025 at 03:24
@coveralls

Pull Request Test Coverage Report for Build 20403988644


  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.06%) to 80.275%

Files with coverage reduction (new missed lines, resulting %):

  • pkg/controllers/node/termination/terminator/terminator.go: 2 new missed lines, 90.2%
  • pkg/controllers/provisioning/scheduling/preferences.go: 7 new missed lines, 88.76%

Totals:

  • Change from base Build 20384548641: -0.06%
  • Covered Lines: 11961
  • Relevant Lines: 14900

💛 - Coveralls

@jmdeal
Member

jmdeal commented Jan 5, 2026

When the DaemonSet controller attempts to recreate these pods, the NoSchedule taint prevents them from being scheduled back to the node, causing them to remain in Pending state.

If I understand correctly, this is the core issue, right? The issue isn't that daemonsets are being evicted, it's that some daemonset pods are being recreated and entering a pending / failed state. What's not clear to me is why they're being recreated - the daemonset controller should only create a pod for a node if it tolerates the taints. If it does tolerate the taint, Karpenter shouldn't have disrupted the pod in the first place due to the existing IsEvictable check.

I think there are probably use-cases where we want to drain daemonsets. Some daemonsets perform resource cleanup which may not be possible once the node is terminating. An example that comes to mind is the EBS CSI driver which cleans up VolumeAttachment objects during termination. For this reason I don't think we'd want to exclude daemonsets from the drain process altogether, but we should identify the root cause for these pods being recreated.
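To make the toleration point concrete (a hypothetical DaemonSet toleration, not something in this PR): a pod carrying a toleration for the disruption taint would still schedule onto the tainted node, and by the same token the existing IsEvictable logic referenced above would have skipped it during drain:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	disrupted := corev1.Taint{
		Key:    "karpenter.sh/disrupted",
		Effect: corev1.TaintEffectNoSchedule,
	}
	// Hypothetical toleration a DaemonSet could carry to keep its pods
	// schedulable on nodes Karpenter has marked for disruption.
	tol := corev1.Toleration{
		Key:      "karpenter.sh/disrupted",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	}
	fmt.Println(tol.ToleratesTaint(&disrupted)) // true
}
```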
