Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(source): optional exclusion of unschedulable nodes #5045

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Hayajiro
Copy link

Description

This fixes a behavioral regression introduced in #4761, where nodes that were previously added to DNS are removed when they are considered unschedulable, for example due to automated maintenance tasks.

This change will introduce a new flag called exclude-unschedulable, which defaults to true in order to keep in line with the current behavior. However, it would also be reasonable to restore the initial behavior before #4761, which would mean to not exclude unschedulable nodes by default.

Fixes #4991

Checklist

  • Unit tests updated
  • End user documentation updated

Copy link

linux-foundation-easycla bot commented Jan 29, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jan 29, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @Hayajiro!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 29, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @Hayajiro. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 29, 2025
@mloiseleur
Copy link
Collaborator

@Hayajiro you need to sign the CLA in order for this PR to be reviewed.

@Hayajiro
Copy link
Author

@Hayajiro you need to sign the CLA in order for this PR to be reviewed.

I'm fully aware of this, but CLA signing is still in process.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 5, 2025
@Hayajiro Hayajiro force-pushed the 4991-configurable-handling-of-node-notready branch from 44aaa51 to 66fcece Compare March 5, 2025 09:41
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 5, 2025
@Hayajiro
Copy link
Author

Hayajiro commented Mar 5, 2025

CLA is finally signed. 🎉 Also rebased with master and fixed the conflicts.

So I guess this is ready for review now?

@ivankatliarchuk
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 6, 2025
@@ -102,7 +104,7 @@ func (ns *nodeSource) Endpoints(ctx context.Context) ([]*endpoint.Endpoint, erro
continue
}

if node.Spec.Unschedulable {
if node.Spec.Unschedulable && ns.excludeUnschedulable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests that capture debug logging?

func LogsToBuffer(level log.Level, t *testing.T) *bytes.Buffer {

You could search for example usages in the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response, but I've been pretty busy the last couple weeks.

I've now added an example of how this could be implemented in the existing test method (which would allow the other testcases to check for logs, as well. Not quite ideal, as this requires disabling parallelism for these testcases, but the alternative would be having to duplicate more of the test logic, which I'm not really a fan of, either.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2025
@Hayajiro Hayajiro force-pushed the 4991-configurable-handling-of-node-notready branch from 66fcece to 3d75a27 Compare March 24, 2025 13:23
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 24, 2025
@Hayajiro
Copy link
Author

Not sure why there are apparently data races, as I can't really reproduce them locally...

@ivankatliarchuk
Copy link
Contributor

Feels like our logger is not implemented correctly for tests. Could you try something like here

if len(tt.logExpectations) == 0 {
and here
b = testutils.LogsToBuffer(log.DebugLevel, t)

If not going to work, will test logging next time, when I have a more stable framework

@Hayajiro
Copy link
Author

Hayajiro commented Mar 25, 2025

When running with make test PARALLEL=64 ITERATION=1000, I was able to reproduce the CI failure, allowing me to take a closer look.

Turns out the problem lies with the fact that, after calling to LogsToBuffer, we don't reset the output buffers, which means we're carrying around stale references. These, in turn, seem to sometimes cause these data races (which seem to only occur if tests are run with enough parallelization and iterations).

When calling to {,k}log.SetOutput(io.Discard), these races seem to no longer occur.

@ivankatliarchuk How do you want to proceed with this? While the fix itself seems to be fairly simple, I'm not quite sure if this is still in scope for this PR.

Edit: Okay, now I can't reproduce this locally anymore and the CI is also green. I feel like this is definitely a flake.

@ivankatliarchuk
Copy link
Contributor

ivankatliarchuk commented Mar 25, 2025

Yeah, is another bug that not related to current work/fix.

Copy link
Contributor

@ivankatliarchuk ivankatliarchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

cc: @mloiseleur

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ivankatliarchuk
Once this PR has been reviewed and has the lgtm label, please assign raffo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2025
This fixes a behavioral regression introduced in kubernetes-sigs#4761, where
nodes that were previously added to DNS are removed when they are considered
unschedulable, for example due to automated maintenance tasks.

This change will introduce a new flag called `exclude-unschedulable`, which
defaults to `true` in order to keep in line with the current behavior.
However, it would also be reasonable to restore the initial behavior before
This commit adds the required logic to be able to test for
the existence (and absence) of certain log messages
in testNodeSourceEndpoints. As an example, this is implemented
for the tests around excludeUnschedulable.

A side effect of using LogsToBuffer is that tests can't run in
parallel due to the log buffer being shared across all
parallel test cases. As such, these specific tests are now executed
one after another.
@Hayajiro Hayajiro force-pushed the 4991-configurable-handling-of-node-notready branch from 2b4e844 to 8f61c0f Compare March 27, 2025 08:14
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 27, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2025
@ivankatliarchuk
Copy link
Contributor

Hi @Hayajiro any chance not to force push? I'm not sure now what was reviewed

@ivankatliarchuk
Copy link
Contributor

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 27, 2025
@Hayajiro
Copy link
Author

Hi @Hayajiro any chance not to force push? I'm not sure now what was reviewed

Well, it was stated that the MR needs rebasing, sooo....

@mloiseleur
Copy link
Collaborator

/retitle feat(source): optional exclusion of unschedulable nodes

@k8s-ci-robot k8s-ci-robot changed the title feat(source/node): Make exclusion of unschedulable Nodes configurable feat(source): optional exclusion of unschedulable nodes Mar 27, 2025
@mloiseleur
Copy link
Collaborator

@Hayajiro PR LGTM. Documentation was updated in #4761. Before we merge it, would you please update node source documentation accordingly, explaining the use case ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configurable handling of node NotReady
4 participants