Skip to content

Conversation

@pkazmierczak
Copy link
Contributor

@pkazmierczak pkazmierczak commented May 28, 2025

TestSingleAffinities never expected a node with affinity score set to 0 in
the set of returned nodes. However, since #25800, this can happen. What the
test should be checking for instead is that the node with the highest normalized
score has the right affinity.

Internal ref: https://hashicorp.atlassian.net/browse/NMD-797

This is the very first e2e test we run, and since I cannot reproduce
this problem, my guess is we might be checking the alloc metric score
too quickly on the GHA runner?
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like it's fixing the right problem. Isn't the allocation being created in the same plan that records the score metadata? If this is flaky, is it possible that we're scheduling onto a host with an unexpected score because of result limiter?

@pkazmierczak pkazmierczak marked this pull request as draft May 28, 2025 15:05
@pkazmierczak
Copy link
Contributor Author

result limiter

what's a result limiter?

@pkazmierczak
Copy link
Contributor Author

My suspicions are around client readiness. I cannot make this test fail no matter how hard I try if I manually run it against the e2e cluster. But it's the very first test that we run. Could it be that a dc1 node is just not ready when this is run? Or does it not make sense?

@tgross
Copy link
Member

tgross commented May 28, 2025

what's a result limiter?

Sorry, I mean the LimitIterator. The scheduler doesn't check all the nodes, a only checks a number up to a certain limit (2 for the generic scheduler for non-spread workloads) and then picks the best of those options. So I'm suggesting that we're iterating over 2 nodes, and finding that the best score includes a 0 for node-affinity rather than 1. Which strongly makes me think we're hitting a side-effect of #25800. It would be interesting to look at the rest of the scores.

My suspicions are around client readiness. I cannot make this test fail no matter how hard I try if I manually run it against the e2e cluster. But it's the very first test that we run. Could it be that a dc1 node is just not ready when this is run? Or does it not make sense?

That's possible but that still implies that it's about which particular set of nodes is ready and which order we're looking at them.

@pkazmierczak pkazmierczak changed the title e2e: retry on TestSingleAffinities e2e: correct TestSingleAffinities behavior May 30, 2025
@pkazmierczak pkazmierczak marked this pull request as ready for review May 30, 2025 16:25
@pkazmierczak pkazmierczak requested a review from tgross May 30, 2025 16:25
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pkazmierczak pkazmierczak merged commit 348177d into main May 30, 2025
34 checks passed
@pkazmierczak pkazmierczak deleted the b-affinity-single-e2e branch May 30, 2025 17:46
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants