@SoloJacobs
Contributor

Repeating my commit message for convenience:

Two commits have already been merged to address the flakiness of this test.
However, I can still reproduce the issue using:

go test -failfast -run "TestClusterJoinAndReconnect/TestJoinLeave" -count 600 ./cluster

An easy way to increase the failure rate is to increase CPU load, e.g.,

yes > /dev/null & yes > /dev/null & yes > /dev/null & yes > /dev/null &

On my machine the combination of these commands fails every time.
The underlying cause of the failure is that the test only waits for p2 to be ready, which does not tell us whether p has updated its memberlist. Waiting for NotifyJoin to be called ensures that p has updated its memberlist. The test is now slightly slower, 0.8 seconds on my machine.

Side-note

I am new to the project, and feedback is much appreciated. It was hard to find an approach that avoids spin looping and does not change the API of Peer. Also, the WaitReady and Settle calls are now redundant, since we actually wait for NotifyJoin, but leaving them in does no harm.

Fixes #3287

SoloJacobs and others added 2 commits October 27, 2025 18:49

Signed-off-by: Solomon Jacobs <[email protected]>
@SoloJacobs
Contributor Author

@gotjosh I think you can review this best.


Development

Successfully merging this pull request may close these issues.

Flaky Test: TestClusterJoinAndReconnect/TestJoinLeave flake
