
Fix Flaky Tests#28

Merged
ikolomi merged 9 commits into valkey-io:main from jeremyprime:flaky-tests
Nov 28, 2025
Conversation

@jeremyprime
Collaborator

@jeremyprime jeremyprime commented Nov 25, 2025

Summary

Updates the following flaky tests that sometimes fail:

  • ValkeyGlideClusterConnectionCommandsIntegrationTests.testClusterGetClusterInfo:278 (expects 4, but gets 5)
  • ReactiveValkeyMessageListenerContainerIntegrationTests.multipleListenShouldTrackSubscriptions (expects null, but gets message)
  • DefaultHyperLogLogOperationsIntegrationTests.sizeShouldCountValuesCorrectly:96 (expects 3, but gets 2)
  • ClusterSlotHashUtilsTests.localCalculationShouldMatchServers:60 » JedisConnection Unexpected end of stream. (connection is stale/already closed)

Note there is another change in #23 to fix the Maven cache, which may help with network-related flaky tests:

  • Could not transfer artifact org.jetbrains.kotlin:kotlin-compiler:jar:1.9.25 from/to central
  • gzip: stdin: not in gzip format when getting engine binary

Closes #26.

Testing

Tests pass locally and have passed in CI a few times (these tests failed only intermittently, so they will need to be observed over time).

Signed-off-by: Jeremy Parr-Pearson <jeremy.parr-pearson@improving.com>
@jeremyprime jeremyprime requested a review from ikolomi November 25, 2025 22:19

// Verify node count
assertThat(clusterInfo.getKnownNodes()).isEqualTo(EXPECTED_TOTAL_NODES);
// Wait for all cluster nodes to be available
Collaborator

I don't like retrying in the tests - we need to understand why it happens...

Collaborator Author

I'm not sure of the underlying reason, but getKnownNodes sometimes reports a higher count than it should (5 instead of 4). The node count appears to be temporarily inconsistent while the cluster is initializing. A short sleep or retry eventually settles on the expected number of nodes (4).
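The "poll until the cluster view converges" idea can be sketched as a small self-contained helper. This is an illustrative stand-in, not the project's test code: `pollUntilEquals` and the plain `IntSupplier` are hypothetical, and the real test would poll `clusterInfo.getKnownNodes()` against the expected node count instead.

```java
import java.util.function.IntSupplier;

public class ClusterSettle {
    // Poll the supplier until it returns the expected value or the timeout
    // elapses; returns whether the value settled on the expected number.
    static boolean pollUntilEquals(IntSupplier actual, int expected, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (actual.getAsInt() == expected) {
                return true; // view has converged
            }
            try {
                Thread.sleep(50); // brief back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return actual.getAsInt() == expected;
    }

    public static void main(String[] args) {
        // Simulate a node count that is temporarily inflated (5) before
        // settling to the expected 4, as described above.
        long start = System.currentTimeMillis();
        IntSupplier knownNodes = () -> (System.currentTimeMillis() - start < 200) ? 5 : 4;
        System.out.println(pollUntilEquals(knownNodes, 4, 5000)); // true
    }
}
```

Compared with a fixed sleep, this bounds the wait but returns as soon as the count is correct, so the happy path stays fast.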

assertThat(c1Collector.poll(5, TimeUnit.SECONDS)).isNotNull();
assertThat(c2Collector.poll(100, TimeUnit.MILLISECONDS)).isNull();
// Wait for active subscription to receive message
await().atMost(Duration.ofSeconds(5))
Collaborator

Let's not use .untilAsserted in the test unless we understand the root cause.

Collaborator Author

It takes time for the message to propagate and for the disposed subscriber to stop listening, which is why the original test had a 0.5 s sleep. Sometimes that is not enough, and the disposed subscriber still receives the message when it should not. A longer sleep or a retry allows the subscription cleanup to complete before the message is sent.
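The assertions quoted above hinge on poll-with-timeout semantics: `poll` waits up to the timeout for a message and returns null if none arrives. A minimal sketch of that behavior, using plain `LinkedBlockingQueue` collectors as stand-ins for the test's message collectors (assumed names, not the project's code):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollSemantics {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> active = new LinkedBlockingQueue<>();
        LinkedBlockingQueue<String> disposed = new LinkedBlockingQueue<>();

        // Deliver a message asynchronously to the still-active subscription only.
        new Thread(() -> {
            try {
                Thread.sleep(100);
                active.offer("message");
            } catch (InterruptedException ignored) { }
        }).start();

        // Active subscription: poll blocks until the message arrives (non-null).
        System.out.println(active.poll(5, TimeUnit.SECONDS));          // message
        // Disposed subscription: nothing arrives, poll times out, returns null.
        System.out.println(disposed.poll(100, TimeUnit.MILLISECONDS)); // null
    }
}
```

The flakiness arises when subscription teardown has not finished before the publish: the "disposed" queue then still receives the message, and the second poll is non-null.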

}
}

jedis.close();
Collaborator

Why is this code problematic? Shouldn't the jedis object be explicitly closed?

Collaborator Author

This issue happens less often, but it seems to be caused by the connection pool occasionally closing the connection prematurely, so the test fails when jedis.close() is explicitly called on an already-closed connection. Try-with-resources closes the connection without throwing an error if it was already closed.
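The underlying pattern can be sketched with a stand-in `Connection` class (hypothetical, not the Jedis API): an idempotent close means a connection that the pool already reclaimed can still be closed safely on scope exit.

```java
public class SafeClose {
    // Stand-in for a pooled connection; close() is idempotent by design.
    static class Connection implements AutoCloseable {
        private boolean closed = false;

        // The pool may close the connection before the test does.
        void closeFromPool() { closed = true; }

        @Override
        public void close() {
            if (closed) {
                return; // already closed elsewhere: no-op instead of throwing
            }
            closed = true;
        }
    }

    public static void main(String[] args) {
        Connection pooled = new Connection();
        // try-with-resources guarantees close() runs on scope exit, even if
        // the pool already reclaimed the connection in the meantime.
        try (Connection c = pooled) {
            pooled.closeFromPool(); // simulate the pool closing it prematurely
        } // c.close() here is a safe no-op
        System.out.println("no exception thrown");
    }
}
```

The key point is that the double-close becomes harmless, rather than a stale-connection error surfacing at the explicit `close()` call.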

@ikolomi
Collaborator

ikolomi commented Nov 26, 2025

  1. I would not touch the existing bugs in SpringDataRedis until we stabilize, or at least until after alpha.
  2. .untilAsserted() should be used as a last resort - why do we even need to retry? Let's try to fix the root cause.

@jeremyprime
Collaborator Author

  1. I would not touch the existing bugs in SpringDataRedis until we stabilize, or at least until after alpha.
  2. .untilAsserted() should be used as a last resort - why do we even need to retry? Let's try to fix the root cause.

Most failing tests seem to be timing issues that only occur in our CI. ValkeyGlideClusterConnectionCommandsIntegrationTests.testClusterGetClusterInfo fails most often, but some of the existing spring-data-redis tests fail frequently as well, especially ReactiveValkeyMessageListenerContainerIntegrationTests.multipleListenShouldTrackSubscriptions.

I replaced untilAsserted with a sleep (or just increased the existing sleep) to simplify things.

@ikolomi ikolomi merged commit 36e7276 into valkey-io:main Nov 28, 2025
106 checks passed
@jeremyprime jeremyprime deleted the flaky-tests branch December 15, 2025 19:11


Development

Successfully merging this pull request may close these issues.

ValkeyGlideClusterConnectionCommandsIntegrationTests.testClusterGetClusterInfo is flaky with github runner
