Skip to content

Conversation

@hank95179
Copy link
Contributor

@hank95179 hank95179 commented Dec 12, 2025

Issue link

This Pull Request is linked to issue (URL): #4946

Description

This PR addresses the flakiness observed in test_tls_with_self_signed_certificate_succeeds.
In high-load environments (such as CI/CD), the TLS handshake process occasionally times out due to resource contention, causing ClosingError: ... timed out failures.

Solution

Implemented a retry mechanism for client creation in tests/async_tests/test_tls_certificates.py.

  • Implemented a retry mechanism for client creation in tests/async_tests/test_tls_certificates.py.
    The test now attempts to establish the connection up to 3 times with an exponential backoff (2^i seconds) before failing.
  • Applied this logic to both Cluster and Standalone modes.

Verification

Verified locally by simulating connection latency. The retry logic successfully allows the test to pass even when initial connection attempts fail due to transient timeouts.

@hank95179 hank95179 requested a review from a team as a code owner December 12, 2025 08:09
@hank95179 hank95179 force-pushed the fix-issue-4946-tls-timeout branch from e575f32 to 8f22339 Compare December 12, 2025 08:39
@hank95179 hank95179 force-pushed the fix-issue-4946-tls-timeout branch from 8f22339 to 226131b Compare December 13, 2025 10:26
@xShinnRyuu
Copy link
Collaborator

xShinnRyuu commented Dec 17, 2025

In your description under Solution you say the fix has a 1-second backoff. Your implementation uses time.sleep(2**i) within your retry loop, which is a duration equal to 2 raised to the power of i seconds.

I'm unsure if that is what was intended but an exponential backoff is usually the norm for retry. Ideally, its a not that high of an exponential, because if there were more retires it would take a very long time. But three retries is fine here with base 2 exponential.

Please update your description, or adjust your implementation. Otherwise, it generally looks good.

@hank95179
Copy link
Contributor Author

@xShinnRyuu Thank you! I've fixed the description.

Copy link
Collaborator

@xShinnRyuu xShinnRyuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@currantw currantw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with one suggestion. Thanks for your help!

Comment on lines +84 to +91
for i in range(3):
try:
client = await GlideClusterClient.create(cluster_config)
break
except Exception:
if i == 2:
raise
time.sleep(2**i)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Similar to other pull request, I think it might make sense to extract to a helper and add a comment explaining why this is needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hank95179 , I agree with @currantw here, especially given this is happening across multiple tests. Would you be able to add the helpers in one of your PRs and use them in the other.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants