Skip to content

Conversation

@hank95179
Copy link
Contributor

@hank95179 hank95179 commented Dec 12, 2025

Issue link

This Pull Request is linked to issue (URL): #4950

Description

This PR addresses the flakiness observed in test_tls_with_multiple_certificates_succeeds.
Similar to issue #4946, the TLS handshake process occasionally times out in high-load environments (such as CI/CD) due to resource contention, leading to ClosingError: ... timed out failures.

Solution

Implemented a retry mechanism for client creation in tests/async_tests/test_tls_certificates.py.

  • Implemented a retry mechanism for client creation in tests/async_tests/test_tls_certificates.py.
    The test now attempts to establish the connection up to 3 times with an exponential backoff (2^i seconds) before failing.
  • Applied this logic to both Cluster and Standalone modes to ensure stability.

Verification

Verified that the retry mechanism correctly handles transient connection failures, allowing the test to pass in environments with varying latency.

@hank95179 hank95179 requested a review from a team as a code owner December 12, 2025 08:12
@hank95179 hank95179 force-pushed the fix-issue-4950-tls-timeout branch from f09406e to 004a136 Compare December 12, 2025 08:36
@hank95179 hank95179 force-pushed the fix-issue-4950-tls-timeout branch from 004a136 to 7c8eb24 Compare December 13, 2025 10:29
@xShinnRyuu
Copy link
Collaborator

In your description under Solution you say the fix has a 1-second backoff. Your implementation uses time.sleep(2**i) within your retry loop, which is a duration equal to 2 raised to the power of i seconds.

I'm unsure if that is what was intended but an exponential backoff is usually the norm for retry. Ideally, its a not that high of an exponential, because if there were more retires it would take a very long time. But three retries is fine here with base 2 exponential.

Please update your description, or adjust your implementation. Otherwise, it generally looks good.

@hank95179
Copy link
Contributor Author

@xShinnRyuu Thank you! I've updated the description.

Copy link
Collaborator

@xShinnRyuu xShinnRyuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@currantw currantw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with one suggestion. Thanks for helping with this!

advanced_config=cluster_advanced_config,
)
client = await GlideClusterClient.create(cluster_config)
for i in range(3):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Might be worth extracting to a helper and adding a comment explaining why this retry mechanism is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants