HttpTimeoutPolicy: Fixes aggressive 500ms first-attempt timeout in HttpTimeoutPolicyControlPlaneRetriableHotPath#5816
Open
NaluTripician wants to merge 2 commits intomainfrom
Open
Conversation
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
2 similar comments
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…tpTimeoutPolicyControlPlaneRetriableHotPath Raises the first-attempt timeout for HttpTimeoutPolicyControlPlaneRetriableHotPath from 500ms to 1s, aligning with the precedent set by HttpTimeoutPolicyForThinClient (#5496) and HttpTimeoutPolicyForPartitionFailover (#5484). The original 500ms value was too aggressive for .NET 10's HttpConnectionPool behavior and any environment with moderate network latency, producing spurious TaskCanceledExceptions that the SDK then retried successfully but at the cost of wasted work and noisy customer telemetry. The 5s and 65s tail attempts are preserved to keep the existing retry budget for genuinely slow control-plane operations. Fixes #5642 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1eab579 to
9af809d
Compare
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Raises the first-attempt timeout for
HttpTimeoutPolicyControlPlaneRetriableHotPathfrom 500ms to 1s, aligning with the precedent set byHttpTimeoutPolicyForThinClient(#5496) andHttpTimeoutPolicyForPartitionFailover(#5484). This eliminates the spuriousTaskCanceledExceptionretries customers see on .NET 10 and in environments with moderate network latency.Issue
Fixes #5642
Root Cause
HttpTimeoutPolicyControlPlaneRetriableHotPath.TimeoutsAndDelaysused(0.5s, 5s, 65s)as its retry timeout sequence. The 500ms first-attempt budget is too tight for:The result is a
TaskCanceledExceptionon attempt 1, an immediate (successful) retry on attempt 2, and noisy telemetry / wasted work even though nothing is actually failing. The reporter independently verified that raising the first-attempt timeout eliminates the issue.The same class of bug was previously fixed in two sibling policies:
HttpTimeoutPolicyForThinClient:(0.5s, 1s, 5s)→(1s, 6s, 6s)HttpTimeoutPolicyForPartitionFailoverto Use More Relaxed Timeouts for Query Workloads #5484 —HttpTimeoutPolicyForPartitionFailover: relaxed similarlyHttpTimeoutPolicyControlPlaneRetriableHotPathwas missed by those PRs and still carries the original aggressive value.Fix
Change the first-attempt timeout from
0.5sto1s. Leave the 5s and 65s tail attempts unchanged so the overall retry budget for genuinely slow control-plane operations is preserved.Alternatives Considered
(6s, 6s, 10s)matchingHttpTimeoutPolicyForPartitionFailover— rejected because it shrinks the total retry budget from ~70s to ~22s. Hot-path control-plane calls (address resolution under high load) can legitimately need the 65s tail.(6s, 6s, 65s)per the reporter''s suggestion — rejected as a larger behavior change than necessary. The SDK team''s most recent precedent ([Fundamentals] HttpTimeoutPolicy: Update the HttpTimeoutPolicyForThinClient to Use More Relaxed Timeouts #5496) settled on1sfor the first-attempt timeout in similar circumstances. Bumping to 1s removes the spurious cancellations while keeping the change surgical.How the Fix Works
HttpTimeoutPolicyControlPlaneRetriableHotPath.Instanceis consumed byGatewayAddressCachefor control-plane address-resolution calls. On each attempt,CosmosHttpClient.SendHttpAsyncreads the next(requestTimeout, delayForNextRequest)tuple from the policy and usesrequestTimeoutas the per-requestCancellationTokenSourcebudget. Raising the first tuple''srequestTimeoutfrom 500ms to 1s gives the very first request a more realistic budget; if it still times out, the existing retry path runs unchanged.Testing
CosmosHttpClientCoreTests(17/17 passing locally, 42s)RetryTransientIssuesTestAsyncso the mock delay for the first hot-path attempt (2s) still exceeds the new 1s timeout, preserving the original test intent (verifying that the cancellation token actually fires for each attempt).Risk Assessment
Cross-SDK Impact
This is a .NET-only timeout policy class. No equivalent class with the same 500ms hard-coded value was found in the Java/Python/Go/JS/Rust Cosmos SDKs in prior triage of #5496 and #5484. No companion PRs needed.