HttpTimeoutPolicy: Fixes aggressive 500ms first-attempt timeout in HttpTimeoutPolicyControlPlaneRetriableHotPath #5816

Open
NaluTripician wants to merge 2 commits into main from users/ntripician/fix-issue-5642-hotpath-timeout
@NaluTripician
Contributor

Summary

Raises the first-attempt timeout for HttpTimeoutPolicyControlPlaneRetriableHotPath from 500ms to 1s, aligning with the precedent set by HttpTimeoutPolicyForThinClient (#5496) and HttpTimeoutPolicyForPartitionFailover (#5484). This eliminates the spurious TaskCanceledException retries customers see on .NET 10 and in environments with moderate network latency.

Issue

Fixes #5642

Root Cause

HttpTimeoutPolicyControlPlaneRetriableHotPath.TimeoutsAndDelays used (0.5s, 5s, 65s) as its retry timeout sequence. The 500ms first-attempt budget is too tight for:

  1. .NET 10 HttpConnectionPool changes: connection establishment / TLS handshake / SslStream initialization frequently takes longer than 500ms on .NET 10 even when the network is healthy.
  2. Real-world latency: any deployment with moderate network latency (e.g., cross-region, traffic-shaped environments, or Azure App Services with cold connection pools) routinely exceeds 500ms on the very first attempt.

The result is a TaskCanceledException on attempt 1, an immediate (successful) retry on attempt 2, and noisy telemetry / wasted work even though nothing is actually failing. The reporter independently verified that raising the first-attempt timeout eliminates the issue.

The same class of bug was previously fixed in two sibling policies:

  • HttpTimeoutPolicyForThinClient (#5496)
  • HttpTimeoutPolicyForPartitionFailover (#5484)

HttpTimeoutPolicyControlPlaneRetriableHotPath was missed by those PRs and still carries the original aggressive value.

Fix

Change the first-attempt timeout from 0.5s to 1s. Leave the 5s and 65s tail attempts unchanged so the overall retry budget for genuinely slow control-plane operations is preserved.

Alternatives Considered

  • (6s, 6s, 10s), matching HttpTimeoutPolicyForPartitionFailover: rejected because it shrinks the total retry budget from ~70s to ~22s. Hot-path control-plane calls (address resolution under high load) can legitimately need the 65s tail.
  • (6s, 6s, 65s), per the reporter's suggestion: rejected as a larger behavior change than necessary. The SDK team's most recent precedent (#5496, "[Fundamentals] HttpTimeoutPolicy: Update the HttpTimeoutPolicyForThinClient to Use More Relaxed Timeouts") settled on 1s for the first-attempt timeout in similar circumstances. Bumping to 1s removes the spurious cancellations while keeping the change surgical.
  • Make the value configurable: out of scope for a bug fix; would require API surface changes.
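
The budget figures quoted for these alternatives can be checked by summing the per-attempt timeouts. The sketch below uses Python purely as a calculator (it is not SDK code); the tuples are the values quoted in this description, the labels are illustrative, and delays between attempts are assumed to be negligible and are left out.

```python
# Per-attempt request timeouts in seconds, as quoted in this PR description.
# Assumption: delays between attempts are ignored when estimating the budget.
candidates = {
    "original (0.5s, 5s, 65s)": (0.5, 5, 65),
    "this PR (1s, 5s, 65s)": (1, 5, 65),
    "PartitionFailover-style (6s, 6s, 10s)": (6, 6, 10),
    "reporter's suggestion (6s, 6s, 65s)": (6, 6, 65),
}

# Total retry budget = sum of the per-attempt timeouts.
totals = {name: sum(timeouts) for name, timeouts in candidates.items()}

for name, total in totals.items():
    print(f"{name}: {total}s total")
```

Under that assumption, the chosen sequence moves the total from 70.5s to 71s, while the (6s, 6s, 10s) option would cut it to 22s.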

How the Fix Works

HttpTimeoutPolicyControlPlaneRetriableHotPath.Instance is consumed by GatewayAddressCache for control-plane address-resolution calls. On each attempt, CosmosHttpClient.SendHttpAsync reads the next (requestTimeout, delayForNextRequest) tuple from the policy and uses requestTimeout as the per-request CancellationTokenSource budget. Raising the first tuple's requestTimeout from 500ms to 1s gives the very first request a more realistic budget; if it still times out, the existing retry path runs unchanged.
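
The per-attempt budget described above can be illustrated with a small simulation. This is a minimal sketch in Python, not the SDK's C# implementation: `fake_gateway_call`, `send_with_policy`, `OLD_POLICY`, and `NEW_POLICY` are hypothetical names, the timeouts are the ones quoted in this description, the inter-attempt delays are assumed to be 0, and the 600ms latency mirrors the repro scenario from the issue.

```python
import asyncio

# Hypothetical stand-ins for the policy's (requestTimeout, delayForNextRequest)
# tuples. Timeouts are taken from this PR description; delays are assumed to be 0.
OLD_POLICY = [(0.5, 0.0), (5.0, 0.0), (65.0, 0.0)]
NEW_POLICY = [(1.0, 0.0), (5.0, 0.0), (65.0, 0.0)]


async def fake_gateway_call(latency: float) -> str:
    """Simulated control-plane request that takes `latency` seconds."""
    await asyncio.sleep(latency)
    return "ok"


async def send_with_policy(policy, latency: float) -> int:
    """Return the 1-based number of the attempt that succeeded."""
    for attempt, (request_timeout, delay) in enumerate(policy, start=1):
        try:
            # Each attempt gets its own budget, analogous to a per-request
            # cancellation token that fires after request_timeout.
            await asyncio.wait_for(fake_gateway_call(latency), timeout=request_timeout)
            return attempt
        except asyncio.TimeoutError:
            # Attempt cancelled: wait the configured delay, then retry.
            await asyncio.sleep(delay)
    raise TimeoutError("all attempts timed out")


async def main():
    # 600ms of latency: over the old 500ms budget, under the new 1s budget.
    old = await send_with_policy(OLD_POLICY, latency=0.6)
    new = await send_with_policy(NEW_POLICY, latency=0.6)
    return old, new


result = asyncio.run(main())
print(f"old policy succeeded on attempt {result[0]}, new policy on attempt {result[1]}")
```

With the old first-attempt budget the simulated call is cancelled once and succeeds on the retry; with the 1s budget it succeeds on the first attempt, and the later tuples are never consulted.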

Testing

  • Existing tests pass: CosmosHttpClientCoreTests (17/17 passing locally, 42s)
  • Updated RetryTransientIssuesTestAsync so the mock delay for the first hot-path attempt (2s) still exceeds the new 1s timeout, preserving the original test intent (verifying that the cancellation token actually fires for each attempt).
  • Repro scenario from the issue (artificial 600ms latency producing spurious 500ms cancellations) is covered by the new first-attempt budget.

Risk Assessment

  • Breaking change: No
  • Performance impact: None in the steady state. Minor positive impact: fewer wasted HTTP attempts and retries when the network has moderate latency.
  • Behavior change for existing users: When the gateway is genuinely slow on the first attempt, callers will now wait up to 1s before the SDK retries, instead of 500ms. Total retry budget is unchanged (~70s).

Cross-SDK Impact

This is a .NET-only timeout policy class. No equivalent class with the same 500ms hard-coded value was found in the Java/Python/Go/JS/Rust Cosmos SDKs in prior triage of #5496 and #5484. No companion PRs needed.

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

…tpTimeoutPolicyControlPlaneRetriableHotPath

Raises the first-attempt timeout for HttpTimeoutPolicyControlPlaneRetriableHotPath
from 500ms to 1s, aligning with the precedent set by HttpTimeoutPolicyForThinClient
(#5496) and HttpTimeoutPolicyForPartitionFailover (#5484).

The original 500ms value was too aggressive for .NET 10's HttpConnectionPool
behavior and any environment with moderate network latency, producing spurious
TaskCanceledExceptions that the SDK then retried successfully but at the cost
of wasted work and noisy customer telemetry.

The 5s and 65s tail attempts are preserved to keep the existing retry budget
for genuinely slow control-plane operations.

Fixes #5642

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician force-pushed the users/ntripician/fix-issue-5642-hotpath-timeout branch from 1eab579 to 9af809d on April 30, 2026 19:27

Member

@kundadebdatta left a comment

LGTM.

Successfully merging this pull request may close these issues.

HttpTimeoutPolicyControlPlaneRetriableHotPath 500ms first-attempt timeout causes recurring failures on .NET 10
