Summary
Same class of defect as Azure/azure-cosmos-dotnet-v3#5805 (fix in Azure/azure-cosmos-dotnet-v3#5806).
cosmos_client_retry_policy.go honors the caller's ctx.Err() and exits the retry loop before consulting the cross-region retry decision on control-plane reads. When an unhealthy preferred region triggers internal HTTP timeout escalation, a caller whose ctx deadline fires during escalation never gets the cross-region attempt that the SDK advertises.
The global endpoint manager policy (cosmos_global_endpoint_manager_policy.go:22-24) already uses context.WithoutCancel to shield its own internal retries from caller cancellation — the same pattern is missing from the client retry policy for control-plane reads.
Note: azcosmos has no collection / partition-key-range cache today, so the exact customer-visible failure mode of the .NET bug does not reproduce. This issue is filed to:
- Fix the existing gap on GetAccountProperties and region discovery, which are affected today.
- Prevent the defect class from silently recurring if a collection or partition cache is later added (at which point control-plane reads would gate data-plane traffic, as in .NET / Python).
Files / lines
sdk/data/azcosmos/cosmos_client_retry_policy.go — client retry policy (~lines 24-79)
sdk/data/azcosmos/cosmos_global_endpoint_manager_policy.go:22-24 — reference for the context.WithoutCancel pattern already used locally.
Repro (sketch)
- Fresh cosmos.Client.
- Preferred regions [A, B]; A unreachable (blackhole / 503).
- Call client.AccountProperties(...) with a context bounded by context.WithTimeout(ctx, 36*time.Second).
- Observe: context deadline fires, no attempt against region B is made.
Expected behavior
When the caller's ctx is cancelled during an in-flight control-plane attempt, the retry policy should still be consulted. If it indicates a cross-region retry, one bounded attempt against the next region executes on a retry-scoped context derived with context.WithoutCancel + context.WithTimeout(grace).
Proposed fix direction
retryCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), graceWindow)
defer cancel()
// attempt against next region using retryCtx
If the grace attempt also fails or expires, surface the original ctx.Err()-bearing error as the cause. A grace window of 10s aligns with the .NET fix.
Cross-references
This defect class was identified in a cross-SDK investigation prompted by the .NET fix. Tracking issues:
- […] retryWhen operator structurally isolates subscription cancellation from the retry decision, so the defect does not reproduce.
/cc @NaluTripician