[azcosmos] Control-plane retry: cross-region failover preempted by caller ctx cancellation #26649

@NaluTripician

Summary

Same class of defect as Azure/azure-cosmos-dotnet-v3#5805 (fix in Azure/azure-cosmos-dotnet-v3#5806).

cosmos_client_retry_policy.go honors the caller's ctx.Err() and exits the retry loop before consulting the cross-region retry decision on control-plane reads. When an unhealthy preferred region triggers internal HTTP timeout escalation, a caller whose ctx deadline fires during escalation never gets the cross-region attempt that the SDK advertises.

The global endpoint manager policy (cosmos_global_endpoint_manager_policy.go:22-24) already uses context.WithoutCancel to shield its own internal retries from caller cancellation — the same pattern is missing from the client retry policy for control-plane reads.

Note: azcosmos has no collection / partition-key-range cache today, so the exact customer-visible failure mode of the .NET bug does not reproduce. This issue is filed to:

  1. Fix the existing gap on GetAccountProperties and region discovery, which are affected today.
  2. Prevent the defect class from silently recurring when / if a collection or partition cache is later added (at which point control-plane reads gate data-plane traffic, as in .NET / Python).

Files / lines

  • sdk/data/azcosmos/cosmos_client_retry_policy.go — client retry policy (~lines 24-79)
  • sdk/data/azcosmos/cosmos_global_endpoint_manager_policy.go:22-24 — reference for the context.WithoutCancel pattern already used locally.

Repro (sketch)

  1. Fresh azcosmos.Client.
  2. Preferred regions [A, B]; A unreachable (blackhole / 503).
  3. Call client.AccountProperties(...) with a context bounded by context.WithTimeout(ctx, 36*time.Second).
  4. Observe: context deadline fires, no attempt against region B is made.

Expected behavior

When the caller's ctx is cancelled during an in-flight control-plane attempt, the retry policy should still be consulted. If it indicates a cross-region retry, exactly one bounded attempt against the next region should execute on a retry-scoped context derived with context.WithoutCancel + context.WithTimeout(grace).

Proposed fix direction

retryCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), graceWindow)
defer cancel()
// attempt against next region using retryCtx

If the grace attempt also fails or expires, surface the original ctx.Err()-bearing error as the cause. A grace window of 10s aligns with the .NET fix.

Cross-references

This defect class was identified in a cross-SDK investigation prompted by the .NET fix. Tracking issues:

/cc @NaluTripician

Labels

  • Client (This issue points to a problem in the data-plane of the library.)
  • Cosmos
  • Service Attention (Workflow: This issue is responsible by Azure service team.)