Reads routed via defaultEndpoint do not failover after write region switch when ExcludeRegions filters all preferred regions #5821

@ananth7592

Summary

When ExcludeRegions filters out every region in ApplicationPreferredRegions (for example, when the two lists are identical), the SDK falls back to the region-agnostic defaultEndpoint for reads. After a write region switch, writes fail over correctly (they use AvailableWriteLocations[0] directly), but reads remain pinned to the old region's address resolution, because the EndpointCache that GlobalAddressResolver holds for the default endpoint has a stale location string frozen at client init time.

Root Cause

There are two layers to this bug:

Layer 1: LocationCache.GetApplicableEndpoints uses static defaultEndpoint as read fallback

LocationCache.cs#L372-L388

public ReadOnlyCollection<Uri> GetApplicableEndpoints(DocumentServiceRequest request, bool isReadRequest)
{
    // ...
    return GetApplicableEndpoints(
        isReadRequest ? this.locationInfo.AvailableReadEndpointByLocation : this.locationInfo.AvailableWriteEndpointByLocation,
        effectivePreferredLocations,
        this.defaultEndpoint,    // ← BUG: Static, region-agnostic, never updated
        request.RequestContext.ExcludeRegions);
}

The private static helper at L416-L449 uses this fallback when no endpoints survive filtering:

if (applicableEndpoints.Count == 0)
{
    applicableEndpoints.Add(fallbackEndpoint); // ← "myaccount.documents.azure.com"
}

Key insight: The fix already exists in UpdateLocationCache! At L756-L760, ReadEndpoints correctly uses WriteEndpoints[0] as fallback:

nextLocationInfo.ReadEndpoints = this.GetPreferredAvailableEndpoints(
    endpointsByLocation: nextLocationInfo.AvailableReadEndpointByLocation,
    orderedLocations: nextLocationInfo.AvailableReadLocations,
    expectedAvailableOperation: OperationType.Read,
    fallbackEndpoint: nextLocationInfo.WriteEndpoints[0]);  // ← Dynamic, correct!

But GetApplicableEndpoints (the ExcludeRegions path) bypasses this and uses this.defaultEndpoint instead.
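
To make the Layer 1 behavior concrete, here is a minimal sketch of the filter-then-fallback logic described above. It is not the SDK's code: EndpointFilterSketch, its parameter types, and the example URIs are assumptions made for illustration, and the real helper may differ in details such as ordering and case handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the private helper; not the SDK implementation.
public static class EndpointFilterSketch
{
    public static IReadOnlyList<Uri> GetApplicableEndpoints(
        IReadOnlyDictionary<string, Uri> endpointsByLocation,
        IEnumerable<string> preferredLocations,
        Uri fallbackEndpoint,
        IEnumerable<string> excludeRegions)
    {
        var excluded = new HashSet<string>(excludeRegions ?? Enumerable.Empty<string>());
        var applicable = new List<Uri>();

        // Keep preferred locations that are not excluded and have a known endpoint.
        foreach (string location in preferredLocations)
        {
            if (!excluded.Contains(location)
                && endpointsByLocation.TryGetValue(location, out Uri endpoint))
            {
                applicable.Add(endpoint);
            }
        }

        // Nothing survived filtering: the caller-supplied fallback is the only candidate.
        if (applicable.Count == 0)
        {
            applicable.Add(fallbackEndpoint);
        }

        return applicable;
    }
}
```

With preferredLocations and excludeRegions both set to ["East US"], the result contains only fallbackEndpoint, so whichever URI the caller passes as the fallback decides where the read goes.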

Layer 2: GlobalAddressResolver.GetOrAddEndpoint caches stale AddressResolver.location

GlobalAddressResolver.cs#L327-L336 — Once an endpoint is cached at init, TryGetValue returns immediately without validating location:

if (this.addressCacheByEndpoint.TryGetValue(endpoint, out EndpointCache existingCache))
{
    return existingCache; // ← Never checks if location drifted
}

AddressResolver.cs#L34: location is a private readonly string, frozen at creation.

AddressResolver.cs#L72 — Propagates stale value: request.RequestContext.RegionName = this.location
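
The Layer 2 staleness can be modeled in a few lines. The sketch below is illustrative only: ResolverSketch and EndpointCacheSketch are hypothetical types, and passing the current location into GetOrAddEndpoint is an assumption made so the effect is visible, not the real method signature.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for AddressResolver: the location is fixed at construction.
public sealed class ResolverSketch
{
    private readonly string location;

    public ResolverSketch(string location) => this.location = location;

    public string RegionName => this.location;
}

// Hypothetical stand-in for the per-endpoint cache in GlobalAddressResolver.
public sealed class EndpointCacheSketch
{
    private readonly ConcurrentDictionary<Uri, ResolverSketch> cacheByEndpoint =
        new ConcurrentDictionary<Uri, ResolverSketch>();

    public ResolverSketch GetOrAddEndpoint(Uri endpoint, string currentLocation)
    {
        if (this.cacheByEndpoint.TryGetValue(endpoint, out ResolverSketch existing))
        {
            // Cache hit: currentLocation is ignored, so any drift goes unnoticed.
            return existing;
        }

        return this.cacheByEndpoint.GetOrAdd(endpoint, _ => new ResolverSketch(currentLocation));
    }
}
```

Requesting the same default endpoint URI first with "East US" and later with "West US" returns a resolver that still reports "East US", while a distinct regional URI gets its own fresh entry.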

Proposed Fix

Change GetApplicableEndpoints to use WriteEndpoints[0] as the read fallback instead of this.defaultEndpoint.

This aligns with:

  • The existing pattern in UpdateLocationCache (L760) which already uses WriteEndpoints[0] for ReadEndpoints
  • Java SDK: LocationCache.java#L266: writeRegionalRoutingContexts.get(0)
  • Python SDK: _location_cache.py#L241: get_write_regional_routing_contexts()[0]

public ReadOnlyCollection<Uri> GetApplicableEndpoints(DocumentServiceRequest request, bool isReadRequest)
{
    if (request.RequestContext.ExcludeRegions == null || request.RequestContext.ExcludeRegions.Count == 0)
    {
        return isReadRequest ? this.ReadEndpoints : this.WriteEndpoints;
    }

    DatabaseAccountLocationsInfo databaseAccountLocationsInfoSnapshot = this.locationInfo;
    ReadOnlyCollection<string> effectivePreferredLocations = databaseAccountLocationsInfoSnapshot.EffectivePreferredLocations;

    Uri fallbackEndpoint = isReadRequest
        ? databaseAccountLocationsInfoSnapshot.WriteEndpoints[0]  // Dynamic: tracks current write region
        : this.defaultEndpoint;

    return GetApplicableEndpoints(
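        // Read from the snapshot taken above so the endpoints and the fallback come from the same state.
        isReadRequest ? databaseAccountLocationsInfoSnapshot.AvailableReadEndpointByLocation : databaseAccountLocationsInfoSnapshot.AvailableWriteEndpointByLocation,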
        effectivePreferredLocations,
        fallbackEndpoint,
        request.RequestContext.ExcludeRegions);
}

Why Read-Path Only

  • Writes in single-master bypass ExcludeRegions entirely at L347-348 — they use AvailableWriteLocations directly
  • Reads hit GetApplicableEndpoints → all filtered → fallback to defaultEndpoint → stale AddressResolver.location

Impact

| What | Before Fix | After Fix |
| --- | --- | --- |
| Read fallback endpoint | defaultEndpoint (region-agnostic, static) | WriteEndpoints[0] (region-specific, dynamic) |
| AddressResolver.location | Stale after hub switch | Correct (cache keyed by regional URI, recreated) |
| request.RequestContext.RegionName | Reports wrong region | Reports correct region |
| Diagnostics | Wrong region in traces | Correct region |
| Per-partition routing | May make wrong decisions | Correct decisions |

Cross-SDK Comparison

| SDK | Read Fallback When All Excluded | Correct? |
| --- | --- | --- |
| .NET (current) | this.defaultEndpoint | ❌ Static, stale |
| .NET (proposed) | WriteEndpoints[0] | ✅ Dynamic |
| Java | writeRegionalRoutingContexts.get(0) | ✅ Dynamic |
| Python | get_write_regional_routing_contexts()[0] | ✅ Dynamic |
| Rust | self.default_endpoint (gateway-only, no AddressResolver) | ⚠️ azure-sdk-for-rust#4322 |

Reproduction Scenario

  1. Configure client with ApplicationPreferredRegions = ["East US"]
  2. Send reads with ExcludeRegions = ["East US"]
  3. Reads fall back to defaultEndpoint → AddressResolver.location = "East US" (init-time write region)
  4. Trigger write region failover: East US → West US
  5. WriteEndpoints[0] now points to West US regional endpoint (correctly updated)
  6. But GetApplicableEndpoints still returns defaultEndpoint → stale cache hit in GlobalAddressResolver
  7. request.RequestContext.RegionName incorrectly reports "East US"
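
The steps above can be walked through in a small in-memory simulation that compares the current and proposed fallback. This is only a sketch: FailoverScenarioSketch, the regional URI naming, and the dictionary standing in for the per-endpoint resolver cache are all assumptions, and no real Cosmos DB client is involved.

```csharp
using System;
using System.Collections.Generic;

public static class FailoverScenarioSketch
{
    private static readonly Uri DefaultEndpoint = new Uri("https://myaccount.documents.azure.com/");

    // Hypothetical regional URI naming, for illustration only.
    private static Uri Regional(string region) =>
        new Uri($"https://myaccount-{region.Replace(" ", "").ToLowerInvariant()}.documents.azure.com/");

    // Returns the region name reported for a fully excluded read, before and after the failover.
    public static (string Before, string After) Run(bool useWriteEndpointFallback)
    {
        // Models the resolver cache: the location is captured once per endpoint URI.
        var locationByEndpoint = new Dictionary<Uri, string>();
        string writeRegion = "East US";

        string Read()
        {
            // Steps 1-3: preferred and excluded regions match, so only the fallback remains.
            Uri endpoint = useWriteEndpointFallback ? Regional(writeRegion) : DefaultEndpoint;
            if (!locationByEndpoint.TryGetValue(endpoint, out string location))
            {
                location = writeRegion;
                locationByEndpoint[endpoint] = location;
            }
            return location;
        }

        string before = Read();
        writeRegion = "West US"; // Step 4: write region failover.
        string after = Read();   // Steps 5-7: which region does the read report now?
        return (before, after);
    }
}

// FailoverScenarioSketch.Run(useWriteEndpointFallback: false) returns ("East US", "East US")
// FailoverScenarioSketch.Run(useWriteEndpointFallback: true)  returns ("East US", "West US")
```

In this model the static fallback keeps hitting the entry created at init, while the write-endpoint fallback changes the cache key after failover and therefore picks up the new region.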
