Bug: RequestInvokerHandler.SendAsync throws ArgumentOutOfRangeException when TryGetOverlappingRangesAsync returns empty list (missing Count==0 guard) #5898

@ananth7592

Summary

RequestInvokerHandler.SendAsync (in the FeedRangeEpk branch, around line 330 of Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs) calls overlappingRanges[0] without first checking that the list is non-empty. When TryGetOverlappingRangesAsync returns an empty (but non-null) list — which happens in the comparator-mismatch scenario tracked in #5897 / #5859 — the SDK throws:

System.ArgumentOutOfRangeException: Index was out of range… Actual value was 0.
   at System.Collections.Generic.SortedList`2.GetValueAtIndex(Int32 index)
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(...)

The exception bubbles up to the application as a fatal error, with no retry. Observed in MFS production telemetry: 4 occurrences in eastus2euap canary in May 2026 (full traces in #5859).

This issue tracks defense-in-depth hardening in v3 that is independent of (and complementary to) the principled comparator fix tracked in #5897 / #5899.


The defect

// Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs
// FeedRangeEpk branch of SendAsync, around line 330

IReadOnlyList<PartitionKeyRange> overlappingRanges = await routingMapProvider.TryGetOverlappingRangesAsync(
    collectionFromCache.ResourceId,
    feedRangeEpk.Range,
    childTrace,
    forceRefresh: false);

if (overlappingRanges == null)                                        // ✅ null handled
{
    CosmosException notFound = new CosmosException(
        $"Stale cache for rid '{collectionFromCache.ResourceId}'",
        statusCode: System.Net.HttpStatusCode.NotFound,
        subStatusCode: default,
        activityId: Guid.Empty.ToString(),
        requestCharge: default);
    return notFound.ToCosmosResponseMessage(request);
}

if (overlappingRanges.Count > 1)                                      // ✅ split handled
{
    // goneException is constructed in code not shown in this excerpt
    return goneException.ToCosmosResponseMessage(request);
}
// overlappingRanges.Count == 1                                       // ❌ comment is wrong: this else fires for Count <= 1
else
{
    Range<string> singleRange = overlappingRanges[0].ToRange();       // ❌ throws ArgumentOutOfRangeException if Count == 0
    if ((singleRange.Min == feedRangeEpk.Range.Min) && (singleRange.Max == feedRangeEpk.Range.Max))
    {
        request.PartitionKeyRangeId = new Documents.PartitionKeyRangeIdentity(overlappingRanges[0].Id);
    }
    else
    {
        request.PartitionKeyRangeId = new Documents.PartitionKeyRangeIdentity(overlappingRanges[0].Id);
        request.Headers.ReadFeedKeyType = RntbdConstants.RntdbReadFeedKeyType.EffectivePartitionKeyRange.ToString();
        request.Headers.StartEpk = feedRangeEpk.Range.Min;
        request.Headers.EndEpk = feedRangeEpk.Range.Max;
    }
}

The SortedList.GetValueAtIndex(0) frame in the production stack appears because CollectionRoutingMap.GetOverlappingRanges returns new ReadOnlyCollection<>(sortedList.Values), and SortedList<TKey,TValue>.Values[int] is implemented internally via GetValueAtIndex(int). So overlappingRanges[0] on an empty result transitively throws there.
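The indexing behavior described above can be reproduced outside the SDK. The following is a minimal sketch, not SDK code: the class and method names are hypothetical, and a `SortedList<string, string>` stands in for the routing map's internal list. It only shows that indexing `[0]` on a `ReadOnlyCollection<>` wrapped around the `Values` of an empty `SortedList<,>` raises `ArgumentOutOfRangeException`.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical demo class; not part of Microsoft.Azure.Cosmos.
public static class EmptySortedListDemo
{
    // Returns true when indexing [0] on a read-only wrapper over the Values
    // of an empty SortedList throws ArgumentOutOfRangeException.
    public static bool IndexingEmptyValuesThrows()
    {
        // Stand-in for the routing map's sorted list, left empty on purpose.
        SortedList<string, string> sortedList = new SortedList<string, string>();

        // Same wrapping shape the issue attributes to GetOverlappingRanges.
        IReadOnlyList<string> overlapping =
            new ReadOnlyCollection<string>(sortedList.Values);

        try
        {
            _ = overlapping[0]; // delegates to the SortedList value indexer
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }
}
```

Calling `EmptySortedListDemo.IndexingEmptyValuesThrows()` from a test or a console `Main` should report that the exception is thrown, which matches the production stack frame without needing a live container.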


Proposed fix

Insert a Count == 0 branch that mirrors the null branch's semantics (return 404 NotFound), so the upstream NameCacheStaleRetryQueryPipelineStage invalidates the routing-map cache and retries the request — the same self-healing path that the null case already takes today.

if (overlappingRanges == null || overlappingRanges.Count == 0)
{
    CosmosException notFound = new CosmosException(
        $"Stale cache for rid '{collectionFromCache.ResourceId}' " +
        $"- no overlapping ranges for '{feedRangeEpk.Range}'.",
        statusCode: System.Net.HttpStatusCode.NotFound,
        subStatusCode: default,
        activityId: Guid.Empty.ToString(),
        requestCharge: default);
    return notFound.ToCosmosResponseMessage(request);
}

(Alternatively keep the two branches separate so the trace message is more diagnostically useful — e.g. "Stale cache" vs "Empty overlapping ranges" — author's call.)

Also fix the misleading // overlappingRanges.Count == 1 comment.
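For the alternative of keeping the null and empty cases separate, here is a rough, self-contained sketch of the decision logic. Everything in it is hypothetical: `RoutingDecision` and `OverlappingRangeGuard.Classify` are illustration-only names, not SDK types, and mapping the `Count > 1` case to HTTP 410 Gone is an assumption inferred from the `goneException` variable name in the excerpt above. The real handler would build `CosmosException` instances rather than return this struct.

```csharp
using System.Collections.Generic;
using System.Net;

// Hypothetical result type for illustration; not an SDK type.
public readonly struct RoutingDecision
{
    public RoutingDecision(HttpStatusCode? failureStatus, string message)
    {
        this.FailureStatus = failureStatus;
        this.Message = message;
    }

    // null means "exactly one range: route the request to it".
    public HttpStatusCode? FailureStatus { get; }

    public string Message { get; }
}

// Hypothetical helper mirroring the branch order proposed in this issue.
public static class OverlappingRangeGuard
{
    public static RoutingDecision Classify<T>(IReadOnlyList<T> overlappingRanges, string rid)
    {
        if (overlappingRanges == null)
        {
            return new RoutingDecision(HttpStatusCode.NotFound, $"Stale cache for rid '{rid}'");
        }

        if (overlappingRanges.Count == 0)
        {
            // New guard: same 404 as the null case, but a distinct message.
            return new RoutingDecision(HttpStatusCode.NotFound, $"Empty overlapping ranges for rid '{rid}'");
        }

        if (overlappingRanges.Count > 1)
        {
            // Assumption: the split case surfaces as 410 Gone.
            return new RoutingDecision(HttpStatusCode.Gone, $"Multiple overlapping ranges for rid '{rid}'");
        }

        // Count == 1 is now the only case that reaches the indexing path.
        return new RoutingDecision(null, null);
    }
}
```

With the branches ordered this way, the final path is reached only when `Count == 1`, so the original `// overlappingRanges.Count == 1` comment would become accurate.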


Why this matters even with #5897 / #5899 landed

| Scenario | Without this guard | With this guard |
| --- | --- | --- |
| Comparator-asymmetry bug (#5897) regression on a new code path | ArgumentOutOfRangeException (AOOR) crash, no retry | 404 → forced cache refresh → retry → succeed |
| Transient TryCombine race during multi-level partition split | AOOR crash | 404 → forced cache refresh → retry → succeed |
| Stale cache held by long-running process after backend mitigation | AOOR crash forever until process restart | 404 → forced cache refresh on first attempt → self-heal |
| Genuine null result from TryGetOverlappingRangesAsync | Already handled | Already handled (unchanged) |

This is the smallest patch that converts an unrecoverable handler-side crash into the SDK's existing self-healing retry path.


Acceptance criteria

  • RequestInvokerHandler.SendAsync no longer throws ArgumentOutOfRangeException when TryGetOverlappingRangesAsync returns an empty list.
  • Returns a ResponseMessage with StatusCode = NotFound (no sub-status) when overlappingRanges.Count == 0, matching the existing null-handling contract.
  • Misleading // overlappingRanges.Count == 1 comment removed or corrected.
  • New unit test: RequestInvokerHandlerTests.SendAsync_FeedRangeEpk_WithEmptyOverlappingRanges_ReturnsNotFound — mocks IRoutingMapProvider.TryGetOverlappingRangesAsync to return Array.Empty<PartitionKeyRange>(), invokes SendAsync with a FeedRangeEpk, asserts response.StatusCode == NotFound and no exception is thrown.
  • Optional integration-style test verifying the upstream NameCacheStaleRetryQueryPipelineStage consumes the 404 and triggers a forced refresh (can use existing HandlerTests patterns).
  • All existing handler tests continue to pass.
  • Changelog entry under ### Unreleased in Microsoft.Azure.Cosmos/changelog.md.

Risk / blast radius

  • Behavioral change is limited to the previously-crashing path. Code paths that today return Count >= 1 are completely unaffected.
  • Returning 404 NotFound is the existing, well-tested response for "routing info missing for this request" — no new error code is introduced.
  • No public API surface change.
  • No Direct package version change required.

Implementation note

This is a small, self-contained change — recommend filing it as a standalone PR rather than bundling it with the broader comparator fix in #5897 (which is blocked on the Direct package overload in #5899). Shipping this guard independently provides immediate AOOR-crash mitigation while the principled fix follows its own release cadence.


Labels

HierarchicalPartitioning, QUERY, Routing, bug
