Summary
RequestInvokerHandler.SendAsync (in the FeedRangeEpk branch, around line 330 of Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs) calls overlappingRanges[0] without first checking that the list is non-empty. When TryGetOverlappingRangesAsync returns an empty (but non-null) list — which happens in the comparator-mismatch scenario tracked in #5897 / #5859 — the SDK throws:
System.ArgumentOutOfRangeException: Index was out of range… Actual value was 0.
   at System.Collections.Generic.SortedList`2.GetValueAtIndex(Int32 index)
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(...)
The exception bubbles up to the application as a fatal error, with no retry. Observed in MFS production telemetry: 4 occurrences in eastus2euap canary in May 2026 (full traces in #5859).
This issue tracks defense-in-depth hardening in v3 that is independent of (and complementary to) the principled comparator fix tracked in #5897 / #5899.
The defect
// Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs
// FeedRangeEpk branch of SendAsync, around line 330
IReadOnlyList<PartitionKeyRange> overlappingRanges = await routingMapProvider.TryGetOverlappingRangesAsync(
    collectionFromCache.ResourceId,
    feedRangeEpk.Range,
    childTrace,
    forceRefresh: false);
if (overlappingRanges == null) // ✅ null handled
{
    CosmosException notFound = new CosmosException(
        $"Stale cache for rid '{collectionFromCache.ResourceId}'",
        statusCode: System.Net.HttpStatusCode.NotFound,
        subStatusCode: default,
        activityId: Guid.Empty.ToString(),
        requestCharge: default);
    return notFound.ToCosmosResponseMessage(request);
}
if (overlappingRanges.Count > 1) // ✅ split handled
{
    …
    return goneException.ToCosmosResponseMessage(request);
}
// overlappingRanges.Count == 1 // ❌ comment is wrong: this else fires for Count <= 1
else
{
    Range<string> singleRange = overlappingRanges[0].ToRange(); // ❌ throws ArgumentOutOfRangeException if Count == 0
    if ((singleRange.Min == feedRangeEpk.Range.Min) && (singleRange.Max == feedRangeEpk.Range.Max))
    {
        request.PartitionKeyRangeId = new Documents.PartitionKeyRangeIdentity(overlappingRanges[0].Id);
    }
    else
    {
        request.PartitionKeyRangeId = new Documents.PartitionKeyRangeIdentity(overlappingRanges[0].Id);
        request.Headers.ReadFeedKeyType = RntbdConstants.RntdbReadFeedKeyType.EffectivePartitionKeyRange.ToString();
        request.Headers.StartEpk = feedRangeEpk.Range.Min;
        request.Headers.EndEpk = feedRangeEpk.Range.Max;
    }
}
The SortedList.GetValueAtIndex(0) frame in the production stack appears because CollectionRoutingMap.GetOverlappingRanges returns new ReadOnlyCollection<>(sortedList.Values), and SortedList<TKey,TValue>.Values[int] is implemented internally via GetValueAtIndex(int). So overlappingRanges[0] on an empty result transitively throws there.
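The throwing behavior described above can be reproduced outside the SDK. The following is a minimal sketch, not SDK code: EmptyOverlapRepro and its members are hypothetical names, and plain strings stand in for PartitionKeyRange. It only assumes what the paragraph states, namely that the result is a ReadOnlyCollection wrapped around SortedList.Values.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical stand-in for the routing-map lookup; not part of the SDK.
public static class EmptyOverlapRepro
{
    // Mirrors the shape described above: a read-only view over SortedList.Values.
    public static IReadOnlyList<string> GetOverlappingRanges(SortedList<string, string> ranges)
    {
        return new ReadOnlyCollection<string>(ranges.Values);
    }

    // Returns true when indexing the empty result throws ArgumentOutOfRangeException.
    public static bool IndexingEmptyResultThrows()
    {
        IReadOnlyList<string> overlapping = GetOverlappingRanges(new SortedList<string, string>());
        try
        {
            string first = overlapping[0]; // same access pattern as overlappingRanges[0]
            return first == null && false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }
}
```

The result is non-null and has Count == 0, so a null check alone does not protect the indexer call.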
Proposed fix
Insert a Count == 0 branch that mirrors the null branch's semantics (return 404 NotFound), so the upstream NameCacheStaleRetryQueryPipelineStage invalidates the routing-map cache and retries the request — the same self-healing path that the null case already takes today.
if (overlappingRanges == null || overlappingRanges.Count == 0)
{
    CosmosException notFound = new CosmosException(
        $"Stale cache for rid '{collectionFromCache.ResourceId}' " +
        $"- no overlapping ranges for '{feedRangeEpk.Range}'.",
        statusCode: System.Net.HttpStatusCode.NotFound,
        subStatusCode: default,
        activityId: Guid.Empty.ToString(),
        requestCharge: default);
    return notFound.ToCosmosResponseMessage(request);
}
(Alternatively keep the two branches separate so the trace message is more diagnostically useful — e.g. "Stale cache" vs "Empty overlapping ranges" — author's call.)
Also fix the misleading // overlappingRanges.Count == 1 comment.
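The two-branch alternative mentioned above could look roughly like the sketch below. It is an isolated illustration under stated assumptions, not the handler's code: RoutingDecision, OverlapClassifier, and Classify are hypothetical names, the message texts are placeholders, and the real handler would build CosmosException responses rather than return a tuple.

```csharp
using System.Collections.Generic;

// Hypothetical outcome type for illustration only.
public enum RoutingDecision { NotFound, Gone, RouteToSingleRange }

public static class OverlapClassifier
{
    // Classifies the lookup result; null and empty both map to NotFound,
    // but each keeps its own diagnostic message.
    public static (RoutingDecision Decision, string Message) Classify<T>(
        IReadOnlyList<T> overlappingRanges,
        string rid)
    {
        if (overlappingRanges == null)
        {
            return (RoutingDecision.NotFound, $"Stale cache for rid '{rid}'");
        }

        if (overlappingRanges.Count == 0)
        {
            return (RoutingDecision.NotFound, $"Empty overlapping ranges for rid '{rid}'");
        }

        if (overlappingRanges.Count > 1)
        {
            return (RoutingDecision.Gone, $"Multiple overlapping ranges for rid '{rid}'");
        }

        // Exactly one range: safe to index [0] from here on.
        return (RoutingDecision.RouteToSingleRange, string.Empty);
    }
}
```

With the branches ordered this way, the final path is reached only when Count == 1, which also makes the original comment accurate.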
Why this matters even with #5897 / #5899 landed
| Scenario | Without this guard | With this guard |
|---|---|---|
| Comparator-asymmetry bug (#5897) regression on a new code path | AOOR crash, no retry | 404 → forced cache refresh → retry → succeed |
| Transient TryCombine race during multi-level partition split | AOOR crash | 404 → forced cache refresh → retry → succeed |
| Stale cache held by long-running process after backend mitigation | AOOR crash forever until process restart | 404 → forced cache refresh on first attempt → self-heal |
| Genuine null result from TryGetOverlappingRangesAsync | Already handled | Already handled (unchanged) |
This is the smallest patch that converts an unrecoverable handler-side crash into the SDK's existing self-healing retry path.
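The "404 → forced cache refresh → retry → succeed" flow can be simulated in a few lines. This is a toy model under stated assumptions, not the SDK's retry pipeline: FakeRoutingCache and RetrySketch are hypothetical, strings stand in for partition key ranges, and the loop only imitates the role that NameCacheStaleRetryQueryPipelineStage plays upstream.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical cache that returns an empty result until it is refreshed.
public sealed class FakeRoutingCache
{
    private bool stale = true;

    public int RefreshCount { get; private set; }

    public IReadOnlyList<string> GetOverlappingRanges()
    {
        return this.stale ? Array.Empty<string>() : new[] { "0" };
    }

    public void ForceRefresh()
    {
        this.stale = false;
        this.RefreshCount++;
    }
}

public static class RetrySketch
{
    // Guarded handler: an empty or null result becomes a 404 instead of a crash.
    public static HttpStatusCode Send(FakeRoutingCache cache)
    {
        IReadOnlyList<string> ranges = cache.GetOverlappingRanges();
        if (ranges == null || ranges.Count == 0)
        {
            return HttpStatusCode.NotFound;
        }

        return HttpStatusCode.OK;
    }

    // Caller-side loop: on 404, refresh the cache and try again.
    public static HttpStatusCode SendWithRetry(FakeRoutingCache cache, int maxAttempts = 2)
    {
        HttpStatusCode status = HttpStatusCode.NotFound;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            status = Send(cache);
            if (status != HttpStatusCode.NotFound)
            {
                return status;
            }

            cache.ForceRefresh();
        }

        return status;
    }
}
```

Without the guard, Send would throw on the first attempt and the refresh step would never run, which is the difference between the two columns of the table.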
Acceptance criteria
- RequestInvokerHandler.SendAsync no longer throws ArgumentOutOfRangeException when TryGetOverlappingRangesAsync returns an empty list.
- SendAsync returns a ResponseMessage with StatusCode = NotFound (no sub-status) when overlappingRanges.Count == 0, matching the existing null-handling contract.
- The misleading // overlappingRanges.Count == 1 comment is removed or corrected.
- Unit test RequestInvokerHandlerTests.SendAsync_FeedRangeEpk_WithEmptyOverlappingRanges_ReturnsNotFound: mocks IRoutingMapProvider.TryGetOverlappingRangesAsync to return Array.Empty<PartitionKeyRange>(), invokes SendAsync with a FeedRangeEpk, and asserts that response.StatusCode == NotFound and that no exception is thrown.
- A test verifies that NameCacheStaleRetryQueryPipelineStage consumes the 404 and triggers a forced refresh (can use existing HandlerTests patterns).
- A changelog entry is added under ### Unreleased in Microsoft.Azure.Cosmos/changelog.md.
Risk / blast radius
- Behavioral change is limited to the previously-crashing path. Code paths that today return Count >= 1 are completely unaffected.
- Returning 404 NotFound is the existing, well-tested response for "routing info missing for this request"; no new error code is introduced.
- No public API surface change.
- No Direct package version change required.
Implementation note
This is a small, self-contained change, so it is best filed as a standalone PR rather than bundled with the broader comparator fix in #5897 (which is blocked on the Direct package overload in #5899). Shipping this guard independently provides immediate AOOR-crash mitigation while the principled fix follows its own release cadence.
References
- … fileShareManagementDataV2HPK container, May 2026)
- Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs: the call site to harden
- Microsoft.Azure.Cosmos/src/Query/Core/Pipeline/CrossPartition/NameCacheStaleRetryQueryPipelineStage.cs: the consumer of the returned 404 that triggers the self-healing forced refresh