Commit 0c12e8e

QueryPlan : Fixes 410 Gone Exception on non-x64 platforms (#5257)
The Windows 365 Defender team hit an issue with their hierarchical partition key (HPK) container setup. When they set the root-level partition key in QueryRequestOptions and ran a query, they got data back on Windows, but in other environments (such as Docker) the query failed with a 410 (Gone) exception:

    Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: Gone (410); Substatus: 1002; ActivityId: 623305e2-9a69-4b65-aa9b-911ee8ac3533; Reason: (Epk Range: [0DCEB8CE51C6BFE84F4BD9409F69B9BB,0DCEB8CE51C6BFE84F4BD9409F69B9BBFF) is gone.);

Root cause: in environments other than Windows x64 we get the query plan from the gateway (on Windows we use the ServiceInterop DLL), and the query plan request is then funneled through the RequestInvokerHandler pipeline. The pipeline checks whether a feed range was provided and, if that range resolves to more than one overlapping partition key range, fails the request with a Gone exception. In this particular case, however, more than one overlapping range is a valid (edge case) scenario. For example, take these partition key ranges (MinEPK: string, MaxEPK: string, PKRangeId):

    PKRangeId 0: MinEPK 0D4DC2CD8F49C65A8E0C5306B61B4343
                 MaxEPK 0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED
    PKRangeId 1: MinEPK 0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED
                 MaxEPK 1080F600C27CF98DC13F8639E94E7676

If the provided partition key hashes to a value such as 0DCEB8CE51C6BFE84F4BD9409F69B9BB, the resulting EPK range straddles the boundary between these two records, so both ranges come back as overlapping.

Fix: explicitly check whether the request is a query plan operation; if it is, allow it to proceed.

Testing: I was able to get access to the Defender team's database where they were facing this issue. I manually tested the following E2E use cases and got the desired results.

On Windows, with PartitionKey specified (in the QueryRequestOptions):
    Diagnostics: {"Summary":{"DirectCalls":{"(449, 5350)":6,"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
    Count: 24025; Severity: Medium; Count: 24114; Severity: Informational; Count: 23; Severity: High;

On non-Windows, with PartitionKey specified (in the QueryRequestOptions):
    Diagnostics: {"Summary":{"DirectCalls":{"(449, 5350)":2,"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1,"(0, 0)":1}}
    Count: 24025; Severity: Medium; Count: 24114; Severity: Informational; Count: 23; Severity: High;

On Windows, no QueryRequestOptions specified:
    Diagnostics: {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":4,"(304, 0)":1}}
    Count: 24025; Severity: Medium; Count: 24114; Severity: Informational; Count: 23; Severity: High;

On non-Windows, no QueryRequestOptions specified:
    Diagnostics: {"Summary":{"DirectCalls":{"(200, 0)":2},"GatewayCalls":{"(200, 0)":5,"(304, 0)":1}}
    Count: 24025; Severity: Medium; Count: 24114; Severity: Informational; Count: 23; Severity: High;

The problem with writing a new, reliable E2E test in the pipeline is the setup needed to return overlapping ranges: we don't control when a partition split happens. I talked to Ananth about how he tested it; he worked with the Elasticity team to create the database in the desired state. We can do the same in the pipeline, but we need to check whether our database accounts/containers get cleaned up by some background job in the future (will check with Nalu on this one). I also explored the PartitionKeyHashRangeSplitterAndMerger class, but it works only with the in-memory container and doesn't follow the RequestInvokerHandler pipeline flow. For the time being, I added a unit test to guard against regressions.

Closes #5220

---------

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
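To make the edge case above concrete, here is a minimal sketch (not SDK code) that counts which of the two example partition key ranges overlap the EPK range reported in the exception. `EpkRange`, `EpkOverlapSketch`, `Overlaps`, `OverlappingIds`, and `SampleRanges` are hypothetical names introduced only for illustration. It assumes EPK values can be compared as ordinal strings and that every range is half-open, [min, max); the SDK's actual comparison logic may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a partition key range: [MinInclusive, MaxExclusive).
public sealed class EpkRange
{
    public EpkRange(string id, string minInclusive, string maxExclusive)
    {
        this.Id = id;
        this.MinInclusive = minInclusive;
        this.MaxExclusive = maxExclusive;
    }

    public string Id { get; }
    public string MinInclusive { get; }
    public string MaxExclusive { get; }
}

public static class EpkOverlapSketch
{
    // Two half-open ranges [a, b) and [c, d) overlap when a < d and c < b.
    // Assumption: ordinal string comparison is adequate for these hex EPK values.
    public static bool Overlaps(EpkRange range, string minInclusive, string maxExclusive)
    {
        return string.CompareOrdinal(minInclusive, range.MaxExclusive) < 0
            && string.CompareOrdinal(range.MinInclusive, maxExclusive) < 0;
    }

    public static List<string> OverlappingIds(
        IEnumerable<EpkRange> ranges, string minInclusive, string maxExclusive)
    {
        return ranges
            .Where(r => Overlaps(r, minInclusive, maxExclusive))
            .Select(r => r.Id)
            .ToList();
    }

    // The two partition key ranges quoted in the commit message.
    public static List<EpkRange> SampleRanges()
    {
        const string boundary =
            "0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED";
        return new List<EpkRange>
        {
            new EpkRange("0", "0D4DC2CD8F49C65A8E0C5306B61B4343", boundary),
            new EpkRange("1", boundary, "1080F600C27CF98DC13F8639E94E7676"),
        };
    }

    public static void Main()
    {
        // EPK range taken from the exception text: [hash, hash + "FF").
        const string hash = "0DCEB8CE51C6BFE84F4BD9409F69B9BB";
        List<string> ids = OverlappingIds(SampleRanges(), hash, hash + "FF");
        Console.WriteLine(string.Join(",", ids)); // prints "0,1"
    }
}
```

Both ranges match because the shared boundary value begins with the hash and continues with `2164...`, which sorts after the bare hash and before the hash followed by `FF`.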
1 parent ea9a8a8 commit 0c12e8e

2 files changed

Lines changed: 154 additions & 11 deletions

Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs

Lines changed: 16 additions & 11 deletions
@@ -305,17 +305,22 @@ public virtual async Task<ResponseMessage> SendAsync(
                     // For epk range filtering we can end up in one of 3 cases:
                     if (overlappingRanges.Count > 1)
                     {
-                        // 1) The EpkRange spans more than one physical partition
-                        // In this case it means we have encountered a split and
-                        // we need to bubble that up to the higher layers to update their datastructures
-                        CosmosException goneException = new CosmosException(
-                            message: $"Epk Range: {feedRangeEpk.Range} is gone.",
-                            statusCode: System.Net.HttpStatusCode.Gone,
-                            subStatusCode: (int)SubStatusCodes.PartitionKeyRangeGone,
-                            activityId: Guid.NewGuid().ToString(),
-                            requestCharge: default);
-
-                        return goneException.ToCosmosResponseMessage(request);
+                        //If we are running a query plan and our provided partition key results in a hash that resolves to more than one EPKRanges then its a valid use case
+                        bool isQueryPlanOperation = request.ResourceType == ResourceType.Document && request.OperationType == OperationType.QueryPlan;
+                        if (!isQueryPlanOperation)
+                        {
+                            // 1) The EpkRange spans more than one physical partition
+                            // In this case it means we have encountered a split and
+                            // we need to bubble that up to the higher layers to update their datastructures
+                            CosmosException goneException = new CosmosException(
+                                message: $"Epk Range: {feedRangeEpk.Range} is gone.",
+                                statusCode: System.Net.HttpStatusCode.Gone,
+                                subStatusCode: (int)SubStatusCodes.PartitionKeyRangeGone,
+                                activityId: Guid.NewGuid().ToString(),
+                                requestCharge: default);
+
+                            return goneException.ToCosmosResponseMessage(request);
+                        }
                     }
                     // overlappingRanges.Count == 1
                     else

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/RetryHandlerTests.cs

Lines changed: 138 additions & 0 deletions
@@ -5,11 +5,16 @@
 namespace Microsoft.Azure.Cosmos.Tests
 {
     using System;
+    using System.Collections.Generic;
+    using System.IO;
     using System.Net;
     using System.Net.Http;
     using System.Threading;
     using System.Threading.Tasks;
+    using global::Azure;
     using Microsoft.Azure.Cosmos.Handlers;
+    using Microsoft.Azure.Cosmos.Routing;
+    using Microsoft.Azure.Cosmos.Tracing;
     using Microsoft.Azure.Documents;
     using Microsoft.VisualStudio.TestTools.UnitTesting;
     using Moq;
@@ -18,6 +23,139 @@ namespace Microsoft.Azure.Cosmos.Tests
     public class RetryHandlerTests
     {
         private static readonly Uri TestUri = new Uri("https://dummy.documents.azure.com:443/dbs");
+        [TestMethod]
+        public async Task ValidateQueryPlanDoesNotThrowExceptionForOverlappingRanges()
+        {
+            await this.ValidateOverlappingRangesBehaviorAsync(
+                operationType: OperationType.QueryPlan,
+                shouldThrowGoneException: false);
+        }
+
+        [TestMethod]
+        public async Task ValidateQueryThrowsGoneExceptionForOverlappingRanges()
+        {
+            await this.ValidateOverlappingRangesBehaviorAsync(
+                operationType: OperationType.Query,
+                shouldThrowGoneException: true);
+        }
+
+        private async Task ValidateOverlappingRangesBehaviorAsync(
+            OperationType operationType,
+            bool shouldThrowGoneException)
+        {
+            // Create overlapping ranges for the test
+            List<PartitionKeyRange> overlappingRanges = new List<PartitionKeyRange>
+            {
+                new PartitionKeyRange { Id = "0", MinInclusive = "0D4DC2CD8F49C65A8E0C5306B61B4343", MaxExclusive = "0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED" },
+                new PartitionKeyRange { Id = "1", MinInclusive = "0DCEB8CE51C6BFE84F4BD9409F69B9BB2164DEBD78C50C850E0C1E3E3F0579ED", MaxExclusive = "1080F600C27CF98DC13F8639E94E7676" }
+            };
+
+            // Create a custom document client with our TestPartitionKeyRangeCache
+            var testPartitionKeyRangeCache = new TestPartitionKeyRangeCache(overlappingRanges);
+            var customDocClient = new CustomMockDocumentClient(testPartitionKeyRangeCache);
+
+            // Create CosmosClient with our custom document client
+            using CosmosClient client = new CosmosClient(
+                "https://localhost:8081",
+                MockCosmosUtil.RandomInvalidCorrectlyFormatedAuthKey,
+                new CosmosClientOptions(),
+                customDocClient);
+
+            // Create mock container
+            Mock<ContainerInternal> containerMock = MockCosmosUtil.CreateMockContainer("testDb", "testColl");
+
+            // Setup container properties
+            ContainerProperties containerProps = new ContainerProperties("testColl", "/pk");
+            var resourceIdProperty = typeof(ContainerProperties).GetProperty(
+                "ResourceId",
+                System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
+            resourceIdProperty.SetValue(containerProps, "testCollRid");
+
+            // Set up additional mocks as needed
+            containerMock.Setup(c => c.GetCachedContainerPropertiesAsync(
+                It.IsAny<bool>(), It.IsAny<ITrace>(), It.IsAny<CancellationToken>()))
+                .ReturnsAsync(containerProps);
+
+            Mock<Cosmos.Database> databaseMock = new Mock<Cosmos.Database>();
+            databaseMock.Setup(d => d.Id).Returns("testDb");
+            containerMock.Setup(c => c.Database).Returns(databaseMock.Object);
+
+            // FeedRangeEpk for the test - use a range that overlaps both partition key ranges
+            FeedRangeEpk feedRange = new FeedRangeEpk(new Documents.Routing.Range<string>(
+                "0DCEB8CE51C6BFE84F4BD9409F69B9BB",
+                "0DCEB8CE51C6BFE84F4BD9409F69B9BBFF",
+                true, false));
+
+            RequestInvokerHandler invoker = new RequestInvokerHandler(client, null, null, null)
+            {
+                InnerHandler = new TestHandler((request, token) => TestHandler.ReturnSuccess())
+            };
+
+            // Act
+            ResponseMessage response = await invoker.SendAsync(
+                "dbs/testDb/colls/testColl",
+                ResourceType.Document,
+                operationType,
+                null,
+                containerMock.Object,
+                feedRange,
+                null,
+                null,
+                NoOpTrace.Singleton,
+                CancellationToken.None);
+
+            // Assert
+            Assert.IsNotNull(response, "Response should not be null.");
+
+            if (shouldThrowGoneException)
+            {
+                Assert.IsFalse(response.IsSuccessStatusCode, "Expected a failure status code for Query operation.");
+                Assert.AreEqual(HttpStatusCode.Gone, response.StatusCode, "Expected a 410 Gone status code.");
+                Assert.AreEqual((int)SubStatusCodes.PartitionKeyRangeGone, (int)response.Headers.SubStatusCode, "Expected PartitionKeyRangeGone sub-status code.");
+            }
+            else
+            {
+                Assert.IsTrue(response.IsSuccessStatusCode, $"Expected a successful status code, but got {response.StatusCode}.");
+            }
+        }
+
+        // Custom MockDocumentClient that allows injecting our TestPartitionKeyRangeCache
+        private class CustomMockDocumentClient : MockDocumentClient
+        {
+            private readonly TestPartitionKeyRangeCache testPartitionKeyRangeCache;
+
+            public CustomMockDocumentClient(TestPartitionKeyRangeCache testPartitionKeyRangeCache)
+                : base(new ConnectionPolicy())
+            {
+                this.testPartitionKeyRangeCache = testPartitionKeyRangeCache;
+            }
+
+            internal override Task<PartitionKeyRangeCache> GetPartitionKeyRangeCacheAsync(ITrace trace)
+            {
+                return Task.FromResult<PartitionKeyRangeCache>(this.testPartitionKeyRangeCache);
+            }
+        }
+
+        private class TestPartitionKeyRangeCache : PartitionKeyRangeCache
+        {
+            private readonly IReadOnlyList<PartitionKeyRange> overlappingRanges;
+
+            public TestPartitionKeyRangeCache(IReadOnlyList<PartitionKeyRange> overlappingRanges)
+                : base(null, null, null, null) // Pass nulls or mocks as needed for base constructor
+            {
+                this.overlappingRanges = overlappingRanges;
+            }
+
+            public override Task<IReadOnlyList<PartitionKeyRange>> TryGetOverlappingRangesAsync(
+                string collectionRid,
+                Documents.Routing.Range<string> range,
+                ITrace trace,
+                bool forceRefresh)
+            {
+                return Task.FromResult(this.overlappingRanges);
+            }
+        }
+
 
         [TestMethod]
         public async Task RetryHandlerDoesNotRetryOnSuccess()
