Description
Background:
During one of the backend drills, we identified that when the following quorum loss conditions are met and the user provides a cancellation token, the SDK honors the token but doesn't apply partition-level failover for the offending partition:
- Quorum loss is injected by taking down a quorum of replicas (3 out of 4 replicas are down).
- The primary replica is one of the replicas that are down.
- A cancellation token with a 5-second timeout is provided.
Observation:
- The SDK doesn't apply the partition-level override, and subsequent write requests fail on the current faulty region/partition.
Sample Diagnostics:
Diagnostics-1
{
"Summary": {
"GatewayCalls": {
"(200, 0)": 3
}
},
"name": "CreateItemAsync",
"start datetime": "2025-03-10T20:37:36.289Z",
"duration in milliseconds": 5012.8192,
"data": {
"Client Configuration": {
"Client Created Time Utc": "2025-03-10T20:27:48.6537870Z",
"MachineId": "hashedMachineName:dd823358-1397-c938-1a2d-e52a0b922240",
"NumberOfClientsCreated": 1,
"NumberOfActiveClients": 1,
"ConnectionMode": "Direct",
"User Agent": "cosmos-netstandard-sdk/3.47.2|1|X64|Microsoft Windows 10.0.26100|.NET 8.0.13|L|dkunda-ppaf-writer-app",
"ConnectionConfig": {
"gw": "(cps:50, urto:6, p:False, httpf: False)",
"rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
"other": "(ed:False, be:False)"
},
"ConsistencyConfig": "(consistency: Session, prgns:[North Central US, Central US, West US 2], apprgn: )",
"ProcessorCount": 12
}
},
"children": [
{
"name": "ItemSerialize",
"duration in milliseconds": 0.0391
},
{
"name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
"duration in milliseconds": 5012.3797,
"children": [
{
"name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
"duration in milliseconds": 5012.2604,
"children": [
{
"name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
"duration in milliseconds": 5012.2034,
"children": [
{
"name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
"duration in milliseconds": 5012.1412,
"children": [
{
"name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
"duration in milliseconds": 5012.0268,
"children": [
{
"name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
"duration in milliseconds": 5011.9925,
"children": [
{
"name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
"duration in milliseconds": 5011.821,
"data": {
"Client Side Request Stats": {
"Id": "AggregatedClientSideRequestStatistics",
"ContactedReplicas": [
{
"Count": 1,
"Uri": "rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
}
],
"RegionsContacted": [
],
"FailedReplicas": [
],
"ForceAddressRefresh": [
{
"No change to cache": [
"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
]
},
{
"No change to cache": [
"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
]
},
{
"No change to cache": [
"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
]
}
],
"AddressResolutionStatistics": [
{
"StartTimeUTC": "2025-03-10T20:37:36.2907018Z",
"EndTimeUTC": "2025-03-10T20:37:36.4220266Z",
"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
},
{
"StartTimeUTC": "2025-03-10T20:37:37.4364198Z",
"EndTimeUTC": "2025-03-10T20:37:37.5116960Z",
"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
},
{
"StartTimeUTC": "2025-03-10T20:37:39.5211252Z",
"EndTimeUTC": "2025-03-10T20:37:39.6584297Z",
"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
}
],
"StoreResponseStatistics": [
],
"HttpResponseStats": [
{
"StartTimeUTC": "2025-03-10T20:37:36.2907438Z",
"DurationInMs": 69.7863,
"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
"ResourceType": "Document",
"HttpMethod": "GET",
"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
"StatusCode": "OK"
},
{
"StartTimeUTC": "2025-03-10T20:37:37.4364448Z",
"DurationInMs": 75.2017,
"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
"ResourceType": "Document",
"HttpMethod": "GET",
"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
"StatusCode": "OK"
},
{
"StartTimeUTC": "2025-03-10T20:37:39.5211781Z",
"DurationInMs": 73.2295,
"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
"ResourceType": "Document",
"HttpMethod": "GET",
"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
"StatusCode": "OK"
}
]
}
}
}
]
}
]
}
]
}
]
}
]
}
]
},
{
"name": "CosmosOperationCanceledException",
"duration in milliseconds": 0.0125,
"data": {
"Operation Cancelled Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy)\r\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`3 callShouldRetryException, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.TelemetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at 
Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.BaseSendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action`1 requestEnricher, ITrace trace, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable`1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, Nullable`1 targetResponseSerializationFormat, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.ContainerCore.ExtractPartitionKeyAndProcessItemStreamAsync[T](Nullable`1 partitionKey, String itemId, T item, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.ContainerCore.CreateItemAsync[T](T item, ITrace trace, Nullable`1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](String containerName, String databaseName, OperationType operationType, ITrace trace, Func`2 task, Nullable`1 openTelemetry, RequestOptions requestOptions, Nullable`1 resourceType)"
}
}
]
}
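To make the timeline in Diagnostics-1 easier to read, here is a small Python sketch that derives it from values copied out of the trace above. The helper `parse_utc` and the variable names are illustrative only; nothing here is part of the SDK. It shows that the three address refreshes start roughly 1.1 s and 2.1 s apart, that no store response was ever recorded, and that the operation ended at about 5 s, which matches the cancellation token timeout.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse a diagnostics timestamp, trimming .NET's 7-digit fraction to 6 digits."""
    head, _, frac = ts.rstrip("Z").partition(".")
    return datetime.strptime(f"{head}.{(frac or '0')[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


# Values copied from Diagnostics-1.
operation_start = parse_utc("2025-03-10T20:37:36.289Z")
duration_ms = 5012.8192
address_refresh_starts = [
    "2025-03-10T20:37:36.2907018Z",
    "2025-03-10T20:37:37.4364198Z",
    "2025-03-10T20:37:39.5211252Z",
]
store_responses = []  # "StoreResponseStatistics" is empty in the trace

# Offset of each address refresh from the start of CreateItemAsync.
offsets_ms = [
    (parse_utc(ts) - operation_start).total_seconds() * 1000
    for ts in address_refresh_starts
]
# Time between consecutive address refreshes.
gaps_ms = [later - earlier for earlier, later in zip(offsets_ms, offsets_ms[1:])]

for attempt, offset in enumerate(offsets_ms, start=1):
    print(f"address refresh {attempt}: +{offset:.1f} ms")
print("gaps between refreshes (ms):", [round(gap, 1) for gap in gaps_ms])
print("store responses recorded:", len(store_responses))
print(f"operation cancelled after {duration_ms:.1f} ms")
```

Every address refresh returned `OK` with "No change to cache", so the SDK kept resolving to the same replica URI until the token fired.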
Acceptance Criteria:
- The SDK should apply the partition-level regional override for the faulty partition, even when the request ends because the user's cancellation token fired.
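The expected behavior can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not the SDK's implementation: `PartitionFailoverTable`, `OperationCancelled`, and `write` are hypothetical names, and the cancellation is simulated rather than driven by a real token. The region list and partition key range id `0` are taken from the diagnostics above.

```python
from dataclasses import dataclass, field


class OperationCancelled(Exception):
    """Stands in for the user's cancellation token firing (hypothetical)."""


@dataclass
class PartitionFailoverTable:
    """Hypothetical per-partition region override map, not the SDK's real type."""
    preferred_regions: list
    overrides: dict = field(default_factory=dict)

    def region_for(self, partition_id: str) -> str:
        # Use the override if one exists, else the first preferred region.
        return self.overrides.get(partition_id, self.preferred_regions[0])

    def mark_unavailable(self, partition_id: str) -> None:
        # Move this partition (and only this partition) to the next region.
        current = self.region_for(partition_id)
        index = self.preferred_regions.index(current)
        if index + 1 < len(self.preferred_regions):
            self.overrides[partition_id] = self.preferred_regions[index + 1]


def write(table: PartitionFailoverTable, partition_id: str, faulty: set) -> str:
    """Simulated write: a faulty (region, partition) pair ends in cancellation."""
    region = table.region_for(partition_id)
    try:
        if (region, partition_id) in faulty:
            raise OperationCancelled(f"timed out on {region}")
        return region
    except OperationCancelled:
        # Record the override first, then still honor the cancellation.
        table.mark_unavailable(partition_id)
        raise


table = PartitionFailoverTable(["North Central US", "Central US", "West US 2"])
faulty = {("North Central US", "0")}  # quorum loss on partition 0 in one region

try:
    write(table, "0", faulty)
except OperationCancelled:
    pass  # the caller still observes the cancellation

second_attempt_region = write(table, "0", faulty)
print(second_attempt_region)
```

In this model the first write is still cancelled, but the override is recorded before the exception propagates, so the next write for the same partition goes to the next preferred region while other partitions stay on the original one.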