[Per Partition Automatic Failover] - Apply Partition Level Failover When Cancellation is Requested on a User Provided Cancellation Token #5060

Closed
@kundadebdatta

Background:

During a backend drill, we found that when the following quorum-loss conditions are met and the user provides a cancellation token, the SDK honors the token but does not apply partition-level failover for the affected partition:

  • Quorum loss is injected on the partition's replica set (3 out of 4 replicas are down).
  • The primary replica is among the replicas that are down.
  • A cancellation token with a 5-second timeout is provided.

Observation:

  • The SDK doesn't apply the partition-level override, and subsequent write requests fail against the same faulty region/partition.

Sample Diagnostics:

Diagnostics-1
{
	"Summary": {
		"GatewayCalls": {
			"(200, 0)": 3
		}
	},
	"name": "CreateItemAsync",
	"start datetime": "2025-03-10T20:37:36.289Z",
	"duration in milliseconds": 5012.8192,
	"data": {
		"Client Configuration": {
			"Client Created Time Utc": "2025-03-10T20:27:48.6537870Z",
			"MachineId": "hashedMachineName:dd823358-1397-c938-1a2d-e52a0b922240",
			"NumberOfClientsCreated": 1,
			"NumberOfActiveClients": 1,
			"ConnectionMode": "Direct",
			"User Agent": "cosmos-netstandard-sdk/3.47.2|1|X64|Microsoft Windows 10.0.26100|.NET 8.0.13|L|dkunda-ppaf-writer-app",
			"ConnectionConfig": {
				"gw": "(cps:50, urto:6, p:False, httpf: False)",
				"rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
				"other": "(ed:False, be:False)"
			},
			"ConsistencyConfig": "(consistency: Session, prgns:[North Central US, Central US, West US 2], apprgn: )",
			"ProcessorCount": 12
		}
	},
	"children": [
		{
			"name": "ItemSerialize",
			"duration in milliseconds": 0.0391
		},
		{
			"name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
			"duration in milliseconds": 5012.3797,
			"children": [
				{
					"name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
					"duration in milliseconds": 5012.2604,
					"children": [
						{
							"name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
							"duration in milliseconds": 5012.2034,
							"children": [
								{
									"name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
									"duration in milliseconds": 5012.1412,
									"children": [
										{
											"name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
											"duration in milliseconds": 5012.0268,
											"children": [
												{
													"name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
													"duration in milliseconds": 5011.9925,
													"children": [
														{
															"name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
															"duration in milliseconds": 5011.821,
															"data": {
																"Client Side Request Stats": {
																	"Id": "AggregatedClientSideRequestStatistics",
																	"ContactedReplicas": [
																		{
																			"Count": 1,
																			"Uri": "rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																		}
																	],
																	"RegionsContacted": [

																	],
																	"FailedReplicas": [

																	],
																	"ForceAddressRefresh": [
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		}
																	],
																	"AddressResolutionStatistics": [
																		{
																			"StartTimeUTC": "2025-03-10T20:37:36.2907018Z",
																			"EndTimeUTC": "2025-03-10T20:37:36.4220266Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:37.4364198Z",
																			"EndTimeUTC": "2025-03-10T20:37:37.5116960Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:39.5211252Z",
																			"EndTimeUTC": "2025-03-10T20:37:39.6584297Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		}
																	],
																	"StoreResponseStatistics": [

																	],
																	"HttpResponseStats": [
																		{
																			"StartTimeUTC": "2025-03-10T20:37:36.2907438Z",
																			"DurationInMs": 69.7863,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:37.4364448Z",
																			"DurationInMs": 75.2017,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:39.5211781Z",
																			"DurationInMs": 73.2295,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		}
																	]
																}
															}
														}
													]
												}
											]
										}
									]
								}
							]
						}
					]
				}
			]
		},
		{
			"name": "CosmosOperationCanceledException",
			"duration in milliseconds": 0.0125,
			"data": {
				"Operation Cancelled Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy)\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`3 callShouldRetryException, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.TelemetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.BaseSendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action`1 requestEnricher, ITrace trace, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable`1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, Nullable`1 targetResponseSerializationFormat, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.ExtractPartitionKeyAndProcessItemStreamAsync[T](Nullable`1 partitionKey, String itemId, T item, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.CreateItemAsync[T](T item, ITrace trace, Nullable`1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](String containerName, String databaseName, OperationType operationType, ITrace trace, Func`2 task, Nullable`1 openTelemetry, RequestOptions requestOptions, Nullable`1 resourceType)"
			}
		}
	]
}
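
The pattern in Diagnostics-1 (repeated forced address refreshes, no store responses, and a trailing cancellation node) can be checked mechanically when triaging similar traces. Below is a minimal Python sketch, assuming the diagnostics JSON has the same shape as the sample above. The helper names `walk` and `summarize_trace` are hypothetical and are not part of the SDK.

```python
def walk(node):
    """Yield every trace node in the diagnostics tree, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def summarize_trace(diagnostics):
    """Summarize the signals that appear in Diagnostics-1.

    `diagnostics` is the parsed JSON object (a dict), for example the
    result of json.loads() on the diagnostics string.
    """
    cancelled = False
    store_responses = 0
    address_refreshes = 0
    for node in walk(diagnostics):
        if node.get("name") == "CosmosOperationCanceledException":
            cancelled = True
        stats = node.get("data", {}).get("Client Side Request Stats")
        if stats:
            store_responses += len(stats.get("StoreResponseStatistics", []))
            address_refreshes += len(stats.get("ForceAddressRefresh", []))
    return {
        "cancelled": cancelled,
        "store_responses": store_responses,
        "address_refreshes": address_refreshes,
        # The request was cancelled before any replica answered.
        "cancelled_without_store_response": cancelled and store_responses == 0,
    }


if __name__ == "__main__":
    # Heavily trimmed stand-in for Diagnostics-1; only the keys read above.
    trimmed = {
        "name": "CreateItemAsync",
        "children": [
            {
                "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
                "data": {
                    "Client Side Request Stats": {
                        "ForceAddressRefresh": [{}, {}, {}],
                        "StoreResponseStatistics": [],
                    }
                },
            },
            {"name": "CosmosOperationCanceledException", "data": {}},
        ],
    }
    print(summarize_trace(trimmed))
```

A trace where `cancelled_without_store_response` is true matches the situation described in this issue: the token fired while the SDK was still retrying against the same replica address.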

Acceptance Criteria:

  • The SDK should apply the partition-level regional override for the faulty partition, even when the request ends through a user-provided cancellation token.
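
One way to read this criterion is sketched below as a small language-neutral model in Python. It is not the SDK's implementation: `PartitionFailoverTable`, `write_item`, and `OperationCancelled` are hypothetical names, and failing over on the first cancellation (with no threshold) is an assumption made to keep the example short. The point it illustrates is that the cancellation is still surfaced to the caller, but the override is recorded first so the next write to the same partition goes to another region.

```python
from dataclasses import dataclass, field


class OperationCancelled(Exception):
    """Hypothetical stand-in for the cancellation surfaced to the caller."""


@dataclass
class PartitionFailoverTable:
    """Hypothetical per-partition region override table."""
    preferred_regions: list
    # partition id -> index into preferred_regions
    overrides: dict = field(default_factory=dict)

    def region_for(self, partition_id):
        """Return the region currently used for this partition."""
        return self.preferred_regions[self.overrides.get(partition_id, 0)]

    def mark_failed(self, partition_id):
        """Move the partition to the next preferred region.

        Returns False when no further region is available.
        """
        next_index = self.overrides.get(partition_id, 0) + 1
        if next_index >= len(self.preferred_regions):
            return False
        self.overrides[partition_id] = next_index
        return True


def write_item(table, partition_id, send):
    """Send a write to the partition's current region.

    `send` is a caller-supplied function taking a region name. On
    cancellation the override is recorded before re-raising, so the
    caller still sees the cancellation but later writes are rerouted.
    """
    region = table.region_for(partition_id)
    try:
        return send(region)
    except OperationCancelled:
        table.mark_failed(partition_id)
        raise
```

With the preferred regions from the sample diagnostics (North Central US, Central US, West US 2), a cancelled write against partition key range 0 in North Central US would leave the table pointing that partition at Central US for the next attempt, while other partitions keep using North Central US.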
