
[Concurrent Hedging] - Write Requests Don't Hedge When Partition Is in Quorum Loss and a Cancellation Token Is Provided #5061

@kundadebdatta

Background:

During a random PPAF exercise, we found that when the following quorum-loss conditions are met and the user provides a cancellation token, the SDK honors the token but the write request is never hedged, even though the hedging threshold is much lower than the cancellation token's expiry time:

  • Quorum loss is injected by taking down a quorum of replicas (3 out of 4 replicas are down).
  • The primary replica is one of the replicas that are down.
  • A cancellation token with a 5-second timeout is provided.
  • An availability strategy with a predefined threshold of 1 second is provided.
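
To make the expectation concrete, here is a minimal sketch of when each preferred region would first be tried under the settings above. It assumes the usual reading of the hedging parameters (the first hedge starts once `threshold` has elapsed, and each further hedge starts one `thresholdStep` after the previous one). `HedgeSchedule` and `Expected` are hypothetical names used only for this illustration; they are not part of the SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an SDK type: models the expected hedge timeline.
public static class HedgeSchedule
{
    // Returns the offset from request start at which each region would first be tried.
    // Assumption: region 0 starts at 0, region 1 at `threshold`, and every later
    // region one `thresholdStep` after the previous hedge.
    public static IReadOnlyList<(string Region, TimeSpan Offset)> Expected(
        IReadOnlyList<string> preferredRegions,
        TimeSpan threshold,
        TimeSpan thresholdStep,
        TimeSpan cancelAfter)
    {
        var schedule = new List<(string Region, TimeSpan Offset)>();
        for (int i = 0; i < preferredRegions.Count; i++)
        {
            TimeSpan offset = i == 0
                ? TimeSpan.Zero
                : threshold + TimeSpan.FromTicks(thresholdStep.Ticks * (i - 1));

            // A hedge that would start at or after the cancellation deadline never runs.
            if (offset >= cancelAfter)
            {
                break;
            }

            schedule.Add((preferredRegions[i], offset));
        }

        return schedule;
    }
}
```

With the three preferred regions from this issue, a 1000 ms threshold, a 50 ms step, and a 5-second cancellation, this model yields offsets of 0 ms, 1000 ms, and 1050 ms, so both hedges would be expected to start well before the token expires.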

Sample Code:

CosmosClientOptions clientOptions = new CosmosClientOptions
{
	ApplicationName = "dkunda-ppaf-app",
	EnableContentResponseOnWrite = true,
	ApplicationPreferredRegions = new List<string> { Regions.NorthCentralUS, Regions.CentralUS, Regions.WestUS2 },
	ConnectionMode = ConnectionMode.Direct,
	ConsistencyLevel = Cosmos.ConsistencyLevel.Session,
	AvailabilityStrategy = AvailabilityStrategy.CrossRegionHedgingStrategy(
		threshold: TimeSpan.FromMilliseconds(1000), // Threshold is much lower than the cancellation token expiry time.
		thresholdStep: TimeSpan.FromMilliseconds(50),
		enableMultiWriteRegionHedge: true),
};

CancellationTokenSource cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(5)); // Cancellation token expiry time is 5 seconds.

Comment comment = new Comment(Guid.NewGuid().ToString(), "pk", random.Next().ToString(), "[email protected]", "This document is intended for ppaf testing demo.");

ItemResponse<Comment> writeResponse = await container.CreateItemAsync<Comment>(
	item: comment,
	partitionKey: new Cosmos.PartitionKey(comment.postId),
	requestOptions: requestOptions,
	cancellationToken: cts.Token
);

Observation:

  • Write requests are not hedged to any other region.

Acceptance Criteria:

  • Write requests should hedge to the other regions listed in the preferred regions.

Sample Diagnostics:

Diagnostics-1
{
	"Summary": {
		"GatewayCalls": {
			"(200, 0)": 3
		}
	},
	"name": "CreateItemAsync",
	"start datetime": "2025-03-11T01:20:26.062Z",
	"duration in milliseconds": 5002.7305,
	"data": {
		"Client Configuration": {
			"Client Created Time Utc": "2025-03-11T01:19:53.2738777Z",
			"MachineId": "hashedMachineName:dd823358-1397-c938-1a2d-e52a0b922240",
			"NumberOfClientsCreated": 1,
			"NumberOfActiveClients": 1,
			"ConnectionMode": "Direct",
			"User Agent": "cosmos-netstandard-sdk/3.47.2|1|X64|Microsoft Windows 10.0.26100|.NET 6.0.36|L|F1|dkunda-ppaf-app",
			"ConnectionConfig": {
				"gw": "(cps:50, urto:6, p:False, httpf: False)",
				"rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
				"other": "(ed:False, be:False)"
			},
			"ConsistencyConfig": "(consistency: Session, prgns:[North Central US, Central US, West US 2], apprgn: )",
			"ProcessorCount": 12
		}
	},
	"children": [
		{
			"name": "ItemSerialize",
			"duration in milliseconds": 0.0684
		},
		{
			"name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
			"duration in milliseconds": 5002.0814,
			"children": [
				{
					"name": "Get Collection Cache",
					"duration in milliseconds": 0.0015
				},
				{
					"name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
					"duration in milliseconds": 5001.2278,
					"children": [
						{
							"name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
							"duration in milliseconds": 5001.0083,
							"children": [
								{
									"name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
									"duration in milliseconds": 5000.8221,
									"children": [
										{
											"name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
											"duration in milliseconds": 5000.4928,
											"children": [
												{
													"name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
													"duration in milliseconds": 5000.3789,
													"children": [
														{
															"name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
															"duration in milliseconds": 5000.0713,
															"data": {
																"Client Side Request Stats": {
																	"Id": "AggregatedClientSideRequestStatistics",
																	"ContactedReplicas": [
																		{
																			"Count": 1,
																			"Uri": "rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																		}
																	],
																	"RegionsContacted": [

																	],
																	"FailedReplicas": [

																	],
																	"ForceAddressRefresh": [
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		}
																	],
																	"AddressResolutionStatistics": [
																		{
																			"StartTimeUTC": "2025-03-11T01:20:26.0641997Z",
																			"EndTimeUTC": "2025-03-11T01:20:26.1964295Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-11T01:20:27.2010140Z",
																			"EndTimeUTC": "2025-03-11T01:20:27.3422914Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-11T01:20:29.3649661Z",
																			"EndTimeUTC": "2025-03-11T01:20:29.7096044Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		}
																	],
																	"StoreResponseStatistics": [

																	],
																	"HttpResponseStats": [
																		{
																			"StartTimeUTC": "2025-03-11T01:20:26.0642570Z",
																			"DurationInMs": 73.8448,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "4fd31309-a2ac-48da-ab8a-3be945588581",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-11T01:20:27.2010746Z",
																			"DurationInMs": 74.2997,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "4fd31309-a2ac-48da-ab8a-3be945588581",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-11T01:20:29.3650050Z",
																			"DurationInMs": 279.9272,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "4fd31309-a2ac-48da-ab8a-3be945588581",
																			"StatusCode": "OK"
																		}
																	]
																}
															}
														}
													]
												}
											]
										}
									]
								}
							]
						}
					]
				},
				{
					"name": "CosmosOperationCanceledException",
					"duration in milliseconds": 0.0139,
					"data": {
						"Operation Cancelled Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy)\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\TransportHandler.cs:line 122\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\TransportHandler.cs:line 33\r\n   at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RouterHandler.cs:line 42\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RequestHandler.cs:line 59\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`3 callShouldRetryException, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\AbstractRetryHandler.cs:line 75\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\AbstractRetryHandler.cs:line 28\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RequestHandler.cs:line 59\r\n   at Microsoft.Azure.Cosmos.Handlers.TelemetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\TelemetryHandler.cs:line 28\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RequestHandler.cs:line 59\r\n   at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\DiagnosticsHandler.cs:line 26\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RequestHandler.cs:line 59\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.BaseSendAsync(RequestMessage request, CancellationToken cancellationToken) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Handler\\RequestInvokerHandler.cs:line 144\r\n   at Microsoft.Azure.Cosmos.CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync(Func`3 sender, RequestMessage request, String hedgedRegion, CancellationToken cancellationToken, CancellationTokenSource cancellationTokenSource, ITrace trace) in D:\\stash\\azure-cosmos-dotnet-v3\\Microsoft.Azure.Cosmos\\src\\Routing\\AvailabilityStrategy\\CrossRegionHedgingAvailabilityStrategy.cs:line 298"
					}
				}
			]
		}
	]
}
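
In the diagnostics above, `RegionsContacted` is an empty array and every address resolution targets the North Central US endpoint. The following is a rough sketch of one way to check a diagnostics string for contacted regions programmatically. `DiagnosticsScan` is a hypothetical helper, and treating additional `RegionsContacted` entries as a sign that a hedge reached another region is an assumption about how the SDK records hedged requests, not something this issue confirms.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not an SDK type: scans a diagnostics JSON string.
public static class DiagnosticsScan
{
    // Collects every entry found in any "RegionsContacted" array, at any depth.
    public static List<string> RegionsContacted(string diagnosticsJson)
    {
        var regions = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(diagnosticsJson);
        Walk(doc.RootElement, regions);
        return regions;
    }

    private static void Walk(JsonElement element, List<string> regions)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty property in element.EnumerateObject())
                {
                    if (property.NameEquals("RegionsContacted")
                        && property.Value.ValueKind == JsonValueKind.Array)
                    {
                        // Entries are recorded as text, whatever their JSON kind.
                        foreach (JsonElement region in property.Value.EnumerateArray())
                        {
                            regions.Add(region.ToString());
                        }
                    }
                    else
                    {
                        Walk(property.Value, regions);
                    }
                }
                break;

            case JsonValueKind.Array:
                foreach (JsonElement item in element.EnumerateArray())
                {
                    Walk(item, regions);
                }
                break;
        }
    }
}
```

Passing the diagnostics string from this issue (for example, the value of `Diagnostics.ToString()` on the caught exception) would return an empty list, since the only `RegionsContacted` array in it has no entries.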

Metadata

Labels: Hedging (any issue/feature request related to request hedging)

Status: In Progress

Milestone: No milestone