Commit 6f56e40
CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation (#5613)
## CrossRegionHedgingAvailabilityStrategy: Fixes `ArgumentNullException` race condition in hedging cancellation
### Bug
Multiple production customers reported unobserved
`ArgumentNullException: Value cannot be null. (Parameter 'request')`
crashes originating from
`CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync`.
The exception was surfaced as a `TaskScheduler_UnobservedTaskException`,
crashing the process. The affected code paths were:
- `ContainerCore.ReadItemAsync` → `ReadItemStreamAsync` →
`ProcessItemStreamAsync` → `RequestInvokerHandler.SendAsync` →
`CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync`
### Root Cause
A **race condition** caused by passing the wrong `CancellationToken` to
the sender delegate:
1. `ExecuteAvailabilityStrategyAsync` creates a
`hedgeRequestsCancellationTokenSource` (linked to the app-provided CT)
to coordinate hedge request lifecycle.
2. `CloneAndSendAsync` clones the request inside a `using` block and
calls `RequestSenderAndResultCheckAsync`.
3. **Bug:** `RequestSenderAndResultCheckAsync` called
`sender.Invoke(request, cancellationToken)` with the
**application-provided `CancellationToken`** — not
`hedgeRequestsCancellationTokenSource.Token`.
4. When hedge Region B returned a final result (e.g., 200 OK),
`hedgeRequestsCancellationTokenSource.Cancel()` was called.
5. **But the in-flight sender for Region A still held the app CT**
(e.g., `CancellationToken.None`), which was **never cancelled**.
6. The `CloneAndSendAsync` `using` block exited, **disposing the cloned
request**.
7. The Region A sender continued executing with a reference to the
now-disposed request → **`ArgumentNullException: Value cannot be null.
(Parameter 'request')`**.

A secondary issue: when the application CT was cancelled (for example, by an e2e timeout), the hedge timer, which is linked to the app CT, completed early. The old code treated that as a normal timer expiry and continued the loop, attempting to clone and send new requests on an already-cancelled path.
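
The direction of the link between the two tokens is what makes step 5 possible. The following is a minimal sketch, not SDK code: `appCts` and `hedgeCts` are hypothetical stand-ins for the application-provided token and `hedgeRequestsCancellationTokenSource`. It shows that cancelling a linked source does not cancel the token it was linked to, so a sender holding the app CT never learns that hedging has finished.

```csharp
using System;
using System.Threading;

// Hypothetical stand-ins; these names are not taken from the SDK source.
using var appCts = new CancellationTokenSource();
CancellationToken appToken = appCts.Token;

// The hedge CTS is linked to the app token: cancelling the app token
// cancels the hedge token, but not the other way around.
using var hedgeCts = CancellationTokenSource.CreateLinkedTokenSource(appToken);

// A hedge returned a final result, so the strategy cancels the hedge CTS.
hedgeCts.Cancel();

Console.WriteLine($"hedge token cancelled: {hedgeCts.Token.IsCancellationRequested}"); // True
Console.WriteLine($"app token cancelled:   {appToken.IsCancellationRequested}");       // False
```

A sender that was handed `appToken` keeps running after `hedgeCts.Cancel()`, which is the window in which the cloned request can be disposed underneath it.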
### Fix
Two changes in `CrossRegionHedgingAvailabilityStrategy.cs`:
**1. Pass `hedgeRequestsCancellationTokenSource.Token` to `sender.Invoke()` instead of the app CT**

This ensures that when **any** hedge gets a final result and calls `hedgeRequestsCancellationTokenSource.Cancel()`, **all** in-flight senders see their CT cancelled and can stop before the cloned request is disposed. The separate `CancellationTokenSource` and `CancellationToken` parameters were also consolidated into a single `hedgeRequestsCancellationTokenSource` parameter that is passed through `CloneAndSendAsync` → `RequestSenderAndResultCheckAsync`.
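
As a rough illustration of the first change, the sketch below hands a slow sender the hedge token rather than the app token. `SlowSenderAsync` is a hypothetical stand-in for the sender delegate, and the 30-second delay simulates a slow region; none of this is the SDK's implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a slow regional sender; not the SDK's delegate.
static async Task<string> SlowSenderAsync(string region, CancellationToken token)
{
    try
    {
        await Task.Delay(TimeSpan.FromSeconds(30), token);
        return $"{region}: completed";
    }
    catch (OperationCanceledException)
    {
        return $"{region}: cancelled";
    }
}

CancellationToken appToken = CancellationToken.None; // never cancelled
using var hedgeCts = CancellationTokenSource.CreateLinkedTokenSource(appToken);

// Pass the hedge token (the fix), not appToken (the bug).
Task<string> regionA = SlowSenderAsync("Region A", hedgeCts.Token);

// Region B returned a final result, so the remaining hedges are cancelled.
hedgeCts.Cancel();

string outcome = await regionA;
Console.WriteLine(outcome); // Region A: cancelled
```

Had `appToken` been passed instead, `regionA` would still be running after `Cancel()` and would only finish once the 30-second delay elapsed.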
**2. Add a `do/while` loop to handle spurious timer completions on app CT cancellation**

When the app CT is cancelled (e2e timeout), the hedge timer completes via the linked CTS. The old code would `continue` the loop and try to clone a new request. The `do/while` loop now checks `applicationProvidedCancellationToken.IsCancellationRequested` and, when it is set, falls through to consolidate the outcomes of the existing requests instead of spawning new hedges.
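
The idea behind the second change can be sketched as follows. This is a simplified illustration, not the commit's code: it uses a plain `for` loop with a `break` rather than the actual `do/while`, and `requestsSent`, `regionCount`, and the 20 ms interval are made-up values. The app cancellation is triggered explicitly so the example is deterministic.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var appCts = new CancellationTokenSource();
using var hedgeCts = CancellationTokenSource.CreateLinkedTokenSource(appCts.Token);

const int regionCount = 3;
int requestsSent = 0;

for (int region = 0; region < regionCount; region++)
{
    requestsSent++; // stand-in for cloning and sending a request to this region

    // The hedge timer is tied to the linked CTS, so app cancellation completes it early.
    Task hedgeTimer = Task.Delay(TimeSpan.FromMilliseconds(20), hedgeCts.Token);

    if (region == 1)
    {
        appCts.Cancel(); // simulate the e2e timeout during the second wait
    }

    // WhenAny does not throw for a cancelled timer; it only reports completion.
    await Task.WhenAny(hedgeTimer);

    if (appCts.Token.IsCancellationRequested)
    {
        // The timer completed only because the app token was cancelled:
        // stop hedging and fall through to collect the existing outcomes.
        break;
    }
}

Console.WriteLine($"requests sent: {requestsSent}"); // requests sent: 2
```

Without the `IsCancellationRequested` check, the loop would treat the cancelled timer as an expiry and send a third request.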
### Tests Added (8 new unit tests)
| Test | Validates |
|---|---|
| `HedgeCancellationCancelsInFlightRequests_NoNullRef` | Slow primary request's CT is cancelled when a hedge returns a final result (core regression test) |
| `SenderReceivesHedgeCancellationToken_NotAppToken` | Captures the actual CT passed to each sender and asserts all are from the hedge CTS, not the app CT |
| `AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests` | E2e timeout (app CT cancelled) does not spawn new hedge requests; validates the do/while loop fix |
| `MultiRegionHedging_RequestNotAccessedAfterDisposal` | Verifies the cloned request is still accessible when cancellation fires (exact scenario from the crash reports) |
| `HedgeCancellation_StreamRequest_NoNullRef` | Tests the stream-based code path (`ReadItemStreamAsync`) from the NullRef2/NullRef3 stack traces |
| `PrimaryRequestFinalResult_NoAdditionalHedgesSent` | Fast primary response skips hedging entirely |
| `AllHedgesTransientError_ReturnsLastResponse` | All regions return transient errors; strategy returns last response without NullRef |
| `ConcurrentHedgingRequests_NoNullRef` | Stress test: 50 concurrent hedging requests with random delays; no NullRef under concurrency |
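
A token-capturing check in the spirit of `SenderReceivesHedgeCancellationToken_NotAppToken` might look like the sketch below. It is an assumption-laden illustration rather than the test from the commit: the `sender` delegate, the region names, and the loop standing in for the strategy are all hypothetical. It relies on `CancellationToken` equality, which compares the underlying source.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

using var appCts = new CancellationTokenSource();
using var hedgeCts = CancellationTokenSource.CreateLinkedTokenSource(appCts.Token);

var captured = new List<CancellationToken>();

// Hypothetical sender that records the token it was given and returns a status code.
Func<string, CancellationToken, Task<int>> sender = (region, token) =>
{
    captured.Add(token);
    return Task.FromResult(200);
};

// Stand-in for the strategy invoking the sender once per region.
foreach (string region in new[] { "Region A", "Region B" })
{
    await sender(region, hedgeCts.Token);
}

bool allFromHedgeCts = captured.TrueForAll(t => t == hedgeCts.Token);
bool anyFromAppCts = captured.Exists(t => t == appCts.Token);

Console.WriteLine($"all from hedge CTS: {allFromHedgeCts}"); // True
Console.WriteLine($"any from app CTS:   {anyFromAppCts}");   // False
```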
### Type of change
- [x] Bug fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com>

1 parent f9d76eb, commit 6f56e40 (#5613)
2 files changed, 601 additions and 25 deletions:
- Microsoft.Azure.Cosmos
  - src/Routing/AvailabilityStrategy (61 additions, 25 deletions)
  - tests/Microsoft.Azure.Cosmos.Tests