[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout #5689

Open
Meghana-Palaparthi wants to merge 25 commits into main from
users/Meghana-Palaparthi/DTS_timeout_handling

Conversation

@Meghana-Palaparthi (Contributor) commented Mar 10, 2026

Description

This pull request introduces robust retry logic for distributed transaction commits and improves the handling and parsing of distributed transaction responses in the Cosmos DB SDK. The changes ensure that commit operations are retried safely in the event of timeouts or retriable errors, enhance diagnostics, and make response parsing more resilient. Additionally, the request and response classes are refactored for safer stream handling and improved reliability.

Reference doc for DTX retries: dtx-sdk-response-status-codes.md

Distributed transaction commit improvements:

  • Added exponential backoff retry logic for distributed transaction commits in DistributedTransactionCommitter, specifically handling timeouts and retriable errors with idempotency token support.
  • Improved error handling to distinguish between cancellation and other exceptions during commit attempts.
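
The commit loop described above is, in essence, exponential backoff around an idempotent send. A minimal sketch of that control flow (Python pseudocode; `send_commit`, the delay constant, and the retry budget here are illustrative stand-ins, not the SDK's actual API):

```python
import random
import time

BASE_DELAY_S = 1.0   # analogous to a base retry delay of 1 second
MAX_RETRIES = 10     # hypothetical budget, not the SDK's constant

def commit_with_retry(send_commit, idempotency_token, sleep=time.sleep):
    """Retry a commit on timeout or a retriable response, with exponential backoff.

    The same idempotency token is sent on every attempt so the server can
    deduplicate a commit that actually succeeded just before the client timed out.
    """
    last_error = None
    for attempt in range(MAX_RETRIES + 1):
        try:
            response = send_commit(idempotency_token)
        except TimeoutError as exc:
            last_error = exc
        else:
            if not response.get("isRetriable", False):
                return response          # terminal outcome, success or failure
            last_error = RuntimeError("server marked response retriable")
        # exponential backoff with a little jitter: ~1s, 2s, 4s, ...
        sleep(BASE_DELAY_S * (2 ** attempt) + random.uniform(0, 0.1))
    raise last_error
```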

Response parsing and diagnostics enhancements:

  • Enhanced distributed transaction response parsing to extract isRetriable and serverDiagnostics fields, and improved resilience to partial JSON parsing failures.
  • Added the IsRetriable property to DistributedTransactionResponse and ensured it is correctly populated from server responses.
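
A defensive parse of those two fields might look like the following sketch (the field names `isRetriable` and `serverDiagnostics` are taken from the description above; the function itself is illustrative, not the SDK's parser):

```python
import json

def parse_dtx_response(body: bytes) -> dict:
    """Extract isRetriable and serverDiagnostics, tolerating partial/invalid JSON.

    A malformed body should not mask the original server error, so parsing
    failures fall back to conservative defaults instead of raising.
    """
    result = {"isRetriable": False, "serverDiagnostics": None}
    try:
        doc = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return result  # partial or non-JSON body: keep safe defaults
    if isinstance(doc, dict):
        if isinstance(doc.get("isRetriable"), bool):
            result["isRetriable"] = doc["isRetriable"]
        if "serverDiagnostics" in doc:
            result["serverDiagnostics"] = doc["serverDiagnostics"]
    return result
```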

Request stream handling improvements:

  • Refactored DistributedTransactionServerRequest to use a pre-serialized byte array for the request body, enabling safe creation of new memory streams for each retry and preventing disposal issues.
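
The underlying problem is generic: a stream consumed or disposed by attempt N cannot back attempt N+1. Serializing once to immutable bytes and minting a fresh stream per attempt sidesteps it. A language-agnostic sketch (the class and method names are hypothetical, not the SDK's):

```python
import io
import json

class DtxServerRequest:
    """Holds the request body as immutable bytes, serialized exactly once."""

    def __init__(self, payload: dict):
        self._body = json.dumps(payload).encode("utf-8")

    def open_body_stream(self) -> io.BytesIO:
        # Each retry gets its own stream positioned at 0, so a prior attempt
        # consuming or closing its stream cannot break the next attempt.
        return io.BytesIO(self._body)
```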

Reliability and correctness fixes:

  • Ensured proper disposal checks in enumerator and count properties of DistributedTransactionResponse.
  • Improved deserialization error handling in DistributedTransactionOperationResult to throw explicit exceptions on failure.
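
Both fixes follow a common defensive pattern: guard accessors after disposal, and turn silent deserialization failures into explicit exceptions. A rough sketch of the shape (names illustrative, not the SDK's types):

```python
import json

class DtxResponse:
    """Sketch of the disposal-guard and explicit-deserialization fixes."""

    def __init__(self, body: bytes):
        self._body = body
        self._disposed = False

    def dispose(self):
        self._disposed = True
        self._body = None

    @property
    def count(self) -> int:
        if self._disposed:
            # analogous to throwing ObjectDisposedException in .NET
            raise RuntimeError("response has been disposed")
        return len(self.results())

    def results(self) -> list:
        if self._disposed:
            raise RuntimeError("response has been disposed")
        try:
            return json.loads(self._body)["results"]
        except (ValueError, KeyError, TypeError) as exc:
            # surface an explicit error rather than returning nothing silently
            raise ValueError("failed to deserialize operation results") from exc
```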

Miscellaneous:

  • Minor cleanup and refactoring for resource URI handling and idempotency token extraction.

Type of change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

@xinlian12 (Member) commented:

@sdkReviewAgent

@xinlian12 (Member) commented:

sdkReviewAgent | Status: ⏳ Queued

Review requested by @xinlian12. I'll start shortly.

@xinlian12 (Member) commented:

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

# Conflicts:
#	Microsoft.Azure.Cosmos/src/DistributedTransaction/DistributedTransactionCommitter.cs
#	Microsoft.Azure.Cosmos/src/DistributedTransaction/DistributedTransactionOperationResult.cs
#	Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/DistributedTransactionCommitterTests.cs
@azure-pipelines commented:

Azure Pipelines:
Successfully started running 1 pipeline(s).

@xinlian12 (Member) commented:

Review complete (40:04)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@xinlian12 (Member) commented:

Review complete (37:13)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage


Refactor DTX retry policy into ClientRetryPolicy

- Add DTX-specific retry block in ClientRetryPolicy.ShouldRetryInternalAsync gated on IsDistributedTransactionRequest, covering:
    * 449/5352 (DtcCoordinatorRaceConflict): honors Retry-After header
    * 500/5411,5412,5413 (DtcLedgerFailure, DtcAccountConfigFailure, DtcDispatchFailure): transient infra failures safe to retry via idempotency token.

- Uses dedicated dtxRetryCount field (max 10) to avoid the single-master write restrictions in ShouldRetryOnUnavailableEndpointStatusCodes.
- 408 and 403.3 were already handled by the pipeline for all request types.
- 429/3200 (DtcLedgerThrottled) was already handled by ResourceThrottleRetryPolicy.

- Simplify DistributedTransactionCommitter.ExecuteCommitWithRetryAsync to only loop on the isRetriable JSON body flag (the one signal not visible at the pipeline level). Remove IsRetriableByStatusCode and the CosmosException-when-408 catch block.

- Update ClientRetryPolicyTests with DTX-specific tests:
    * 408, 449/5352, 500/5411-5413 retry correctly for DTX requests
    * Non-DTX requests are NOT retried on the same codes (gating tests)
    * Budget exhaustion (10 retries) returns NoRetry

- Update DistributedTransactionCommitterTests: remove all status-code retry tests (now covered in ClientRetryPolicyTests), rename tests that assumed CosmosException was caught by the outer loop.

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
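
The gating and budget described in the commit message reduce to a small decision function. A sketch using the status/sub-status codes listed above (the function name and signature are hypothetical, not the SDK's ClientRetryPolicy API):

```python
MAX_DTX_RETRIES = 10  # dedicated DTX budget from the commit message

# (status, sub-status) pairs flagged as retriable for DTX requests
RETRIABLE = {
    (449, 5352),   # DtcCoordinatorRaceConflict: honor Retry-After
    (500, 5411),   # DtcLedgerFailure
    (500, 5412),   # DtcAccountConfigFailure
    (500, 5413),   # DtcDispatchFailure
}

def should_retry_dtx(is_dtx_request, status, sub_status,
                     dtx_retry_count, retry_after_s=0.0):
    """Return (retry?, delay_seconds) for a DTX response.

    Non-DTX requests never enter this block, and a dedicated counter keeps
    the DTX budget separate from the other retry paths in the pipeline.
    """
    if not is_dtx_request:
        return (False, 0.0)                 # gating: non-DTX is untouched
    if dtx_retry_count >= MAX_DTX_RETRIES:
        return (False, 0.0)                 # budget exhausted
    if (status, sub_status) in RETRIABLE:
        delay = retry_after_s if (status, sub_status) == (449, 5352) else 0.0
        return (True, delay)
    return (False, 0.0)
```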

{
private static readonly TimeSpan DefaultRetryBaseDelay = TimeSpan.FromSeconds(1);

internal const int MaxIsRetriableRetryCount = 100;
Member:

I am curious, if this retries 100 times, and inside ClientRetryPolicy we retry 100 times, wouldn't it be like 100k retries for a single request?
To avoid this confusion, maybe we can combine these two retry mechanisms into a single layer?

Contributor:

Good catch. I think the worst case is 10k retries on a single request (before throttling). If we want to combine the two layers, maybe passing a response header through to the client retry policy could work? If we keep the two retry layers and don't add a new header, we can probably just limit this to 3-5 retries to be safe.
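
The arithmetic behind the concern is simply that independent nested retry budgets multiply. For example:

```python
def worst_case_retries(layer_budgets):
    """Worst-case attempt count when independent retry layers nest: budgets multiply."""
    total = 1
    for budget in layer_budgets:
        total *= budget
    return total

# A MaxIsRetriableRetryCount of 100 nested inside a pipeline-level budget of
# 100 would allow up to 100 * 100 = 10,000 attempts for one logical request.
```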
