Skip to content

QuorumReader code not yielding on 429#5155

Closed
dibahlfi wants to merge 10 commits intomsdata/directfrom
429-throttling-fix
Closed

QuorumReader code not yielding on 429#5155
dibahlfi wants to merge 10 commits intomsdata/directfrom
429-throttling-fix

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented Apr 30, 2025

When receiving 429s on a single region account with strong consistency, quorum reader code does not yield on QuorumRead and Barrier.

Single429Failure-BarrierCalls.json

429WriteBarrierCallsWithoutFix.json

429WriteBarrierCallsWithFix.json

// original issue diagnostics - QuorumRead
429-OriginalIssueWithOutFix.json

// verifying existing behavior by turning off exceptionless for 429
429-OriginalIssueWithExceptionLessTurnedoOfFor429.json

//fix Diagnostics with a large number of 429 failures- - QuorumRead
429-WithFixAll429s.json

//fix Diagnostics with fewer 429 failures resulting in eventual success- - QuorumRead
429-Withfewer429sResultsInEventualSuccess.json

// original issue diagnostics - Barrier calls
429 BarrierCallsOriginalIssueWithDiagnostics.json

// Barrier calls fix with Diagnostics
429-BarrierCallsFixWithDianogstics.json

Unit tests for QuorumReader is pending which I will submit along with the Direct code changes.

  • [] Bug fix (non-breaking change which fixes an issue)

To automatically close an issue: closes #5035

fix:refactoring tests

fix: removing secrets
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

Comment thread Microsoft.Azure.Cosmos/src/direct/QuorumReader.cs
@@ -153,7 +153,6 @@ public async Task<ShouldRetryResult> ShouldRetryAsync(
CancellationToken cancellationToken)
{
this.retryContext = null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undo this file

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. its just draft PR I will not include it in the direct package PR>


// Check if all replicas returned 429
// TO DO: check with Kiran/Fabian if 429 from one replica is going to be enough to yield but that may be turn out to be flase alarm considering calls are async so the 429 state could be transient
if (responses.All(response => response.Target.StatusCode == StatusCodes.TooManyRequests))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any or All?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think All make sense.I will remove the comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With exceptions is the behavior all 429's from both replica?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with one replica failure its returning successful response as expected. updated the PR with additional test case coverage and included the diagnostics for the same

readBarrierLsn: secondaryQuorumReadResult.SelectedLsn,
targetGlobalCommittedLSN: secondaryQuorumReadResult.GlobalCommittedSelectedLsn,
readMode: readMode))
#pragma warning disable CS8632 // The annotation for nullable reference types should only be used in code within a '#nullable' annotations context.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check the QuorumWrite paths also once.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 429 errors and implement the same there?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (isThrottled && throttledResponse != null)
{
// Handle throttling by returning the throttled response
DefaultTrace.TraceWarning("WritePrivateAsync: Throttling occurred during write barrier. Returning throttled response.");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info is fine.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets trace the status and sub status codes along.

if (responses != null)
{
return true;
// Check if all replicas returned 429
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will a helper method help avoid duplication?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants