
Conversation

@ckeshava
Contributor

High Level Overview of Change

feat: introduce retry logic for victim transactions during DB deadlock. Add unit tests to simulate this scenario

The database deadlock issue has been observed a few times over the last 30 days in the staging and prod environments of the VHS. After the deadlock occurs, the VHS crashes fatally because it accesses an undefined variable. Both of these issues are mitigated in this PR: a retry mechanism with exponential backoff has been added for the victim transaction.

Unfortunately, we do not have database transaction logs going back more than 3-4 days, so it is impossible to determine which two competing database transactions caused this deadlock.
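
At a high level, the retry behaves roughly like the sketch below. This is only an illustration: the helper names (retryOnDeadlock, sleep), the MAX_ATTEMPTS constant, and the exact backoff values are placeholders, not necessarily the identifiers used in the diff.

    const MAX_ATTEMPTS = 3

    async function sleep(ms: number): Promise<void> {
      return new Promise((resolve) => setTimeout(resolve, ms))
    }

    // Retry a DB operation when Postgres reports it as a deadlock victim (SQLSTATE 40P01).
    async function retryOnDeadlock<T>(operation: () => Promise<T>): Promise<T> {
      for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt += 1) {
        try {
          return await operation()
        } catch (err) {
          const isDeadlock =
            err instanceof Error &&
            'code' in err &&
            (err as { code?: string }).code === '40P01'
          if (!isDeadlock || attempt === MAX_ATTEMPTS) {
            throw err
          }
          // Exponential backoff before the next attempt: 1s, 2s, 4s, ...
          await sleep(2 ** (attempt - 1) * 1000)
        }
      }
      throw new Error('unreachable')
    }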

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

Test Plan

Unit tests that simulate a database deadlock have been added.

@ckeshava ckeshava requested review from kuan121 and pdp2121 December 2, 2025 20:43
pdp2121
pdp2121 previously approved these changes Dec 2, 2025
@pdp2121
Collaborator

pdp2121 commented Dec 2, 2025

Do we know why the deadlock happened specifically for this call? Or did we just happen to find an issue with this call in the log?

@ckeshava
Contributor Author

ckeshava commented Dec 2, 2025

Do we know why the deadlock happened specifically for this call? Or did we just happen to find an issue with this call in the log?

The database reported a deadlock between the manifest and validators tables, originating from the diff in this PR. However, the database did not capture the query/transaction logs, so it is not possible to identify the exact SQL commands that deadlocked with each other.

I have requested the platform team to make the necessary changes to ensure that future occurrences of deadlock have more information.

.where({ master_key: manifest.master_key })
.andWhere('seq', '>', manifest.seq)
.catch((err) =>
log.error('Error revoking current manifest', err),

nit: Can we update this comment? We aren't revoking anything, we are just trying to find new manifests.

Contributor Author


updated the error message in 0ef23c2

// eslint-disable-next-line max-depth -- DB deadlock needs special retry logic
if (err instanceof Error && 'code' in err && err.code === '40P01') {
log.error(
'Error revoking older manifests: Deadlock detected, retrying with Exponential Backoff',

@kuan121 kuan121 Dec 2, 2025


The log message retrying with Exponential Backoff can be misleading when the current attempt is the final one. To clarify, we could either use a different message for the last attempt or include both the current attempt and the maximum number of attempts in the log.
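
For example, something along these lines would make the final attempt unambiguous (illustrative only; attempt, MAX_ATTEMPTS, and log are assumed to be the surrounding loop variable, retry limit, and logger):

    const suffix =
      attempt < MAX_ATTEMPTS ? 'retrying with exponential backoff' : 'giving up'
    log.error(
      `Error revoking older manifests: deadlock detected (attempt ${attempt} of ${MAX_ATTEMPTS}), ${suffix}`,
    )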

Contributor Author


updated the log message in 0ef23c2

Comment on lines 68 to 70
await new Promise(function executor(resolve, _reject) {
setTimeout(resolve, 3000)
})

The await on L67 waits for the completion of handleRevocations. What's the purpose of this 3-second await?

Contributor Author


I have explained the intent in the comments in lines 63-65. Let me know if I need to rephrase that.

I don't want the expect(...) statements to be evaluated until all the exponential backoff attempts have been completed against the database. In this specific test, the mock returns the deadlock error twice before the DB transaction succeeds. Hence, the test needs to wait for a total of 1s + 2s = 3000 ms.
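
For reference, the mock setup that produces this behaviour looks roughly like the following (a sketch assuming Jest globals; the actual mock and the fields on the resolved value may differ in the test):

    // Error shaped like Postgres' deadlock report (SQLSTATE 40P01).
    const deadlockError = Object.assign(new Error('deadlock detected'), {
      code: '40P01',
    })

    // Fail the first two write attempts, then succeed, so the retry logic has to
    // back off twice (intended here as 1s + 2s) before the third attempt completes.
    const write = jest
      .fn()
      .mockRejectedValueOnce(deadlockError)
      .mockRejectedValueOnce(deadlockError)
      .mockResolvedValueOnce({ master_key: 'nMASTER1', seq: 5, revoked: true })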


@kuan121 kuan121 Dec 3, 2025


Unless my understanding of await is wrong, await handleRevocations(manifest) already takes at least 3 seconds to complete due to the three write attempts and the two exponential backoff intervals. Since await is blocking, by the time const updated = await handleRevocations(manifest) resolves, the backoff delays have already occurred. Adding an additional

    await new Promise(function executor(resolve, _reject) {
      setTimeout(resolve, 3000)
    })

introduces an unnecessary 3-second delay.

Also, since the exponential backoff inadvertently ended up being 2, 4, and 8 seconds, and the test still passes despite only waiting for 3 seconds, this indirectly shows that lines 68–70 are unnecessary.

Contributor Author


thanks for pointing it out, you are correct. 35bcfda

Comment on lines +76 to +77
expect(updated.master_key).toBe('nMASTER1')
expect(updated.seq).toBe(5)

Should we also verify the revoked field?

Contributor Author


incorporated this suggestion in 0ef23c2

@kuan121

kuan121 commented Dec 2, 2025

I have requested the platform team to make the necessary changes to ensure that future occurrences of deadlock have more information.

I'm curious what specific information we are requesting. How do we know that this information contains what we want to see? Do we have an example? Say a deadlock happens again after the platform change is made: what additional information would we have?

@kuan121

kuan121 commented Dec 2, 2025

Overall, LGTM. I just have a few minor comments and questions.

@ckeshava
Contributor Author

ckeshava commented Dec 2, 2025

I have requested the platform team to make the necessary changes to ensure that future occurrences of deadlock have more information.

I'm curious what specific information we are requesting. How do we know that this information contains what we want to see? Do we have an example? Say a deadlock happens again after the platform change is made: what additional information would we have?

Hello,
At present, the RDS logs only contain messages from the checkpointer process: data points like the write-ahead-log buffer usage, the checkpoint time, etc. However, I didn't find any query logs in the log files.

Information about the queries would be very helpful for troubleshooting such issues. Could you consider turning on query logging? Docs: Turning on query logging for your RDS for PostgreSQL DB instance - Amazon Relational Database Service

Specifically, the following parameters would be useful: log_lock_waits, log_statement: all, and log_connections. Over time, we can identify other useful query-log parameters.


)
// Exponential backoff
await new Promise(function executor(resolve, _reject) {
setTimeout(resolve, 2 ** numberOfAttempts * 1000)

We are backing off 2, 4 and 8 seconds now instead. If you want 1, 2 and 4 seconds, you need to use numberOfAttempts - 1 in the exponent.
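
Concretely, assuming numberOfAttempts counts 1, 2, 3 across the retries (the loop below is just an illustration of the arithmetic):

    // Current:   2 ** numberOfAttempts * 1000       -> 2000, 4000, 8000 ms
    // Suggested: 2 ** (numberOfAttempts - 1) * 1000 -> 1000, 2000, 4000 ms
    for (let numberOfAttempts = 1; numberOfAttempts <= 3; numberOfAttempts += 1) {
      console.log(2 ** (numberOfAttempts - 1) * 1000) // 1000, 2000, 4000
    }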

Contributor Author


thanks, I have updated this in 35bcfda

@ckeshava ckeshava requested a review from pdp2121 December 3, 2025 21:03
Collaborator

@pdp2121 pdp2121 left a comment


LGTM

@ckeshava
Contributor Author

ckeshava commented Dec 3, 2025

thank you for the quick reviews! @kuan121 @pdp2121 🙇

@ckeshava ckeshava merged commit 9246ef6 into ripple:main Dec 3, 2025
3 checks passed