Mitigate DB deadlock on manifest table #401
Conversation
feat: introduce retry logic for victim transactions during DB deadlock. Add unit tests to simulate this scenario
Do we know why the deadlock happened specifically for this call? Or did we just happen to find this issue in the logs?
The database reported an occurrence of deadlock between the manifest and validators tables, originating from the diff in this PR. However, the database did not capture the query/transaction logs, so it is not possible to find the exact SQL commands that deadlocked with each other. I have asked the platform team to make the necessary changes so that future occurrences of deadlock come with more information.
src/shared/database/index.ts
Outdated
.where({ master_key: manifest.master_key })
.andWhere('seq', '>', manifest.seq)
.catch((err) =>
  log.error('Error revoking current manifest', err),
nit: Can we update this comment? We aren't revoking anything, we are just trying to find new manifests.
updated the error message in 0ef23c2
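A plausible rewording along the lines of the reviewer's point (the exact text chosen in 0ef23c2 is not shown in this thread) would be:

.catch((err) =>
  log.error('Error finding newer manifests', err),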
src/shared/database/index.ts
Outdated
// eslint-disable-next-line max-depth -- DB deadlock needs special retry logic
if (err instanceof Error && 'code' in err && err.code === '40P01') {
  log.error(
    'Error revoking older manifests: Deadlock detected, retrying with Exponential Backoff',
The log message 'retrying with Exponential Backoff' can be misleading when the current attempt is the final one. To clarify, we could either use a different message for the last attempt or include both the current attempt and the maximum number of attempts in the log.
updated the log message in 0ef23c2
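For context, a minimal sketch of the retry shape under discussion, assuming a Postgres deadlock surfaces as error code 40P01. The function name, attempt bookkeeping, and message wording are illustrative, not the actual VHS implementation:

const log = console // stand-in for the module's logger
const MAX_ATTEMPTS = 3

async function writeWithDeadlockRetry(write: () => Promise<void>): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt += 1) {
    try {
      await write()
      return
    } catch (err) {
      // 40P01 is Postgres's deadlock_detected error code
      const isDeadlock =
        err instanceof Error &&
        'code' in err &&
        (err as { code?: string }).code === '40P01'
      if (!isDeadlock || attempt === MAX_ATTEMPTS) {
        throw err
      }
      // Report attempt counts, per the review feedback above
      log.error(
        `Deadlock detected, retrying (attempt ${attempt} of ${MAX_ATTEMPTS})`,
        err,
      )
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => {
        setTimeout(resolve, 2 ** (attempt - 1) * 1000)
      })
    }
  }
}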
await new Promise(function executor(resolve, _reject) {
  setTimeout(resolve, 3000)
})
The await on L67 waits for the completion of handleRevocations. What's the purpose of this 3-second await?
I have explained the intent in the comments on lines 63-65. Let me know if I need to rephrase that.
I don't want the expect(...) statements to be evaluated until all the exponential backoff attempts are completed by the database. In this specific test, the mock returns the deadlock error twice before executing the DB transaction, so it needs to wait for a total of 1 s + 2 s = 3 s (3000 ms).
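A sketch of the mock behavior being described, assuming Jest globals; the shape of the resolved row is illustrative:

const deadlockError = Object.assign(new Error('deadlock detected'), {
  code: '40P01',
})
// Reject with a deadlock twice, then let the write succeed, so the retry
// logic sleeps 1s + 2s before the third (successful) attempt
const write = jest
  .fn()
  .mockRejectedValueOnce(deadlockError)
  .mockRejectedValueOnce(deadlockError)
  .mockResolvedValueOnce({ master_key: 'nMASTER1', seq: 5 })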
Unless my understanding of await is wrong, await handleRevocations(manifest) already takes at least 3 seconds to complete due to the three write attempts and the two exponential backoff intervals. Since await is blocking, by the time const updated = await handleRevocations(manifest) resolves, the backoff delays have already occurred. Adding an additional
await new Promise(function executor(resolve, _reject) {
  setTimeout(resolve, 3000)
})
introduces an unnecessary 3-second delay.
Also, since the exponential backoff was inadvertently changed to 2, 4, and 8 seconds, and the test still passes despite only waiting for 3 seconds, this indirectly shows that lines 68–70 are unnecessary.
Thanks for pointing it out, you are correct. Fixed in 35bcfda.
expect(updated.master_key).toBe('nMASTER1')
expect(updated.seq).toBe(5)
Should we also verify the revoked field?
incorporated this suggestion in 0ef23c2
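With that suggestion applied, the assertions would presumably look something like this (the expected revoked value depends on the fixture data):

const updated = await handleRevocations(manifest)
expect(updated.master_key).toBe('nMASTER1')
expect(updated.seq).toBe(5)
expect(updated.revoked).toBe(false) // or true, per the scenario under test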
I'm curious what specific information we requested. How do we know that this information will have what we want to see? Do we have an example? Say a deadlock happens again after the platform change is made, what additional information will we have?
Overall, LGTM. I just have a few minor comments and questions.
Hello, information about the queries will be very helpful for us to troubleshoot issues. Can you consider turning on query logging? Docs: Turning on query logging for your RDS for PostgreSQL DB instance - Amazon Relational Database Service. Specifically, the following parameters are useful: log_lock_waits, log_statement = all, and log_connections. Over time, we can identify other useful query-log parameters.
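For reference, these are standard PostgreSQL settings; in postgresql.conf form they would look roughly like this (on RDS they are set through the DB parameter group instead):

log_lock_waits = on      # log when a session waits longer than deadlock_timeout
log_statement = 'all'    # log every executed SQL statement
log_connections = on     # log each client connection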
src/shared/database/index.ts
Outdated
)
// Exponential backoff
await new Promise(function executor(resolve, _reject) {
  setTimeout(resolve, 2 ** numberOfAttempts * 1000)
We are now backing off 2, 4, and 8 seconds instead. If you want 1, 2, and 4 seconds, you need to use numberOfAttempts - 1.
Thanks, I have updated this in 35bcfda.
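To spell out the off-by-one: with numberOfAttempts running 1, 2, 3, the two expressions give different delay series, and only the -1 form matches the intended 1s/2s/4s backoff:

// 2 ** numberOfAttempts * 1000       -> 2000, 4000, 8000 ms
// 2 ** (numberOfAttempts - 1) * 1000 -> 1000, 2000, 4000 ms
setTimeout(resolve, 2 ** (numberOfAttempts - 1) * 1000)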
pdp2121 left a comment
LGTM
High Level Overview of Change
feat: introduce retry logic for victim transactions during DB deadlock. Add unit tests to simulate this scenario
The database deadlock issue has been observed a few times over the last 30 days in the staging and prod environments of the VHS. After a deadlock occurs, the VHS crashes fatally due to accessing an undefined variable. Both of these issues are mitigated in this PR. A new exponential-backoff transaction retry mechanism has been added.
Unfortunately, we do not retain database transaction logs for more than a 3-4 day period, so it is impossible to determine which two competing database transactions caused this deadlock.
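The crash fix itself is not shown in the snippets above; a minimal sketch of the kind of guard that prevents an undefined-variable access, with table and variable names invented for illustration (knex and log are assumed module-level bindings):

// Hypothetical: bail out instead of dereferencing a possibly-undefined row
const existing = await knex('manifests')
  .where({ master_key: manifest.master_key })
  .first()
if (existing === undefined) {
  log.warn('No manifest row found, skipping revocation')
  return
}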
Type of Change
Test Plan
Unit tests that simulate a database deadlock have been added.