
fix(recovery): make audit recovery global and token requests idempotent#1528

Open
Aman-Cool wants to merge 1 commit into hyperledger-labs:main from Aman-Cool:fix/commit-pipeline-global-recovery

@Aman-Cool
Contributor

Following up on the review feedback from #1522, here's the more global approach.

Audit crash recovery

This is the same split-brain scenario as on the ttx side (PR #1507): the node crashes after auditDB.Append writes the token record but before SetStatus(Confirmed) runs. On restart, RestoreTMS re-registers a finality listener for the stuck Pending record. However, if the finality event was already delivered before the crash, it won't be delivered again, and the record stays Pending forever.

The fix: before adding the listener, check whether the tokens are already in tokenDB. If they are, set the status to Confirmed directly and move on. The check lives in a standalone recoverAuditCommittedPending function, so it can be called from other recovery paths as the codebase evolves.
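A minimal Go sketch of the recovery check, assuming hypothetical AuditStore and TokenStore interfaces (PendingTxIDs, SetStatus, TransactionExists) in place of the real auditDB and tokenDB APIs. Only the name recoverAuditCommittedPending comes from this PR; its signature here is an assumption. Records whose tokens are already committed are confirmed directly, and the remaining IDs are returned so the caller can register finality listeners only for those.

```go
package main

import (
	"fmt"
	"sort"
)

// TxStatus models the audit record states mentioned in the PR (hypothetical type).
type TxStatus int

const (
	Pending TxStatus = iota
	Confirmed
)

// AuditStore is a hypothetical stand-in for the auditDB operations used during recovery.
type AuditStore interface {
	PendingTxIDs() ([]string, error)
	SetStatus(txID string, status TxStatus) error
}

// TokenStore is a hypothetical stand-in for tokenDB.
type TokenStore interface {
	// TransactionExists reports whether the tokens written by txID are already committed.
	TransactionExists(txID string) (bool, error)
}

// recoverAuditCommittedPending confirms Pending audit records whose tokens are
// already in tokenDB and returns the tx IDs that still need a finality listener.
// The signature is an assumption; only the function name comes from the PR.
func recoverAuditCommittedPending(audit AuditStore, tokens TokenStore) ([]string, error) {
	pending, err := audit.PendingTxIDs()
	if err != nil {
		return nil, fmt.Errorf("listing pending audit records: %w", err)
	}
	var needListener []string
	for _, txID := range pending {
		committed, err := tokens.TransactionExists(txID)
		if err != nil {
			return nil, fmt.Errorf("checking tokenDB for %s: %w", txID, err)
		}
		if committed {
			// Crashed between Append and SetStatus: heal the record now, because
			// the finality event may never be re-delivered.
			if err := audit.SetStatus(txID, Confirmed); err != nil {
				return nil, fmt.Errorf("confirming %s: %w", txID, err)
			}
			continue
		}
		// Not committed yet: the caller should (re-)register a finality listener.
		needListener = append(needListener, txID)
	}
	return needListener, nil
}

// memAudit and memTokens are in-memory fakes used only for the demo below.
type memAudit struct{ status map[string]TxStatus }

func (m *memAudit) PendingTxIDs() ([]string, error) {
	var ids []string
	for id, s := range m.status {
		if s == Pending {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids) // deterministic order for the demo
	return ids, nil
}

func (m *memAudit) SetStatus(txID string, s TxStatus) error {
	m.status[txID] = s
	return nil
}

type memTokens struct{ committed map[string]bool }

func (m memTokens) TransactionExists(txID string) (bool, error) {
	return m.committed[txID], nil
}

func main() {
	audit := &memAudit{status: map[string]TxStatus{"tx1": Pending, "tx2": Pending}}
	// tx1 simulates the crash case: tokens committed, audit record still Pending.
	tokens := memTokens{committed: map[string]bool{"tx1": true}}
	rest, err := recoverAuditCommittedPending(audit, tokens)
	if err != nil {
		panic(err)
	}
	fmt.Println("needs listener:", rest, "tx1 confirmed:", audit.status["tx1"] == Confirmed)
}
```

Returning the leftover IDs instead of registering listeners inside the function keeps the check independent of the bootstrap path, which matches the stated goal of calling it from other recovery paths.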

Idempotent token request writes

AddTokenRequest was doing a plain INSERT into a table with a PRIMARY KEY on tx_id, so any view retry after a transient failure would hit a UNIQUE VIOLATION and fail permanently instead of recovering. Adding ON CONFLICT DO NOTHING makes replaying the write safe: if the record is already there, there is nothing to do.

Together these close the two gaps pointed out in the review. Happy to adjust either if the direction isn't quite right.

- Add crash-recovery to auditor RestoreTMS: if tokens were already
  committed to tokenDB but auditDB status is still Pending (node crashed
  between Append and SetStatus), heal the record directly on restart
  instead of waiting for a finality event that may not be re-delivered
- Make AddTokenRequest idempotent with ON CONFLICT DO NOTHING so view
  retries after a transient failure no longer hit a UNIQUE VIOLATION on
  tx_id; makes the token-request write safe to replay at any point

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool
Contributor Author

@adecaro, I opened this as a follow-up to the review feedback on #1522. I kept the audit recovery logic in its own function so it isn't tied to the bootstrap path and is easier to call from other places if needed down the line.
The ON CONFLICT DO NOTHING on the token request insert handles the retry case cleanly without any extra plumbing.

Let me know what you think.
