
fix: transactions getting permanently stuck as Pending after a node crash #1507

Open
Aman-Cool wants to merge 4 commits into hyperledger-labs:main from Aman-Cool:fix/commit-pipeline-atomicity

Conversation

@Aman-Cool
Contributor

So I was digging into the commit pipeline and found something that's been quietly causing problems in production deployments.

When a token transaction gets confirmed on-chain, the finality listener does two things in sequence: first it writes all the UTXO changes to tokenDB (new outputs created, spent inputs deleted), and then it updates the transaction's status in ttxDB from Pending to Confirmed. These are two completely separate SQL transactions with nothing tying them together.

Which means that if your node dies at exactly the wrong moment between those two writes (OOM kill, power loss, someone tripping over a cable, whatever), you end up in a state where the token balances in tokenDB are perfectly correct but ttxDB still shows the transaction as Pending. Forever. The in-memory retry runner that would have finished the job died with the process, and the one-shot finality listener was already consumed and never gets re-registered.

The practical fallout is pretty bad. Your wallet shows the right balance (tokens are there), but any query against transaction history says the transaction never finished. Audit reports flag it as unresolved. If you have any retry logic polling for Confirmed status it just spins indefinitely. And there's no alarm, no automatic recovery, no way to know this happened unless you're manually cross-referencing two databases.


The fix lives in RestoreTMS, which already runs on startup and iterates over every Pending transaction to re-register finality listeners. Before handing each one off to the delivery service, we now do a quick check: does this txID already exist in tokenDB?

  • If yes: the crash happened after Step 1 but before Step 2. We know the transaction is fully committed on-chain. We call SetStatus(Confirmed) directly right there, log it, and move on. No need to wait for Fabric to re-deliver the block, which isn't even guaranteed depending on where the seek checkpoint landed.
  • If no: completely normal restart scenario. The transaction genuinely hasn't been committed yet. Register the finality listener as usual and let it play out.
  • If the existence check itself fails (transient DB error, whatever): we log a warning and fall back to the finality listener. No hard failure, no data loss.
  • If SetStatus fails after a positive existence check: same thing, fall back to the listener. Belt and suspenders.

Both operations are idempotent by design. tokens.Append has an explicit existence guard so calling it twice is a no-op. ttxDB.SetStatus is a plain SQL UPDATE so writing Confirmed to an already-Confirmed row does nothing. There's no risk of double-applying anything.


To make the recovery logic actually testable without spinning up real databases, the core check got extracted into recoverCommittedPending, a small standalone function that takes two narrow interfaces (tokenExistenceChecker and pendingStatusSetter) instead of the concrete storage types. Five unit tests cover:

  1. The main crash scenario: tokens committed, status healed to Confirmed, finality listener skipped
  2. Normal restart: tokens not yet there, listener registered as usual
  3. Existence check blows up: graceful fallback, no panic
  4. SetStatus blows up: graceful fallback, no data corruption
  5. Calling it twice for the same txID: both calls succeed cleanly; the second is a no-op in production

This is particularly nasty in high-throughput environments like CBDC pilots or supply-chain tokenization platforms where you're processing a lot of transactions and a routine rolling restart or unexpected crash is basically guaranteed to hit this window eventually. The failure is completely silent: no error surfaced to the operator, no alert, just a Pending record that never moves and an audit trail with a hole in it.

After this fix, any node that restarts after hitting this state will automatically heal itself during RestoreTMS before it starts accepting new work. No manual intervention, no DB surgery, no need to replay blocks.

…estart

RestoreTMS now checks tokenDB before re-registering a finality listener.
If TransactionExists returns true, the node crashed after tokens.Append
succeeded but before ttxDB.SetStatus(Confirmed) ran. The status is healed
directly instead of waiting for Fabric block re-delivery, which is not
guaranteed.

Extracts the logic into recoverCommittedPending (narrow interfaces) and
adds five unit tests covering the happy path, normal restart, and all
error fallbacks.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool force-pushed the fix/commit-pipeline-atomicity branch from 6cb1458 to 691e428 on April 9, 2026 06:26
@Aman-Cool changed the title from "fix(ttx): recover Pending tx whose tokens were already committed on r…" to "fix: transactions getting permanently stuck as Pending after a node crash" on Apr 9, 2026
@Aman-Cool
Contributor Author

Hey @adecaro, found a nasty silent one: if the node crashes between the two DB writes in runOnStatus, the transaction gets stuck as Pending forever even though everything on-chain is perfectly fine. No alert, no recovery, just a ghost audit record that never resolves.

Fix hooks into the existing RestoreTMS startup scan and does a quick tokenDB cross-check to catch and heal this case automatically on restart. Small change, both sides idempotent, added tests.

Would love a second pair of eyes from someone who knows the commit pipeline well :)

@adecaro
Contributor

adecaro commented Apr 10, 2026

Hi @Aman-Cool , please run make fmt and make lint-auto-fix. They should resolve the current issues reported by the CI. Thanks 🙏

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool force-pushed the fix/commit-pipeline-atomicity branch from d2e0dba to dd7d1d6 on April 10, 2026 15:01
Signed-off-by: Aman-Cool <aman017102007@gmail.com>