fix: transactions getting permanently stuck as Pending after a node crash by Aman-Cool · Pull Request #1507 · hyperledger-labs/fabric-token-sdk

Aman-Cool · 2026-04-09T06:24:57Z

So I was digging into the commit pipeline and found something that's been quietly causing problems in production deployments.

When a token transaction gets confirmed on-chain, the finality listener does two things in sequence: first it writes all the UTXO changes to tokenDB (new outputs created, spent inputs deleted), and then it updates the transaction's status in ttxDB from Pending to Confirmed. These are two completely separate SQL transactions with nothing tying them together.

Which means if your node dies (OOM kill, power loss, someone tripping over a cable, whatever) at exactly the wrong moment between those two writes, you end up in a state where the token balances in tokenDB are perfectly correct but ttxDB still shows the transaction as Pending. Forever. The in-memory retry runner that would've finished the job died with the process, and the one-shot finality listener was already consumed and never gets re-registered.
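To make the window concrete, here is a toy simulation of the two writes. It is only a sketch: the `node` type, `onFinality`, and the `crashBetween` flag are made up for illustration and are not the SDK's actual types; two in-memory maps stand in for tokenDB and ttxDB.

```go
package main

import (
	"errors"
	"fmt"
)

// Status mirrors the two ttxDB states discussed above (illustrative type).
type Status string

const (
	Pending   Status = "Pending"
	Confirmed Status = "Confirmed"
)

// node is a hypothetical stand-in for one process with two independent stores.
type node struct {
	tokenDB map[string]bool   // txID -> tokens already appended
	ttxDB   map[string]Status // txID -> transaction status
}

var errCrash = errors.New("simulated crash between the two writes")

// onFinality performs the two writes in sequence. crashBetween simulates the
// process dying after the tokenDB write but before the ttxDB write.
func (n *node) onFinality(txID string, crashBetween bool) error {
	n.tokenDB[txID] = true // step 1: committed on its own
	if crashBetween {
		return errCrash
	}
	n.ttxDB[txID] = Confirmed // step 2: never reached on a crash
	return nil
}

func main() {
	n := &node{
		tokenDB: map[string]bool{},
		ttxDB:   map[string]Status{"tx1": Pending},
	}
	_ = n.onFinality("tx1", true)
	// Tokens are recorded, but the status is still Pending.
	fmt.Println(n.tokenDB["tx1"], n.ttxDB["tx1"]) // true Pending
}
```

Nothing in the second store records that step 1 already happened, which is why the mismatch is invisible unless you compare both databases.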

The practical fallout is pretty bad. Your wallet shows the right balance (tokens are there), but any query against transaction history says the transaction never finished. Audit reports flag it as unresolved. If you have any retry logic polling for Confirmed status, it just spins indefinitely. And there's no alarm, no automatic recovery, no way to know this happened unless you're manually cross-referencing two databases.

The fix lives in RestoreTMS, which already runs on startup and iterates over every Pending transaction to re-register finality listeners. Before handing each one off to the delivery service, we now do a quick check: does this txID already exist in tokenDB?

If yes: the crash happened after Step 1 but before Step 2. We know the transaction is fully committed on-chain. We call SetStatus(Confirmed) directly right there, log it, and move on. No need to wait for Fabric to re-deliver the block, which isn't even guaranteed depending on where the seek checkpoint landed.
If no: completely normal restart scenario. The transaction genuinely hasn't been committed yet. Register the finality listener as usual and let it play out.
If the existence check itself fails (transient DB error, whatever): we log a warning and fall back to the finality listener. No hard failure, no data loss.
If SetStatus fails after a positive existence check: same thing, fall back to the listener. Belt and suspenders.
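The four branches above can be sketched roughly as follows. The names recoverCommittedPending, tokenExistenceChecker, pendingStatusSetter, TransactionExists and SetStatus come from this PR, but the method signatures, the boolean return value, the string status, and the fakes are assumptions made for this example, not the merged code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Assumed signatures: the PR names these interfaces but the exact
// method sets shown here are guesses for illustration.
type tokenExistenceChecker interface {
	TransactionExists(ctx context.Context, txID string) (bool, error)
}

type pendingStatusSetter interface {
	SetStatus(ctx context.Context, txID string, status string) error
}

const confirmed = "Confirmed"

// recoverCommittedPending returns true when the status was healed and the
// finality listener can be skipped; false means "register the listener".
func recoverCommittedPending(ctx context.Context, tokens tokenExistenceChecker, ttx pendingStatusSetter, txID string) bool {
	exists, err := tokens.TransactionExists(ctx, txID)
	if err != nil {
		fmt.Printf("warn: existence check failed for %s: %v\n", txID, err)
		return false // fall back to the listener
	}
	if !exists {
		return false // normal restart: not committed yet
	}
	if err := ttx.SetStatus(ctx, txID, confirmed); err != nil {
		fmt.Printf("warn: SetStatus failed for %s: %v\n", txID, err)
		return false // fall back to the listener
	}
	fmt.Printf("healed %s to %s\n", txID, confirmed)
	return true
}

// errDB and the fakes below are in-memory stand-ins used only for this example.
var errDB = errors.New("db unavailable")

type fakeTokens struct {
	known map[string]bool
	err   error
}

func (f fakeTokens) TransactionExists(_ context.Context, txID string) (bool, error) {
	return f.known[txID], f.err
}

type fakeTTX struct {
	status map[string]string
	err    error
}

func (f fakeTTX) SetStatus(_ context.Context, txID string, status string) error {
	if f.err != nil {
		return f.err
	}
	f.status[txID] = status
	return nil
}

func main() {
	ctx := context.Background()
	ttx := fakeTTX{status: map[string]string{"tx1": "Pending", "tx2": "Pending"}}
	tokens := fakeTokens{known: map[string]bool{"tx1": true}}

	for _, txID := range []string{"tx1", "tx2"} {
		if !recoverCommittedPending(ctx, tokens, ttx, txID) {
			fmt.Printf("registering finality listener for %s\n", txID)
		}
	}
	// A failing existence check also ends up on the listener path.
	if !recoverCommittedPending(ctx, fakeTokens{err: errDB}, ttx, "tx1") {
		fmt.Println("registering finality listener for tx1 after failed check")
	}
}
```

Returning a plain bool keeps the caller simple: anything other than a confirmed heal takes the existing listener path, so every failure mode degrades to the pre-fix behavior.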

Both operations are idempotent by design. tokens.Append has an explicit existence guard so calling it twice is a no-op. ttxDB.SetStatus is a plain SQL UPDATE so writing Confirmed to an already-Confirmed row does nothing. There's no risk of double-applying anything.
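A toy illustration of why repeating either write is harmless. This is a sketch with in-memory stores, not the SDK's storage layer: `tokenStore`, `statusStore`, and the SQL shape in the comment are assumptions for illustration only.

```go
package main

import "fmt"

// tokenStore is a hypothetical in-memory stand-in for tokenDB.
type tokenStore struct {
	outputs map[string][]string // txID -> outputs created by that transaction
}

// Append records outputs for txID unless that transaction was already
// recorded, mimicking the existence guard described above.
func (s *tokenStore) Append(txID string, outputs []string) {
	if _, seen := s.outputs[txID]; seen {
		return // already applied: no-op
	}
	s.outputs[txID] = outputs
}

// statusStore is a hypothetical in-memory stand-in for ttxDB.
type statusStore struct {
	status map[string]string // txID -> status
}

// SetStatus behaves like a plain UPDATE on an existing row
// (roughly: UPDATE ... SET status = ? WHERE tx_id = ?, column names assumed).
func (s *statusStore) SetStatus(txID, status string) {
	if _, ok := s.status[txID]; ok {
		s.status[txID] = status
	}
}

func main() {
	tokens := &tokenStore{outputs: map[string][]string{}}
	ttx := &statusStore{status: map[string]string{"tx1": "Pending"}}

	// Apply both writes twice, as a recovery pass followed by a late
	// finality event would.
	for i := 0; i < 2; i++ {
		tokens.Append("tx1", []string{"out0", "out1"})
		ttx.SetStatus("tx1", "Confirmed")
	}
	fmt.Println(len(tokens.outputs["tx1"]), ttx.status["tx1"]) // 2 Confirmed
}
```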

To make the recovery logic actually testable without spinning up real databases, the core check got extracted into recoverCommittedPending, a small standalone function that takes two narrow interfaces (tokenExistenceChecker and pendingStatusSetter) instead of the concrete storage types. Five unit tests cover:

The main crash scenario: tokens committed, status gets healed to Confirmed, finality listener skipped
Normal restart: tokens not yet there, listener registered as usual
Existence check blows up: graceful fallback, no panic
SetStatus blows up: graceful fallback, no data corruption
Calling it twice for the same txID: both calls succeed cleanly, second one is a no-op in production

This is particularly nasty in high-throughput environments like CBDC pilots or supply-chain tokenization platforms where you're processing a lot of transactions and a routine rolling restart or unexpected crash is basically guaranteed to hit this window eventually. The failure is completely silent: no error surfaced to the operator, no alert, just a Pending record that never moves and an audit trail with a hole in it.

After this fix, any node that restarts after hitting this state will automatically heal itself during RestoreTMS before it starts accepting new work. No manual intervention, no DB surgery, no need to replay blocks.

…estart

RestoreTMS now checks tokenDB before re-registering a finality listener. If TransactionExists returns true, the node crashed after tokens.Append succeeded but before ttxDB.SetStatus(Confirmed) ran. The status is healed directly instead of waiting for Fabric block re-delivery, which is not guaranteed. Extracts the logic into recoverCommittedPending (narrow interfaces) and adds five unit tests covering the happy path, normal restart, and all error fallbacks.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>

Aman-Cool · 2026-04-09T06:30:42Z

Hey @adecaro, found a nasty silent one: if the node crashes between the two DB writes in runOnStatus, the transaction gets stuck as Pending forever even though everything on-chain is perfectly fine. No alert, no recovery, just a ghost audit record that never resolves.

The fix hooks into the existing RestoreTMS startup scan and does a quick tokenDB cross-check to catch and heal this case automatically on restart. Small change, both sides idempotent, added tests.

Would love a second pair of eyes from someone who knows the commit pipeline well :)

adecaro · 2026-04-10T14:47:54Z

Hi @Aman-Cool, please run make fmt and make lint-auto-fix. They should resolve the current issues reported by the CI. Thanks 🙏

Signed-off-by: Aman-Cool <aman017102007@gmail.com>

Aman-Cool force-pushed the fix/commit-pipeline-atomicity branch from 6cb1458 to 691e428 on April 9, 2026 06:26

Aman-Cool changed the title ~~fix(ttx): recover Pending tx whose tokens were already committed on r…~~ fix: transactions getting permanently stuck as Pending after a node crash Apr 9, 2026

Aman-Cool mentioned this pull request Apr 10, 2026

fix: plug five silent asset-safety bugs across selector, storage, and htlc #1522

Merged

adecaro force-pushed the fix/commit-pipeline-atomicity branch from 691e428 to d2e0dba on April 10, 2026 14:28

style: apply gofmt fixes to manager.go and manager_recover_test.go

dd7d1d6

Signed-off-by: Aman-Cool <aman017102007@gmail.com>

Aman-Cool force-pushed the fix/commit-pipeline-atomicity branch from d2e0dba to dd7d1d6 on April 10, 2026 15:01

style: fix nlreturn and usetesting lint violations

afee84a

Signed-off-by: Aman-Cool <aman017102007@gmail.com>

Aman-Cool force-pushed the fix/commit-pipeline-atomicity branch from 79fcb49 to afee84a on April 10, 2026 16:18

Aman-Cool mentioned this pull request Apr 11, 2026

fix(recovery): make audit recovery global and token requests idempotent #1528

Open

Merge branch 'main' into fix/commit-pipeline-atomicity

38aec28

Aman-Cool wants to merge 4 commits into hyperledger-labs:main from Aman-Cool:fix/commit-pipeline-atomicity
