Fix[fsm]: freeze as a means to synchronize FSMs by dorjesinpo · Pull Request #1257 · bloomberg/blazingmq

dorjesinpo · 2026-04-05T17:51:16Z

⏺ Problem

During cluster FSM healing, d_queueKeyInfoMapVec is built once in do_initializeQueueKeyInfoMap on the partition thread. However, queue assignment advisories can commit on the cluster thread after this point, making the map stale. When FileStore::open recovers messages from the journal, it encounters QueueOpRecord entries for newly
assigned queues that are missing from the stale map, causing recovery failure.

⏺ Solution

Added a freeze/resume mechanism to PartitionFSM, similar in spirit to a coroutine — the action chain yields at a suspension point, cross-thread work completes, and the
chain resumes where it left off. When the action chain reaches do_attemptOpenStorage, it:

Sets d_isFreezeRequested = true, which suspends the action chain at the current index and freezes the FSM (no new events are processed; they queue up).
Dispatches to the cluster thread to rebuild a fresh QueueKeyInfoMap from the current ClusterState.
Ships the map back to the partition thread, opens FileStore, and calls fsm->unfreeze() to resume the chain and drain queued events.

Key design details:

The frozen event stays in the queue as a sentinel so nested enqueueEvent calls see size() > 1 and defer, matching normal (non-frozen) behavior.
On resume, processEvent detects d_chainResumeIndex > 0, skips the state transition (already done), and re-enters executeChain at the saved index.
d_queueKeyInfoMapVec was changed from vector<QueueKeyInfoMap> to vector<shared_ptr<QueueKeyInfoMap>> for safe cross-thread handoff.

Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>

dorjesinpo · 2026-04-06T21:30:49Z

@kaikulimu
TestStrongConsistency__test_kill_post_start[multi_node_fsm-strong_consistency] fails because

⏺ Summary: Queue operations proceed before node is E_AVAILABLE in FSM workflow

Root cause

ClusterQueueHelper::hasActiveAvailablePrimary (mqbblp_clusterqueuehelper.h:1289-1290) short-circuits in FSM workflow:

  if (d_cluster_p->isFSMWorkflow()) {
      return true;  // skips node status check
  }

This allows restoreStateCluster to proceed as soon as the partition primary status is E_ACTIVE, without verifying the primary node's NodeStatus is E_AVAILABLE. The
non-FSM path correctly checks ns->nodeStatus() == E_AVAILABLE.

Consequences

Lost PUT / missing ACK: Queue reopens on the new primary before it's E_AVAILABLE. retransmitPendingMessagesDispatched fires, sendPutInline hits the closed gatePut
(node session gate — requires other == E_AVAILABLE), returns e_UNAVAILABLE. Since this is not a permanent error, the code counts it as "transmitted" and clears d_appData,
destroying the message payload. No future retransmission is possible. The SC ACK is never delivered.
Failed PUSH: The delivery engine tries to PUSH the recovered SC message to the downstream consumer before the primary's own self status is E_AVAILABLE. The gatePush
(self gate — requires self == E_AVAILABLE) is closed, so the PUSH fails with e_UNAVAILABLE.

Fix

hasActiveAvailablePrimary in FSM workflow should check the primary node's NodeStatus == E_AVAILABLE, same as the non-FSM path.

This shoudl be fixed in #1190

dorjesinpo · 2026-04-06T21:32:57Z

TestAppSubscriptions::test_app_subscription_fanout[multi_node_fsm-eventual_consistency] is supposed to be fixed by #1263

dorjesinpo · 2026-04-06T22:46:47Z

And TestStrongConsistency::test_kill_post_start[multi_node-strong_consistency] is yet another problem.
The fix is in #1265

kaikulimu

Might not be the best way to solve the issue. Blocking

dorjesinpo requested a review from a team as a code owner April 5, 2026 17:51

dorjesinpo mentioned this pull request Apr 5, 2026

Fix[mqb]: drain on closeChannel #1221

Merged

dorjesinpo force-pushed the fix/freeze-fsm branch 2 times, most recently from 0a5e313 to fb947ef Compare April 5, 2026 20:34

dorjesinpo changed the title ~~freeze as a means to synchronize FSMs~~ Fix[fsm]: freeze as a means to synchronize FSMs Apr 5, 2026

freeze as a means to synchronize FSMs

78c5832

Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>

dorjesinpo force-pushed the fix/freeze-fsm branch from fb947ef to 78c5832 Compare April 6, 2026 16:11

dorjesinpo assigned kaikulimu Apr 6, 2026

dorjesinpo requested a review from kaikulimu April 6, 2026 18:58

kaikulimu requested changes Apr 7, 2026

View reviewed changes

dorjesinpo assigned dorjesinpo and unassigned kaikulimu Apr 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix[fsm]: freeze as a means to synchronize FSMs#1257

Fix[fsm]: freeze as a means to synchronize FSMs#1257
dorjesinpo wants to merge 1 commit into
mainfrom
fix/freeze-fsm

dorjesinpo commented Apr 5, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

kaikulimu left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dorjesinpo commented Apr 5, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

dorjesinpo commented Apr 6, 2026

Uh oh!

kaikulimu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants