Skip to content

Fix[fsm]: freeze as a means to synchronize FSMs#1257

Open
dorjesinpo wants to merge 1 commit into
mainfrom
fix/freeze-fsm
Open

Fix[fsm]: freeze as a means to synchronize FSMs#1257
dorjesinpo wants to merge 1 commit into
mainfrom
fix/freeze-fsm

Conversation

@dorjesinpo

Copy link
Copy Markdown
Collaborator

⏺ Problem

During cluster FSM healing, d_queueKeyInfoMapVec is built once in do_initializeQueueKeyInfoMap on the partition thread. However, queue assignment advisories can commit on the cluster thread after this point, making the map stale. When FileStore::open recovers messages from the journal, it encounters QueueOpRecord entries for newly
assigned queues that are missing from the stale map, causing recovery failure.

⏺ Solution

Added a freeze/resume mechanism to PartitionFSM, similar in spirit to a coroutine — the action chain yields at a suspension point, cross-thread work completes, and the
chain resumes where it left off. When the action chain reaches do_attemptOpenStorage, it:

  1. Sets d_isFreezeRequested = true, which suspends the action chain at the current index and freezes the FSM (no new events are processed; they queue up).
  2. Dispatches to the cluster thread to rebuild a fresh QueueKeyInfoMap from the current ClusterState.
  3. Ships the map back to the partition thread, opens FileStore, and calls fsm->unfreeze() to resume the chain and drain queued events.

Key design details:

  • The frozen event stays in the queue as a sentinel so nested enqueueEvent calls see size() > 1 and defer, matching normal (non-frozen) behavior.
  • On resume, processEvent detects d_chainResumeIndex > 0, skips the state transition (already done), and re-enters executeChain at the saved index.
  • d_queueKeyInfoMapVec was changed from vector<QueueKeyInfoMap> to vector<shared_ptr<QueueKeyInfoMap>> for safe cross-thread handoff.

@dorjesinpo dorjesinpo requested a review from a team as a code owner April 5, 2026 17:51
@dorjesinpo dorjesinpo force-pushed the fix/freeze-fsm branch 2 times, most recently from 0a5e313 to fb947ef Compare April 5, 2026 20:34
@dorjesinpo dorjesinpo changed the title freeze as a means to synchronize FSMs Fix[fsm]: freeze as a means to synchronize FSMs Apr 5, 2026
Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>
@dorjesinpo

Copy link
Copy Markdown
Collaborator Author

@kaikulimu
TestStrongConsistency__test_kill_post_start[multi_node_fsm-strong_consistency] fails because

⏺ Summary: Queue operations proceed before node is E_AVAILABLE in FSM workflow

Root cause

ClusterQueueHelper::hasActiveAvailablePrimary (mqbblp_clusterqueuehelper.h:1289-1290) short-circuits in FSM workflow:

  if (d_cluster_p->isFSMWorkflow()) {
      return true;  // skips node status check
  }

This allows restoreStateCluster to proceed as soon as the partition primary status is E_ACTIVE, without verifying the primary node's NodeStatus is E_AVAILABLE. The
non-FSM path correctly checks ns->nodeStatus() == E_AVAILABLE.

Consequences

  1. Lost PUT / missing ACK: Queue reopens on the new primary before it's E_AVAILABLE. retransmitPendingMessagesDispatched fires, sendPutInline hits the closed gatePut
    (node session gate — requires other == E_AVAILABLE), returns e_UNAVAILABLE. Since this is not a permanent error, the code counts it as "transmitted" and clears d_appData,
    destroying the message payload. No future retransmission is possible. The SC ACK is never delivered.
  2. Failed PUSH: The delivery engine tries to PUSH the recovered SC message to the downstream consumer before the primary's own self status is E_AVAILABLE. The gatePush
    (self gate — requires self == E_AVAILABLE) is closed, so the PUSH fails with e_UNAVAILABLE.

Fix

hasActiveAvailablePrimary in FSM workflow should check the primary node's NodeStatus == E_AVAILABLE, same as the non-FSM path.

This shoudl be fixed in #1190

@dorjesinpo

Copy link
Copy Markdown
Collaborator Author

TestAppSubscriptions::test_app_subscription_fanout[multi_node_fsm-eventual_consistency] is supposed to be fixed by #1263

@dorjesinpo

Copy link
Copy Markdown
Collaborator Author

And TestStrongConsistency::test_kill_post_start[multi_node-strong_consistency] is yet another problem.
The fix is in #1265

@kaikulimu kaikulimu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be the best way to solve the issue. Blocking

@dorjesinpo dorjesinpo assigned dorjesinpo and unassigned kaikulimu Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants