
Conversation

@kaikulimu (Collaborator) commented Sep 30, 2025

Today in the Cluster FSM, the leader becomes healed upon the CSL commit success of the first leader advisory, at which point it initializes the queue key info map on the cluster thread. Near-identical logic exists on the follower node.

At the same time, the CSL commit callback fires and triggers the onPartitionPrimaryAssignment observer, which jumpstarts the Partition FSM. The Partition FSM will eventually attempt to open the FileStore in the partition thread, which accesses the queue key info map. There is a slight chance of a race condition between the initialization on the cluster thread and this access on the partition thread, so let's fix it.

The fix consists of two parts:

  1. We always update the partition information before triggering the corresponding transitions in the Partition FSMs, so that PFSMs always have access to the latest info. As a result, the PFSM actions do_storePartitionInfo and do_clearPartitionInfo are removed (a sketch of this ordering follows the list).
  2. We make sure PFSMs are only jumpstarted after the queue key info map is initialized. We achieve this by delaying the calls to processPrimaryDetect/processReplicaDetect until initializeQueueKeyInfoMap has completed.
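
A minimal sketch of the ordering in part 1, using simplified stand-in types and functions (PartitionInfo's fields, the free dispatchEventToPartition signature, and the event enum below are illustrative placeholders, not the actual mqbc code):

```cpp
#include <vector>

// Sketch only -- simplified stand-ins, not the real mqbc::StorageManager code.
struct PartitionInfo {
    int      primaryNodeId  = -1;
    unsigned primaryLeaseId = 0;
};

enum class PartitionEvent { e_DETECT_SELF_PRIMARY, e_DETECT_SELF_REPLICA };

// Hypothetical stand-in for enqueueing an event to a Partition FSM.
void dispatchEventToPartition(int partitionId, PartitionEvent event)
{
    (void)partitionId;
    (void)event;
}

void onPartitionPrimaryAssignment(std::vector<PartitionInfo>& partitionInfoVec,
                                  int                         partitionId,
                                  int                         primaryNodeId,
                                  unsigned                    primaryLeaseId,
                                  int                         selfNodeId)
{
    // Part 1 of the fix: record the latest partition info *before* triggering
    // the Partition FSM transition, so FSM actions never read stale state
    // (making do_storePartitionInfo/do_clearPartitionInfo unnecessary).
    partitionInfoVec[partitionId].primaryNodeId  = primaryNodeId;
    partitionInfoVec[partitionId].primaryLeaseId = primaryLeaseId;

    dispatchEventToPartition(partitionId,
                             primaryNodeId == selfNodeId
                                 ? PartitionEvent::e_DETECT_SELF_PRIMARY
                                 : PartitionEvent::e_DETECT_SELF_REPLICA);
}
```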

After the fix, we can observe that the events happen in the correct order -- the Cluster FSM becomes healed before any Partition FSM can start:

east1            16Dec2025_22:55:29.741 (     6174732288) INFO     *qbc.clusterstatemanager mqbc_clusterstatemanager.cpp:1032 Cluster (itCluster): Committed advisory: [ rId = NULL choice = [ clusterMessage = [ choice = [ leaderAdvisory = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 1 ] partitions = [ [ partitionId = 0 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 1 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 2 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 3 primaryNodeId = 1 primaryLeaseId = 1 ] ] queues = [ ] ] ] ] ] ], with status 'SUCCESS'
east1            16Dec2025_22:55:29.779 (     6174732288) INFO     *bmqbrkr.mqbc.clusterfsm mqbc_clusterfsm.cpp:93  Cluster FSM on Event 'CSL_CMT_SUCCESS', transition: State 'LDR_HEALING_STG2' =>  State 'LDR_HEALED'
east1            16Dec2025_22:55:29.842 (     6171291648) INFO     *rkr.mqbc.storagemanager mqbc_storagemanager.cpp:508 Cluster (itCluster) Partition [0]: Self Transition to Primary in the Partition FSM.
east1            16Dec2025_22:55:29.848 (     6171291648) INFO     *qbrkr.mqbc.partitionfsm mqbc_partitionfsm.cpp:78  Partition FSM for Partition [0] on Event 'DETECT_SELF_PRIMARY', transition: State 'UNKNOWN' =>  State 'PRIMARY_HEALING_STG1'

@kaikulimu kaikulimu requested a review from a team as a code owner September 30, 2025 19:26
@kaikulimu kaikulimu changed the title from "Fix mqbc: Cluster FSM must heal before starting Partition FSMs" to "WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs" on Sep 30, 2025
@kaikulimu kaikulimu self-assigned this Sep 30, 2025
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 2 times, most recently from b2f0ae3 to 490d548 on October 17, 2025 21:01
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 3 times, most recently from ae6a352 to b5df771 on November 21, 2025 19:31
@kaikulimu kaikulimu changed the title from "WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs" to "Fix mqbc: Cluster FSM must heal before starting Partition FSMs" on Dec 16, 2025
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 4 times, most recently from 965868e to f9d52c6 on December 20, 2025 01:35
@kaikulimu kaikulimu requested a review from 678098 December 20, 2025 02:32
@kaikulimu kaikulimu assigned 678098 and unassigned kaikulimu Dec 20, 2025
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 3173 of commit f9d52c6 has completed with FAILURE

@kaikulimu kaikulimu assigned dorjesinpo and unassigned 678098 Dec 29, 2025
@kaikulimu kaikulimu removed the request for review from 678098 December 29, 2025 22:24
@dorjesinpo (Collaborator) left a comment

Can't we call initializeQueueKeyInfoMap (unconditionally) before any FSM event (and not as an FSM event) to avoid any possibility of a race?

mqbs::FileStore* fs = d_fileStores[partitionId].get();
PartitionInfo& pinfo = d_partitionInfoVec[partitionId];

if (primary != pinfo.primary()) {
Collaborator

There seems to be the same check in mqbc::StorageUtil::clearPrimaryForPartition already (without logging)

@kaikulimu (Collaborator Author) commented Jan 5, 2026

The same check used to be in mqbc::StorageUtil::clearPrimaryForPartition, but I removed it because I want to call this util method from mqbc::StorageManager::processShutdownEventDispatched and mqbc::StorageManager::stop, neither of which has easy access to the mqbnet::ClusterNode* primary.

Thus, I moved the check up to mqbblp::StorageManager::clearPrimaryForPartitionDispatched. You have reminded me to do the same check inside mqbc::StorageManager::clearPrimaryForPartitionDispatched. Will add.
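
Roughly, the hoisting looks like this (a sketch with simplified stand-ins; the real methods live on mqbc::StorageUtil and the StorageManager classes, and the early-return-on-mismatch behavior is an assumption for illustration):

```cpp
#include <vector>

// Sketch only: the primary check moves out of the shared util and into the
// dispatched caller, so shutdown paths without a ClusterNode* can still call
// the util unconditionally.
struct PartitionInfo {
    int primaryNodeId = -1;
};

// Stand-in for mqbc::StorageUtil::clearPrimaryForPartition: now clears
// unconditionally, with no primary comparison inside.
void clearPrimaryForPartition(std::vector<PartitionInfo>& infos, int partitionId)
{
    infos[partitionId].primaryNodeId = -1;
}

// Stand-in for the *Dispatched caller: the comparison (and any logging) is
// done here, where the caller actually knows who 'primary' is.
void clearPrimaryForPartitionDispatched(std::vector<PartitionInfo>& infos,
                                        int                         partitionId,
                                        int                         primaryNodeId)
{
    if (primaryNodeId != infos[partitionId].primaryNodeId) {
        return;  // not the recorded primary for this partition; nothing to clear
    }
    clearPrimaryForPartition(infos, partitionId);
}
```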

bsl::vector<PartitionFSMEventData> >
EventWithData;

struct PartitionFSMArgs {
Collaborator

Good simplification. Can we do the same for the Cluster FSM?

@kaikulimu (Collaborator Author) commented Jan 5, 2026

Yes, similar to the Partition FSM, we should have a singleton internal queue to process FSM events, instead of creating a new event queue every time and passing it as FSM args. Will make the change.
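
A minimal illustration of the idea (names and types are illustrative stand-ins, not the actual mqbc ClusterStateManager members):

```cpp
#include <queue>
#include <utility>

// Sketch only: the FSM owns a single internal event queue as a member, rather
// than constructing a fresh queue per event and passing it through FSM args.
enum class ClusterFsmEvent { e_CSL_CMT_SUCCESS, e_OTHER };
struct ClusterFsmEventData { /* event payload */ };

class ClusterFsmSketch {
  public:
    // Append an event; everything goes through the same member queue, so
    // events raised while handling a transition keep their order.
    void enqueueEvent(ClusterFsmEvent event, const ClusterFsmEventData& data)
    {
        d_eventQueue.push(std::make_pair(event, data));
    }

    // Drain the queue in FIFO order on the cluster dispatcher thread.
    void processEvents()
    {
        while (!d_eventQueue.empty()) {
            // ... apply the transition for d_eventQueue.front() ...
            d_eventQueue.pop();
        }
    }

  private:
    std::queue<std::pair<ClusterFsmEvent, ClusterFsmEventData> > d_eventQueue;
};
```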

queueSp));
mqbs::FileStore* fs = d_fileStores[partitionId].get();
BSLS_ASSERT_SAFE(fs);
if (fs->inDispatcherThread()) {
Collaborator

Is there any motivation behind the inDispatcherThread() check other than optimization? Does it fix some race, and if so, which one?

Collaborator Author

Before my PR, dispatchEventToPartition was only executed in the cluster dispatcher thread. I extended the method so it can be executed by either the cluster or the queue dispatcher thread. I added this inDispatcherThread() check mostly as an optimization, and to prevent unexpected logic when the PFSM is enqueueing events -- dispatching an event from the queue thread to itself introduces a gap and is prone to race conditions. I did not encounter a specific bug while developing the code, but I was following a similar code change by Alex in #929, in which he did report fixing a race condition. So I would say executing in place is more thread-safe in general.
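
A simplified sketch of the execute-in-place pattern being described (the dispatcher type below is a stand-in, not the actual mqbi/mqbs dispatcher API):

```cpp
#include <functional>

// Sketch only: if the caller is already on the target partition's dispatcher
// thread, apply the FSM event in place; otherwise enqueue it to that thread.
struct PartitionDispatcherSketch {
    bool inDispatcherThread() const { return d_onThisThread; }

    // Stand-in for enqueueing a functor to the partition's dispatcher thread.
    void execute(const std::function<void()>& functor) { functor(); }

    bool d_onThisThread = false;
};

void dispatchEventToPartitionSketch(PartitionDispatcherSketch*   partition,
                                    const std::function<void()>& applyEvent)
{
    if (partition->inDispatcherThread()) {
        // Already on the partition (queue) dispatcher thread: run in place and
        // avoid the self-dispatch gap in which other events could interleave.
        applyEvent();
    }
    else {
        // Called from the cluster dispatcher thread: hand off to the
        // partition's dispatcher thread.
        partition->execute(applyEvent);
    }
}
```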

BSLS_ASSERT_SAFE(d_dispatcher_p->inDispatcherThread(d_cluster_p));

if (d_isQueueKeyInfoMapVecInitialized) {
BALL_LOG_WARN << d_clusterData_p->identity().description()
Collaborator

Why remove this warning? Should it be an assertion?

@kaikulimu (Collaborator Author) commented Jan 6, 2026

initializeQueueKeyInfoMap needs to be called exactly once, when the Cluster FSM becomes healed for the first time (either here or here), and before the Partition FSMs are jumpstarted. If there is a change in leader, the Cluster FSM goes through the unknown-to-healed cycle again, but this time initializeQueueKeyInfoMap should not be called. The d_isQueueKeyInfoMapVecInitialized flag prevents the method from being called again. Therefore, we have no need to warn or alarm if we return early -- it simply indicates a change in leader.
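
In other words, the early return is the expected path on every heal cycle after the first. A minimal sketch, with simplified stand-ins for the real members:

```cpp
// Sketch only: the flag makes initialization a strict one-shot; later
// unknown->healed cycles (leader change) hit the early return silently.
struct QueueKeyInfoInitSketch {
    bool d_isQueueKeyInfoMapVecInitialized = false;

    void initializeQueueKeyInfoMap()
    {
        if (d_isQueueKeyInfoMapVecInitialized) {
            // Leader changed and the Cluster FSM healed again: nothing to do,
            // and nothing worth warning about.
            return;
        }

        // ... populate the per-partition queue key info maps from the
        //     (now up-to-date) cluster state, on the cluster thread ...

        d_isQueueKeyInfoMapVecInitialized = true;
    }
};
```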

}

if (primaryNode->nodeId() ==
d_clusterData_p->membership().selfNode()->nodeId()) {
Collaborator

Should mqbnet::ClusterImp::d_selfNodeId be const?

Collaborator Author

Yes

Collaborator Author

Changed mqbnet::ClusterImp::d_name to const too


- dispatchEventToPartition(fs,
-                          PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
+ dispatchEventToPartition(PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
Collaborator

It looks like e_DETECT_SELF_PRIMARY can be enqueued twice (depending on the race): by StorageManager::initializeQueueKeyInfoMap and by setPrimaryForPartitionDispatched. If so, this is undesirable.

Same about e_DETECT_SELF_REPLICA.

@kaikulimu (Collaborator Author) commented Jan 6, 2026

This is an important design decision I made for this PR, so please suggest alternatives if you think there is a better solution. processPrimaryDetect/processReplicaDetect triggers the healing process in the Partition FSMs by enqueueing an e_DETECT_SELF_PRIMARY/e_DETECT_SELF_REPLICA event. processPrimaryDetect/processReplicaDetect is called during setPrimaryForPartitionDispatched, which happens whenever we commit a CSL advisory. The first CSL advisory commit transitions us to a healed leader/follower, and at this time we must run initializeQueueKeyInfoMap before jumpstarting the Partition FSMs -- this is the race you reported and that I am trying to fix. Therefore, this PR's implementation blocks processPrimaryDetect/processReplicaDetect from being called during setPrimaryForPartitionDispatched if d_isQueueKeyInfoMapVecInitialized is false. Instead, initializeQueueKeyInfoMap calls processPrimaryDetect/processReplicaDetect near the end of the method. Future calls of setPrimaryForPartitionDispatched will not be blocked by d_isQueueKeyInfoMapVecInitialized and thus will trigger the Partition FSMs normally.
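
A condensed sketch of that flow. The member and method names follow the comment above, but the surrounding signatures, the selfIsPrimary parameter, and the partition-count constant are simplified stand-ins:

```cpp
// Sketch only: detect calls are gated on the map being ready, and the
// initializer jumpstarts the Partition FSMs itself once it finishes.
struct StorageManagerSketch {
    static const int k_NUM_PARTITIONS = 4;  // hypothetical constant

    bool d_isQueueKeyInfoMapVecInitialized = false;

    // Called on every CSL advisory commit.
    void setPrimaryForPartitionDispatched(int partitionId, bool selfIsPrimary)
    {
        // Partition info is always recorded first (part 1 of the fix) ...

        if (!d_isQueueKeyInfoMapVecInitialized) {
            // First commit: the queue key info map is not ready, so do not
            // jumpstart the Partition FSM here; initializeQueueKeyInfoMap
            // will do it once the map is populated.
            return;
        }

        if (selfIsPrimary) {
            processPrimaryDetect(partitionId);
        }
        else {
            processReplicaDetect(partitionId);
        }
    }

    // Called once, when the Cluster FSM first becomes healed.
    void initializeQueueKeyInfoMap()
    {
        if (d_isQueueKeyInfoMapVecInitialized) {
            return;
        }
        // ... copy queue info from the cluster state into the map ...
        d_isQueueKeyInfoMapVecInitialized = true;

        // Now it is safe to jumpstart every Partition FSM exactly once.
        for (int pid = 0; pid < k_NUM_PARTITIONS; ++pid) {
            // processPrimaryDetect(pid) or processReplicaDetect(pid),
            // depending on the recorded primary for 'pid'.
        }
    }

    void processPrimaryDetect(int partitionId)
    {
        (void)partitionId;  // would enqueue e_DETECT_SELF_PRIMARY
    }

    void processReplicaDetect(int partitionId)
    {
        (void)partitionId;  // would enqueue e_DETECT_SELF_REPLICA
    }
};
```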

@dorjesinpo dorjesinpo assigned kaikulimu and unassigned dorjesinpo Jan 5, 2026
@kaikulimu (Collaborator Author)

> Can't we call initializeQueueKeyInfoMap (unconditionally) before any FSM event (and not as an FSM event) to avoid any possibility of a race?

initializeQueueKeyInfoMap copies information from the cluster state into d_queueKeyInfoMapVec so that partition threads can access that information in a thread-safe manner. It must be called after the Cluster FSM is healed, to ensure that the cluster state is up-to-date, and before the Partition FSMs attempt to open the file store. Hence, we can't really call it before any FSM event.
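
For illustration only, a tiny sketch of that copy step (the key and value types here are made-up stand-ins, not the real mqbc types):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch only: one map per partition, populated once on the cluster thread
// after the Cluster FSM heals; partition threads later read their own map
// without touching the live cluster state.
struct QueueInfoSketch {
    std::string uri;
    int         partitionId = 0;
    // ... queue key, app ids, etc. ...
};

typedef std::map<std::string, QueueInfoSketch> QueueKeyInfoMapSketch;

void buildQueueKeyInfoMapVec(const std::vector<QueueInfoSketch>& clusterStateQueues,
                             std::vector<QueueKeyInfoMapSketch>* queueKeyInfoMapVec)
{
    for (const QueueInfoSketch& queue : clusterStateQueues) {
        (*queueKeyInfoMapVec)[queue.partitionId][queue.uri] = queue;
    }
}
```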

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 3180 of commit 038355a has completed with FAILURE

@kaikulimu kaikulimu assigned dorjesinpo and unassigned kaikulimu Jan 7, 2026