
Conversation

@kaikulimu (Collaborator) commented Sep 30, 2025

Today in the Cluster FSM, the leader becomes healed upon the CSL commit success of the first leader advisory, at which point it initializes the queue key info map on the cluster thread. Near-identical logic exists on the follower node.

At the same time, the CSL commit callback fires and triggers the onPartitionPrimaryAssignment observer, which jumpstarts the Partition FSM. The Partition FSM will eventually attempt to open the FileStore in the partition thread, which accesses the queue key info map. There is a slight chance of a race condition between the initialization on the cluster thread and this access on the partition thread, so let's fix it.

The fix consists of two parts:

  1. We always update the partition information before triggering the corresponding transitions in the Partition FSMs, so that PFSMs always have access to the latest info. As a result, the PFSM actions do_storePartitionInfo and do_clearPartitionInfo are removed (a sketch of this ordering follows the list).
  2. We make sure PFSMs are only jumpstarted after the queue key info map is initialized. We achieve this by delaying the calls to processPrimaryDetect/processReplicaDetect until initializeQueueKeyInfoMap has completed.
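
A minimal sketch of the ordering in part 1, using simplified stand-in types and functions (PartitionInfo's fields, the free dispatchEventToPartition signature, and the event enum below are illustrative placeholders, not the actual mqbc code):

```cpp
#include <vector>

// Sketch only -- simplified stand-ins, not the real mqbc::StorageManager code.
struct PartitionInfo {
    int      primaryNodeId  = -1;
    unsigned primaryLeaseId = 0;
};

enum class PartitionEvent { e_DETECT_SELF_PRIMARY, e_DETECT_SELF_REPLICA };

// Hypothetical stand-in for enqueueing an event to a Partition FSM.
void dispatchEventToPartition(int partitionId, PartitionEvent event)
{
    (void)partitionId;
    (void)event;
}

void onPartitionPrimaryAssignment(std::vector<PartitionInfo>& partitionInfoVec,
                                  int                         partitionId,
                                  int                         primaryNodeId,
                                  unsigned                    primaryLeaseId,
                                  int                         selfNodeId)
{
    // Part 1 of the fix: record the latest partition info *before* triggering
    // the Partition FSM transition, so FSM actions never read stale state
    // (making do_storePartitionInfo/do_clearPartitionInfo unnecessary).
    partitionInfoVec[partitionId].primaryNodeId  = primaryNodeId;
    partitionInfoVec[partitionId].primaryLeaseId = primaryLeaseId;

    dispatchEventToPartition(partitionId,
                             primaryNodeId == selfNodeId
                                 ? PartitionEvent::e_DETECT_SELF_PRIMARY
                                 : PartitionEvent::e_DETECT_SELF_REPLICA);
}
```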

After the fix, we can observe that the events happen in the correct order -- the Cluster FSM becomes healed before any Partition FSM can start:

east1            16Dec2025_22:55:29.741 (     6174732288) INFO     *qbc.clusterstatemanager mqbc_clusterstatemanager.cpp:1032 Cluster (itCluster): Committed advisory: [ rId = NULL choice = [ clusterMessage = [ choice = [ leaderAdvisory = [ sequenceNumber = [ electorTerm = 1 sequenceNumber = 1 ] partitions = [ [ partitionId = 0 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 1 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 2 primaryNodeId = 1 primaryLeaseId = 1 ] [ partitionId = 3 primaryNodeId = 1 primaryLeaseId = 1 ] ] queues = [ ] ] ] ] ] ], with status 'SUCCESS'
east1            16Dec2025_22:55:29.779 (     6174732288) INFO     *bmqbrkr.mqbc.clusterfsm mqbc_clusterfsm.cpp:93  Cluster FSM on Event 'CSL_CMT_SUCCESS', transition: State 'LDR_HEALING_STG2' =>  State 'LDR_HEALED'
east1            16Dec2025_22:55:29.842 (     6171291648) INFO     *rkr.mqbc.storagemanager mqbc_storagemanager.cpp:508 Cluster (itCluster) Partition [0]: Self Transition to Primary in the Partition FSM.
east1            16Dec2025_22:55:29.848 (     6171291648) INFO     *qbrkr.mqbc.partitionfsm mqbc_partitionfsm.cpp:78  Partition FSM for Partition [0] on Event 'DETECT_SELF_PRIMARY', transition: State 'UNKNOWN' =>  State 'PRIMARY_HEALING_STG1'

@kaikulimu kaikulimu requested a review from a team as a code owner September 30, 2025 19:26
@kaikulimu kaikulimu changed the title from "Fix mqbc: Cluster FSM must heal before starting Partition FSMs" to "WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs" on Sep 30, 2025
@kaikulimu kaikulimu self-assigned this Sep 30, 2025
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 2 times, most recently from b2f0ae3 to 490d548 on October 17, 2025 21:01
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 3 times, most recently from ae6a352 to b5df771 on November 21, 2025 19:31
@kaikulimu kaikulimu changed the title from "WIP Fix mqbc: Cluster FSM must heal before starting Partition FSMs" to "Fix mqbc: Cluster FSM must heal before starting Partition FSMs" on Dec 16, 2025
@kaikulimu kaikulimu force-pushed the qkeyinfomap-race branch 4 times, most recently from 965868e to f9d52c6 on December 20, 2025 01:35
@kaikulimu kaikulimu requested a review from 678098 December 20, 2025 02:32
@kaikulimu kaikulimu assigned 678098 and unassigned kaikulimu Dec 20, 2025
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 3173 of commit f9d52c6 has completed with FAILURE

@kaikulimu kaikulimu assigned dorjesinpo and unassigned 678098 Dec 29, 2025
@kaikulimu kaikulimu removed the request for review from 678098 December 29, 2025 22:24
@dorjesinpo (Collaborator) left a comment

Can't we call initializeQueueKeyInfoMap (unconditionally) before any FSM event (and not as an FSM event) to avoid any possibility of a race?

mqbs::FileStore* fs = d_fileStores[partitionId].get();
PartitionInfo& pinfo = d_partitionInfoVec[partitionId];

if (primary != pinfo.primary()) {
Collaborator

There seems to be the same check in mqbc::StorageUtil::clearPrimaryForPartition already (without logging)

@kaikulimu (Collaborator Author) commented Jan 5, 2026

The same check used to be in mqbc::StorageUtil::clearPrimaryForPartition, but I removed it because I want to call this util method from mqbc::StorageManager::processShutdownEventDispatched and mqbc::StorageManager::stop, neither of which has easy access to the mqbnet::ClusterNode* primary.

Thus, I moved the check up to mqbblp::StorageManager::clearPrimaryForPartitionDispatched. You have reminded me to do the same check inside mqbc::StorageManager::clearPrimaryForPartitionDispatched. Will add.
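
Roughly, the hoisting looks like this (a sketch with simplified stand-ins; the real methods live on mqbc::StorageUtil and the StorageManager classes, and the early-return-on-mismatch behavior is an assumption for illustration):

```cpp
#include <vector>

// Sketch only: the primary check moves out of the shared util and into the
// dispatched caller, so shutdown paths without a ClusterNode* can still call
// the util unconditionally.
struct PartitionInfo {
    int primaryNodeId = -1;
};

// Stand-in for mqbc::StorageUtil::clearPrimaryForPartition: now clears
// unconditionally, with no primary comparison inside.
void clearPrimaryForPartition(std::vector<PartitionInfo>& infos, int partitionId)
{
    infos[partitionId].primaryNodeId = -1;
}

// Stand-in for the *Dispatched caller: the comparison (and any logging) is
// done here, where the caller actually knows who 'primary' is.
void clearPrimaryForPartitionDispatched(std::vector<PartitionInfo>& infos,
                                        int                         partitionId,
                                        int                         primaryNodeId)
{
    if (primaryNodeId != infos[partitionId].primaryNodeId) {
        return;  // not the recorded primary for this partition; nothing to clear
    }
    clearPrimaryForPartition(infos, partitionId);
}
```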

bsl::vector<PartitionFSMEventData> >
EventWithData;

struct PartitionFSMArgs {
Collaborator

Good simplification. Can we do the same for the Cluster FSM?

@kaikulimu (Collaborator Author) commented Jan 5, 2026

Yes, similar to the Partition FSM, we should have a singleton internal queue to process FSM events, instead of creating a new event queue every time and passing it as FSM args. Will make the change.
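
A minimal illustration of the idea (names and types are illustrative stand-ins, not the actual mqbc ClusterStateManager members):

```cpp
#include <queue>
#include <utility>

// Sketch only: the FSM owns a single internal event queue as a member, rather
// than constructing a fresh queue per event and passing it through FSM args.
enum class ClusterFsmEvent { e_CSL_CMT_SUCCESS, e_OTHER };
struct ClusterFsmEventData { /* event payload */ };

class ClusterFsmSketch {
  public:
    // Append an event; everything goes through the same member queue, so
    // events raised while handling a transition keep their order.
    void enqueueEvent(ClusterFsmEvent event, const ClusterFsmEventData& data)
    {
        d_eventQueue.push(std::make_pair(event, data));
    }

    // Drain the queue in FIFO order on the cluster dispatcher thread.
    void processEvents()
    {
        while (!d_eventQueue.empty()) {
            // ... apply the transition for d_eventQueue.front() ...
            d_eventQueue.pop();
        }
    }

  private:
    std::queue<std::pair<ClusterFsmEvent, ClusterFsmEventData> > d_eventQueue;
};
```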

queueSp));
mqbs::FileStore* fs = d_fileStores[partitionId].get();
BSLS_ASSERT_SAFE(fs);
if (fs->inDispatcherThread()) {
Collaborator

Is there any motivation behind the inDispatcherThread() check other than optimization? Does it fix some race, and if so, which one?

Collaborator Author

Before my PR, dispatchEventToPartition was only executed in the cluster dispatcher thread. I extended the method so it can be executed by either the cluster or the queue dispatcher thread. I added this inDispatcherThread() check mostly as an optimization, and to prevent unexpected logic when the PFSM is enqueueing events -- dispatching an event from the queue thread to itself introduces a gap and is prone to race conditions. I did not encounter a specific bug while developing the code, but I was following a similar code change by Alex in #929, in which he did report fixing a race condition. So I would say executing in place is more thread-safe in general.
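
A simplified sketch of the execute-in-place pattern being described (the dispatcher type below is a stand-in, not the actual mqbi/mqbs dispatcher API):

```cpp
#include <functional>

// Sketch only: if the caller is already on the target partition's dispatcher
// thread, apply the FSM event in place; otherwise enqueue it to that thread.
struct PartitionDispatcherSketch {
    bool inDispatcherThread() const { return d_onThisThread; }

    // Stand-in for enqueueing a functor to the partition's dispatcher thread.
    void execute(const std::function<void()>& functor) { functor(); }

    bool d_onThisThread = false;
};

void dispatchEventToPartitionSketch(PartitionDispatcherSketch*   partition,
                                    const std::function<void()>& applyEvent)
{
    if (partition->inDispatcherThread()) {
        // Already on the partition (queue) dispatcher thread: run in place and
        // avoid the self-dispatch gap in which other events could interleave.
        applyEvent();
    }
    else {
        // Called from the cluster dispatcher thread: hand off to the
        // partition's dispatcher thread.
        partition->execute(applyEvent);
    }
}
```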

BSLS_ASSERT_SAFE(d_dispatcher_p->inDispatcherThread(d_cluster_p));

if (d_isQueueKeyInfoMapVecInitialized) {
BALL_LOG_WARN << d_clusterData_p->identity().description()
Collaborator

Why remove this warning? Should it be an assertion?

@kaikulimu (Collaborator Author) commented Jan 6, 2026

initializeQueueKeyInfoMap needs to be called exactly once, when the Cluster FSM becomes healed for the first time (either here or here), and before the Partition FSMs are jumpstarted. If there is a change in leader, the Cluster FSM goes through the unknown-to-healed cycle again, but this time initializeQueueKeyInfoMap should not be called. The d_isQueueKeyInfoMapVecInitialized flag prevents the method from being called again. Therefore, we have no need to warn or alarm if we return early -- it simply indicates a change in leader.
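
In other words, the early return is the expected path on every heal cycle after the first. A minimal sketch, with simplified stand-ins for the real members:

```cpp
// Sketch only: the flag makes initialization a strict one-shot; later
// unknown->healed cycles (leader change) hit the early return silently.
struct QueueKeyInfoInitSketch {
    bool d_isQueueKeyInfoMapVecInitialized = false;

    void initializeQueueKeyInfoMap()
    {
        if (d_isQueueKeyInfoMapVecInitialized) {
            // Leader changed and the Cluster FSM healed again: nothing to do,
            // and nothing worth warning about.
            return;
        }

        // ... populate the per-partition queue key info maps from the
        //     (now up-to-date) cluster state, on the cluster thread ...

        d_isQueueKeyInfoMapVecInitialized = true;
    }
};
```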

}

if (primaryNode->nodeId() ==
d_clusterData_p->membership().selfNode()->nodeId()) {
Collaborator

Should mqbnet::ClusterImp::d_selfNodeId be const?

Collaborator Author

Yes

Collaborator Author

Changed mqbnet::ClusterImp::d_name to const too


- dispatchEventToPartition(fs,
-                          PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
+ dispatchEventToPartition(PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
Collaborator

It looks like e_DETECT_SELF_PRIMARY can be enqueued twice (depending on the race): by StorageManager::initializeQueueKeyInfoMap and by setPrimaryForPartitionDispatched. If so, this is undesirable.

Same about e_DETECT_SELF_REPLICA.

@kaikulimu (Collaborator Author) commented Jan 6, 2026

This is an important design decision I made for this PR, so please suggest alternatives if you think there is a better solution. processPrimaryDetect/processReplicaDetect triggers the healing process in the Partition FSMs by enqueueing an e_DETECT_SELF_PRIMARY/e_DETECT_SELF_REPLICA event. processPrimaryDetect/processReplicaDetect is called during setPrimaryForPartitionDispatched, which happens whenever we commit a CSL advisory. The first CSL advisory commit transitions us to a healed leader/follower, and at this time we must run initializeQueueKeyInfoMap before jumpstarting the Partition FSMs -- this is the race you reported and that I am trying to fix. Therefore, this PR's implementation blocks processPrimaryDetect/processReplicaDetect from being called during setPrimaryForPartitionDispatched if d_isQueueKeyInfoMapVecInitialized is false. Instead, initializeQueueKeyInfoMap calls processPrimaryDetect/processReplicaDetect near the end of the method. Future calls of setPrimaryForPartitionDispatched will not be blocked by d_isQueueKeyInfoMapVecInitialized and thus will trigger the Partition FSMs normally.
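
A condensed sketch of that flow. The member and method names follow the comment above, but the surrounding signatures, the selfIsPrimary parameter, and the partition-count constant are simplified stand-ins:

```cpp
// Sketch only: detect calls are gated on the map being ready, and the
// initializer jumpstarts the Partition FSMs itself once it finishes.
struct StorageManagerSketch {
    static const int k_NUM_PARTITIONS = 4;  // hypothetical constant

    bool d_isQueueKeyInfoMapVecInitialized = false;

    // Called on every CSL advisory commit.
    void setPrimaryForPartitionDispatched(int partitionId, bool selfIsPrimary)
    {
        // Partition info is always recorded first (part 1 of the fix) ...

        if (!d_isQueueKeyInfoMapVecInitialized) {
            // First commit: the queue key info map is not ready, so do not
            // jumpstart the Partition FSM here; initializeQueueKeyInfoMap
            // will do it once the map is populated.
            return;
        }

        if (selfIsPrimary) {
            processPrimaryDetect(partitionId);
        }
        else {
            processReplicaDetect(partitionId);
        }
    }

    // Called once, when the Cluster FSM first becomes healed.
    void initializeQueueKeyInfoMap()
    {
        if (d_isQueueKeyInfoMapVecInitialized) {
            return;
        }
        // ... copy queue info from the cluster state into the map ...
        d_isQueueKeyInfoMapVecInitialized = true;

        // Now it is safe to jumpstart every Partition FSM exactly once.
        for (int pid = 0; pid < k_NUM_PARTITIONS; ++pid) {
            // processPrimaryDetect(pid) or processReplicaDetect(pid),
            // depending on the recorded primary for 'pid'.
        }
    }

    void processPrimaryDetect(int partitionId)
    {
        (void)partitionId;  // would enqueue e_DETECT_SELF_PRIMARY
    }

    void processReplicaDetect(int partitionId)
    {
        (void)partitionId;  // would enqueue e_DETECT_SELF_REPLICA
    }
};
```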

@dorjesinpo dorjesinpo assigned kaikulimu and unassigned dorjesinpo Jan 5, 2026
@kaikulimu (Collaborator Author)

> Can't we call initializeQueueKeyInfoMap (unconditionally) before any FSM event (and not as an FSM event) to avoid any possibility of a race?

initializeQueueKeyInfoMap copies information from the cluster state into d_queueKeyInfoMapVec so that partition threads can access that information in a thread-safe manner. It must be called after the Cluster FSM is healed, to ensure that the cluster state is up-to-date, and before the Partition FSMs attempt to open the file store. Hence, we can't really call it before any FSM event.
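
For illustration only, a tiny sketch of that copy step (the key and value types here are made-up stand-ins, not the real mqbc types):

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch only: one map per partition, populated once on the cluster thread
// after the Cluster FSM heals; partition threads later read their own map
// without touching the live cluster state.
struct QueueInfoSketch {
    std::string uri;
    int         partitionId = 0;
    // ... queue key, app ids, etc. ...
};

typedef std::map<std::string, QueueInfoSketch> QueueKeyInfoMapSketch;

void buildQueueKeyInfoMapVec(const std::vector<QueueInfoSketch>& clusterStateQueues,
                             std::vector<QueueKeyInfoMapSketch>* queueKeyInfoMapVec)
{
    for (const QueueInfoSketch& queue : clusterStateQueues) {
        (*queueKeyInfoMapVec)[queue.partitionId][queue.uri] = queue;
    }
}
```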

Signed-off-by: Yuan Jing Vincent Yan <[email protected]>
@bmq-oss-ci bmq-oss-ci bot left a comment

Build 3180 of commit 038355a has completed with FAILURE

@kaikulimu kaikulimu assigned dorjesinpo and unassigned kaikulimu Jan 7, 2026