Fix mqbc: Cluster FSM must heal before starting Partition FSMs #951
Conversation
Build 3173 of commit f9d52c6 has completed with FAILURE
dorjesinpo left a comment:
Can't we call initializeQueueKeyInfoMap (unconditionally) before any FSM event (and not as an FSM event), to avoid any possibility of a race?
mqbs::FileStore* fs = d_fileStores[partitionId].get();
PartitionInfo& pinfo = d_partitionInfoVec[partitionId];

if (primary != pinfo.primary()) {
There seems to be the same check in mqbc::StorageUtil::clearPrimaryForPartition already (without logging)
The same check used to be in mqbc::StorageUtil::clearPrimaryForPartition, but I removed it because I want to call this utility method from mqbc::StorageManager::processShutdownEventDispatched and mqbc::StorageManager::stop, neither of which has easy access to mqbnet::ClusterNode* primary.
Thus, I moved the check up into mqbblp::StorageManager::clearPrimaryForPartitionDispatched. You also reminded me to add the same check inside mqbc::StorageManager::clearPrimaryForPartitionDispatched; will add.
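For illustration, here is a minimal sketch of where the check lands after that move. The class and type names below are simplified stand-ins, not the actual mqbblp/mqbc signatures: the dispatched entry point filters out callers whose notion of the primary is stale, while the shared utility clears unconditionally so that shutdown paths without a primary pointer can reuse it.

```cpp
// Illustrative sketch only: simplified stand-ins for the real mqbc/mqbblp
// classes, showing where the duplicate-primary check could live once it is
// removed from the shared clearPrimaryForPartition() utility.
#include <iostream>

struct ClusterNode {};  // stand-in for mqbnet::ClusterNode

struct PartitionInfo {
    const ClusterNode* d_primary = nullptr;
    const ClusterNode* primary() const { return d_primary; }
};

class StorageManager {
  public:
    // Caller-facing entry point: performs the check that used to be inside
    // the shared utility, because the utility is now also invoked from
    // shutdown paths that have no 'primary' pointer at hand.
    void clearPrimaryForPartitionDispatched(int                partitionId,
                                            const ClusterNode* primary)
    {
        PartitionInfo& pinfo = d_partitionInfo;
        if (primary != pinfo.primary()) {
            std::cout << "Ignoring stale clear-primary for partition "
                      << partitionId << "\n";
            return;  // not the partition's current primary; nothing to do
        }
        clearPrimaryForPartition(partitionId);  // shared utility, no check
    }

  private:
    void clearPrimaryForPartition(int partitionId)
    {
        d_partitionInfo.d_primary = nullptr;  // unconditional reset
        (void)partitionId;
    }

    PartitionInfo d_partitionInfo;  // one partition only, for brevity
};

int main()
{
    StorageManager sm;
    ClusterNode    stale;
    sm.clearPrimaryForPartitionDispatched(1, &stale);  // ignored: not current
}
```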
bsl::vector<PartitionFSMEventData> >
    EventWithData;

struct PartitionFSMArgs {
Good simplification. Can we do the same for the Cluster FSM?
Yes. Similar to the Partition FSM, we should have a singleton internal queue to process FSM events, instead of creating a new event queue every time and passing it as FSM args. Will make the change.
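A minimal sketch of one possible shape of that planned change, using simplified standard-library types instead of the real BDE/mqbc ones (the event names in main() are hypothetical): the FSM owns a single internal queue that callers append to, and events are drained in FIFO order, so no per-call queue needs to be created or passed as an argument.

```cpp
#include <iostream>
#include <queue>
#include <string>

class ClusterFSM {
  public:
    // Callers only enqueue; they never construct or own a queue of their own.
    void enqueueEvent(const std::string& event)
    {
        d_eventQueue.push(event);
        drain();
    }

  private:
    // Process events in FIFO order until the internal queue is empty; events
    // enqueued while processing are picked up in the same loop.
    void drain()
    {
        while (!d_eventQueue.empty()) {
            const std::string event = d_eventQueue.front();
            d_eventQueue.pop();
            std::cout << "Applying Cluster FSM event: " << event << "\n";
        }
    }

    std::queue<std::string> d_eventQueue;  // the singleton internal queue
};

int main()
{
    ClusterFSM fsm;
    fsm.enqueueEvent("e_SLCT_LDR");         // hypothetical event names
    fsm.enqueueEvent("e_CSL_CMT_SUCCESS");  // for illustration only
}
```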
        queueSp));
mqbs::FileStore* fs = d_fileStores[partitionId].get();
BSLS_ASSERT_SAFE(fs);
if (fs->inDispatcherThread()) {
Is there any motivation behind the inDispatcherThread() check other than optimization? Does it fix some race, and if so, which one?
Before my PR, dispatchEventToPartition was only executed in the cluster dispatcher thread. I extended the method so that it can be executed by either the cluster or a queue dispatcher thread. I added this inDispatcherThread() check mostly for optimization, and to prevent unexpected logic when the PFSM is enqueueing events: dispatching an event from the queue thread back to itself introduces a gap and is prone to race conditions. I did not encounter a specific bug while developing the code, but I was following a similar change by Alex in #929, in which he did report fixing a race condition. So I would say executing in place is more thread-safe in general.
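A generic sketch of the execute-in-place pattern being described, with a simplified stand-in dispatcher rather than the real mqbi/mqbs interfaces: if the caller is already on the target dispatcher thread the callback runs inline, avoiding the re-dispatch gap; otherwise it is handed off to that thread.

```cpp
#include <functional>
#include <iostream>
#include <thread>

class Dispatcher {
  public:
    Dispatcher() : d_threadId(std::this_thread::get_id()) {}

    bool inDispatcherThread() const
    {
        return std::this_thread::get_id() == d_threadId;
    }

    void execute(const std::function<void()>& job)
    {
        // The real code would enqueue 'job' onto the dispatcher's own
        // thread; here we only note the hand-off and run it.
        std::cout << "queued for dispatcher thread\n";
        job();
    }

  private:
    std::thread::id d_threadId;
};

void dispatchEventToPartition(Dispatcher& fs, const std::function<void()>& job)
{
    if (fs.inDispatcherThread()) {
        job();            // already on the partition's thread: no re-dispatch gap
    }
    else {
        fs.execute(job);  // hop to the partition's dispatcher thread
    }
}

int main()
{
    Dispatcher fs;
    dispatchEventToPartition(fs, [] { std::cout << "apply PFSM event\n"; });
}
```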
BSLS_ASSERT_SAFE(d_dispatcher_p->inDispatcherThread(d_cluster_p));

if (d_isQueueKeyInfoMapVecInitialized) {
    BALL_LOG_WARN << d_clusterData_p->identity().description()
Why remove this warning? Should it be an assertion?
initializeQueueKeyInfoMap needs to be called exactly once, when the Cluster FSM becomes healed for the first time (either here or here), and before the Partition FSMs are jumpstarted. If there is a change in leader, the Cluster FSM goes through the unknown-to-healed cycle again, but this time initializeQueueKeyInfoMap should not be called; the d_isQueueKeyInfoMapVecInitialized flag prevents the method from being called again. Therefore, there is no need to warn or alarm if we return early -- it simply indicates a change in leader.
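A minimal sketch of the once-only guard being described, with simplified names: the first healed transition populates the map, and later unknown-to-healed cycles caused by a leader change return early without warning, since that is an expected path.

```cpp
#include <iostream>

class StorageManager {
  public:
    void initializeQueueKeyInfoMap()
    {
        if (d_isQueueKeyInfoMapVecInitialized) {
            return;  // leader changed and the FSM healed again; already done
        }

        std::cout << "populating queue key info map from the CSL snapshot\n";
        // ... populate the per-partition queue key info here ...

        d_isQueueKeyInfoMapVecInitialized = true;
    }

  private:
    bool d_isQueueKeyInfoMapVecInitialized = false;
};

int main()
{
    StorageManager sm;
    sm.initializeQueueKeyInfoMap();  // first heal: initializes
    sm.initializeQueueKeyInfoMap();  // after leader change: silent no-op
}
```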
}

if (primaryNode->nodeId() ==
    d_clusterData_p->membership().selfNode()->nodeId()) {
Should mqbnet::ClusterImp::d_selfNodeId be const?
Yes
Changed mqbnet::ClusterImp::d_name to const too
dispatchEventToPartition(fs,
                         PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
dispatchEventToPartition(PartitionFSM::Event::e_DETECT_SELF_PRIMARY,
It looks like e_DETECT_SELF_PRIMARY can be enqueued twice (depending on the race): once by StorageManager::initializeQueueKeyInfoMap and once by setPrimaryForPartitionDispatched. If so, this is undesirable.
The same applies to e_DETECT_SELF_REPLICA.
This is an important design decision I made for this PR; please suggest alternatives if you think there is a better solution.

processPrimaryDetect/processReplicaDetect triggers the healing process in the Partition FSMs by enqueueing an e_DETECT_SELF_PRIMARY/e_DETECT_SELF_REPLICA event. processPrimaryDetect/processReplicaDetect is called during setPrimaryForPartitionDispatched, which happens whenever we commit a CSL advisory. The first CSL advisory commit transitions us to a healed leader/follower, and at that point we must run initializeQueueKeyInfoMap before jumpstarting the Partition FSMs -- this is the race you reported and that I am trying to fix.

Therefore, this PR's implementation blocks processPrimaryDetect/processReplicaDetect from being called during setPrimaryForPartitionDispatched if d_isQueueKeyInfoMapVecInitialized is false. Instead, initializeQueueKeyInfoMap calls processPrimaryDetect/processReplicaDetect near its end. Future calls to setPrimaryForPartitionDispatched will not be blocked by d_isQueueKeyInfoMapVecInitialized and thus will trigger the Partition FSMs normally.
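A minimal sketch of that ordering, again with simplified stand-ins for the real classes: setPrimaryForPartitionDispatched skips the Partition FSM kick-off while the queue key info map is uninitialized, and initializeQueueKeyInfoMap performs the first kick-off itself once the map is ready.

```cpp
#include <iostream>

class StorageManager {
  public:
    // Runs on every CSL advisory commit.
    void setPrimaryForPartitionDispatched(int partitionId, bool selfIsPrimary)
    {
        if (!d_isQueueKeyInfoMapVecInitialized) {
            return;  // first advisory: initializeQueueKeyInfoMap() will
                     // jumpstart the Partition FSM instead
        }
        kickOffPartitionFSM(partitionId, selfIsPrimary);
    }

    // Runs once, when the Cluster FSM first becomes healed.
    void initializeQueueKeyInfoMap(int partitionId, bool selfIsPrimary)
    {
        // ... populate the queue key info map first ...
        d_isQueueKeyInfoMapVecInitialized = true;
        kickOffPartitionFSM(partitionId, selfIsPrimary);  // now it is safe
    }

  private:
    void kickOffPartitionFSM(int partitionId, bool selfIsPrimary)
    {
        std::cout << "partition " << partitionId << ": enqueue "
                  << (selfIsPrimary ? "e_DETECT_SELF_PRIMARY"
                                    : "e_DETECT_SELF_REPLICA")
                  << "\n";
    }

    bool d_isQueueKeyInfoMapVecInitialized = false;
};

int main()
{
    StorageManager sm;
    sm.setPrimaryForPartitionDispatched(0, true);  // gated: no FSM event yet
    sm.initializeQueueKeyInfoMap(0, true);         // initializes, then kicks off
    sm.setPrimaryForPartitionDispatched(0, true);  // later commits pass through
}
```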
Build 3180 of commit 038355a has completed with FAILURE
Today in the Cluster FSM, the leader becomes healed upon the CSL commit success of the first leader advisory, at which point it initializes the queue key info map on the cluster thread. Near-identical logic exists on the follower node.

At the same time, the CSL commit callback fires and triggers the onPartitionPrimaryAssignment observer, which jumpstarts the Partition FSM. The Partition FSM will eventually attempt to open the FileStore in the partition thread, which accesses the queue key info map. There is a slight chance of a race condition, so let's fix it.

The fix consists of two parts:
1. do_storePartitionInfo and do_clearPartitionInfo are removed.
2. processPrimaryDetect/processReplicaDetect are blocked until initializeQueueKeyInfoMap has completed.

After the fix, we can observe that the events happen in the correct order: the Cluster FSM becomes healed before any Partition FSM can start.