In one of the Cascade evaluation experiments, I need to issue an ordered send that asks all nodes in a shard to dump their timestamp logs. This mechanism works properly in most cases. But in the special case where the containing subgroup is the second subgroup of the same type, the ordered send never finishes. Tracing the execution with gdb, I found that MulticastGroup::delivery_trigger() is not triggered properly in the predicate thread on the sender node. For the very first ordered send, locally_stable_sst_messages[subgroup_num] is not empty and the first sequence number, read into least_undelivered_sst_seq_num, is 0. However, min_stable_num is unexpectedly -1, so the condition guarding the delivery code is false and that code is skipped. The message never gets delivered.
I'm not sure whether this happens for every subgroup that is not the first one of its subgroup type. It needs more investigation.
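The symptom can be modeled roughly as follows. This is only a minimal sketch under stated assumptions, not Derecho's actual code: the helper names min_stable and can_deliver are hypothetical, and it assumes that sequence numbers start at 0, that -1 means "nothing stable yet", and that delivery is guarded by a comparison of the least undelivered sequence number against the minimum stable number across shard members.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified model of the delivery guard; NOT Derecho's code.
// Assumption: sequence numbers start at 0 and -1 means "nothing stable yet".
using seq_t = int32_t;

// Minimum of the stable sequence numbers reported by the shard members.
// Returns -1 when there are no members to report.
seq_t min_stable(const std::vector<seq_t>& stable_nums) {
    if(stable_nums.empty()) {
        return -1;
    }
    return *std::min_element(stable_nums.begin(), stable_nums.end());
}

// Returns true if the oldest locally stable message may be delivered.
// The map is keyed by sequence number; the mapped value stands in for a message.
bool can_deliver(const std::map<seq_t, int>& locally_stable_messages,
                 const std::vector<seq_t>& stable_nums) {
    if(locally_stable_messages.empty()) {
        return false;
    }
    const seq_t least_undelivered = locally_stable_messages.begin()->first;
    // If even one member still reports -1, a message with sequence number 0
    // fails this check, and it keeps failing until that value advances.
    return least_undelivered <= min_stable(stable_nums);
}
```

In this model, a pending message with sequence number 0 is held back for as long as any member's stable number stays at -1, which matches the observed hang. It does not explain why the value stays at -1 for the second subgroup of a type.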
I've talked to Sagar about this since he knows this part of the code best. He will investigate it later. I can find a workaround for my tests, but it would be great to fix this properly.