Ordered send does not work in some cases #219

@songweijia

In one of the Cascade evaluation experiments, I need to issue an ordered send that asks all nodes in a shard to dump their timestamp logs. This mechanism works properly in most cases. But in the special case where the containing subgroup is the second subgroup of the same type, the ordered send never finishes. Tracing the execution with gdb, I found that MulticastGroup::delivery_trigger() does not deliver the message in the predicate thread on the sender node. For the very first ordered send, testing shows that locally_stable_sst_messages[subgroup_num] is not empty and that the first sequence number read into least_undelivered_sst_seq_num is 0. However, the min_stable_num variable unexpectedly gets the value -1, so the code guarded by the condition on min_stable_num is skipped and the message never gets delivered.
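To make the symptom concrete, here is a minimal sketch of the kind of guard I mean. It is not the actual Derecho code: try_deliver, its parameters, and the exact comparison are assumptions made for illustration, and it simply models one member's stability counter being stuck at -1 without explaining why that happens.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-subgroup delivery check (NOT Derecho's code).
// locally_stable: pending messages keyed by sequence number.
// stable_num_per_member: highest stable sequence number reported by each
// member, with -1 meaning "nothing stable yet" (an assumption of this sketch).
// Returns the number of messages delivered.
int try_deliver(std::map<int32_t, std::string>& locally_stable,
                const std::vector<int32_t>& stable_num_per_member) {
    if(locally_stable.empty() || stable_num_per_member.empty()) {
        return 0;
    }
    // Smallest pending sequence number; 0 for the very first ordered send.
    const int32_t least_undelivered_seq_num = locally_stable.begin()->first;
    // Minimum over all members: a single -1 drags the result down to -1.
    const int32_t min_stable_num = *std::min_element(
            stable_num_per_member.begin(), stable_num_per_member.end());

    int delivered = 0;
    // Assumed guard: deliver only what every member reports as stable.
    // With least_undelivered_seq_num == 0 and min_stable_num == -1,
    // this block is skipped and the message stays pending.
    if(least_undelivered_seq_num <= min_stable_num) {
        while(!locally_stable.empty()
              && locally_stable.begin()->first <= min_stable_num) {
            locally_stable.erase(locally_stable.begin());
            ++delivered;
        }
    }
    return delivered;
}
```

In this sketch, calling try_deliver with a single pending message at sequence number 0 and member counters {-1, 0, 0} delivers nothing and leaves the message pending, while counters {0, 0, 0} deliver it. That matches what I see in gdb, assuming the real guard compares the two values in a similar way.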

I'm not sure whether this always happens for subgroups that are not the first one of their subgroup type. It needs more investigation.

I've talked to Sagar about this since he knows this part of the code best. He will investigate it later. I can find a workaround for my tests, but it would be great to fix this properly.
