Using instance barrier inside fserver.init #12
In my test, there are two machines, so one worker and one server, with a group size of 8. There is always a deadlock on worker instance 0. It happens because during fserver init, both the worker and the server execute a barrier, but this barrier is not at the instance level: as soon as any two instances (either server or worker) reach the barrier, those two instances can proceed. I think this behavior is not reasonable. The deadlock is triggered by tests/benchmark/bmk_comm_latency_multiserver.py, where all workers also call f.barrier(True, True), which may lead to the following interleaving:

* worker1-barrier-req
* server1-barrier-req
* worker1-barrier-done
* server1-barrier-done
* worker2-barrier-req
* server2-barrier-req
* worker2-barrier-done
* server2-barrier-done
* worker3-barrier-req
* server3-barrier-req
* worker3-barrier-done
* server3-barrier-done
* worker4-barrier-req
* server4-barrier-req
* worker4-barrier-done
* server4-barrier-done
* worker5-barrier-req
* server5-barrier-req
* worker5-barrier-done
* server5-barrier-done
* worker6-barrier-req
* server6-barrier-req
* worker6-barrier-done
* server6-barrier-done
* worker7-barrier-req
* server7-barrier-req
* worker7-barrier-done
* server7-barrier-done
* server0-barrier-req

worker0 has not sent its barrier req yet, so server0 is waiting. Because workers 1-7 completed the barrier inside fserver.init, they move on to the f.barrier(True, True) inside bmk_comm_latency_multiserver.py:

* worker1-barrier-req
* worker2-barrier-req
* worker3-barrier-req
* worker4-barrier-req
* worker5-barrier-req
* worker6-barrier-req
* worker7-barrier-req

So there are 8 barrier requests across the two groups, and all of them meet the barrier condition.
* server0-barrier-done
* worker1-barrier-done
* worker2-barrier-done
* worker3-barrier-done
* worker4-barrier-done
* worker5-barrier-done
* worker6-barrier-done
* worker7-barrier-done

Now worker0 sends its barrier req from fserver.init(), so worker0 hangs:

* worker0-barrier-req (wait.......)

The issue arises from the barrier call in the benchmark, which conflicts with the existing barrier. However, I believe the barrier in fserver.init should be at the instance level rather than the group level, as that would be safer and more reliable. Therefore, this patch changes the barrier in fserver.init to be instance-level.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
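The interleaving above can be reproduced with a toy model (not stepmesh code) of a barrier that releases any N arrivals together without checking which instance they belong to; all names and the release threshold here are illustrative assumptions:

```python
from collections import deque

# Toy group-level barrier: releases any BARRIER_N arrivals together,
# regardless of whether they are init-barrier or benchmark-barrier calls.
BARRIER_N = 2

def simulate(events):
    waiting, groups = deque(), []
    for who in events:
        waiting.append(who)
        if len(waiting) == BARRIER_N:
            groups.append([waiting.popleft() for _ in range(BARRIER_N)])
    return groups, list(waiting)

# Interleaving from the report: workers/servers 1-7 pair up in
# fserver.init, server0 arrives late, workers 1-7 re-enter via the
# benchmark's f.barrier(True, True), and finally worker0 shows up.
events = [x for i in range(1, 8) for x in (f"worker{i}", f"server{i}")]
events += ["server0"] + [f"worker{i}" for i in range(1, 8)] + ["worker0"]

groups, stuck = simulate(events)
print(groups[7])  # ['server0', 'worker1'] -- init barrier paired with a benchmark barrier
print(stuck)      # ['worker0'] -- nobody left to pair with: deadlock
```

Because the barrier cannot tell an init-barrier request from a benchmark-barrier request, server0 gets paired with worker1's second call, and worker0's init request can never complete.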
Here we are calling the barrier in the worker's context, so we should not set include_server to true, because only workers execute this code.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Hi, thanks for your testing. In our test, a global barrier is required for the multi A/F test before data transfer. We do not use an instance barrier because there are intra-node collective communications in both A and F. It seems the barrier in this PR is instance-barrier -> worker group barrier, which is not equal to a global barrier. If the global barrier runs into network problems (like a rail-optimized design), running an inner-node barrier for every node and then an instance barrier should work. We can't reproduce this issue in our testbed, but it seems the barrier code on the server side is missing.
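The "inner-node barrier for every node, then instance barrier" composition mentioned above can be sketched with Python's standard `threading.Barrier`; the structure (one local barrier per node, one leader per node joining a cross-node barrier) is only an illustration of the idea, not stepmesh's actual implementation:

```python
import threading

def hierarchical_barrier(nodes, ranks_per_node, done):
    """Two-level barrier: every rank syncs inside its node first, then
    one leader per node joins a cross-node barrier, and the leaders
    fan the release back out through the local barrier."""
    local = [threading.Barrier(ranks_per_node) for _ in range(nodes)]
    inter = threading.Barrier(nodes)

    def rank(node, r):
        local[node].wait()       # intra-node sync
        if r == 0:
            inter.wait()         # one leader per node crosses nodes
        local[node].wait()       # everyone released once leaders return
        done.append((node, r))   # list.append is atomic in CPython

    threads = [threading.Thread(target=rank, args=(n, r))
               for n in range(nodes) for r in range(ranks_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

done = []
hierarchical_barrier(nodes=2, ranks_per_node=8, done=done)
print(len(done))  # 16 -- all ranks on both nodes released together
```

No rank can pass the second local barrier until its node's leader has returned from the cross-node barrier, so every rank observes that all nodes reached the barrier.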
What do you mean by "global barrier"? I feel that the current barrier functionality in stepmesh, if it does not rely on external collective communication operations, will have many issues in use. Adding an f.barrier in bmk_comm_latency_multiserver.py might indeed solve the problem. However, if a user simply wants to synchronize workers, that requirement is perfectly reasonable. My test only runs this script, which does not involve any collective communication. I'm not sure why, but in my environment worker 0 is slower than the others when executing SetDevice in init(). You could try adding a sleep specifically for worker 0 there to test this.
The global barrier means a barrier among all servers and workers.
…Add barrier in server side in bmk
@fengidri Thanks for your test, I've reproduced this bug. LGTM cc @zhouyu-sunny