Using instance barrier inside fserver.init #12

Merged
niehao100 merged 3 commits into stepfun-ai:main from fengidri:main
Aug 7, 2025

Conversation

@fengidri (Contributor) commented Aug 6, 2025

In my test there are two machines, one worker and one server, with a group
size of 8. A deadlock always occurs on worker instance 0. This happens
because during fserver init both the worker and the server execute a
barrier, but this barrier is not at the instance level. Therefore, in my
scenario, as soon as any two instances (either server or worker) reach the
barrier, those two instances can proceed. I think this behavior is not
reasonable. The cause of the deadlock lies in
tests/benchmark/bmk_comm_latency_multiserver.py, where all workers also
call f.barrier(True, True), which may lead to the following situation.

* worker1-barrier-req
* server1-barrier-req
* worker1-barrier-done
* server1-barrier-done

* worker2-barrier-req
* server2-barrier-req
* worker2-barrier-done
* server2-barrier-done

* worker3-barrier-req
* server3-barrier-req
* worker3-barrier-done
* server3-barrier-done

* worker4-barrier-req
* server4-barrier-req
* worker4-barrier-done
* server4-barrier-done

* worker5-barrier-req
* server5-barrier-req
* worker5-barrier-done
* server5-barrier-done

* worker6-barrier-req
* server6-barrier-req
* worker6-barrier-done
* server6-barrier-done

* worker7-barrier-req
* server7-barrier-req
* worker7-barrier-done
* server7-barrier-done

* server0-barrier-req

worker0 has not sent its barrier req yet, so server0 is still waiting.

Because workers 1-7 have completed the barrier inside fserver.init, they
move on to the f.barrier(True, True) call inside bmk_comm_latency_multiserver.py.

* worker1-barrier-req
* worker2-barrier-req
* worker3-barrier-req
* worker4-barrier-req
* worker5-barrier-req
* worker6-barrier-req
* worker7-barrier-req

So there are now 8 barrier requests spanning the two groups, and all of them
satisfy the barrier condition.

* server0-barrier-done
* worker1-barrier-done
* worker2-barrier-done
* worker3-barrier-done
* worker4-barrier-done
* worker5-barrier-done
* worker6-barrier-done
* worker7-barrier-done

Only now does worker0 send its barrier req from fserver.init(), so worker0
hangs.

* worker0-barrier-req (wait.......)

The immediate trigger is the barrier call in the benchmark, which conflicts
with the existing barrier. However, I believe the barrier in fserver.init
should be at the instance level rather than at the group level, as this
would be safer and more reliable. Therefore, this patch changes the barrier
in fserver.init to be instance-level.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Here we are calling the barrier in the worker's context, so we should not
set include_server to true, because only workers will execute this code.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
@niehao100 (Collaborator) commented Aug 7, 2025

Hi, thanks for your testing.

In our tests, a global barrier is required for the multi A/F test before data transfer. We do not use an instance barrier because there are intra-node collective communications in both A and F.

It seems the barrier in this PR is instance barrier -> worker group barrier, which is not equal to a global barrier. If the global barrier runs into network problems (like a rail-optimized design), running an intra-node barrier on every node -> instance barrier should work.

We can't reproduce this issue on our testbed, but it seems the barrier call on the server side is missing.
Adding another f.barrier(True, True) to the server side at bmk_comm_latency_multiserver.py:205 may help.
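For reference, a hedged sketch of where that call might go; the surrounding server-side code path and the f handle are assumed from the benchmark, not copied from it:

    # Hypothetical placement around bmk_comm_latency_multiserver.py:205 on the
    # server-side path, mirroring the workers' call so both sides join the same
    # global barrier before the latency measurement starts.
    f.barrier(True, True)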

cc @zhouyu-sunny

@fengidri (Contributor, Author) commented Aug 7, 2025

What do you mean by "global barrier"? I feel that the current barrier functionality in StepMesh will have many issues in practice if it does not rely on external collective communication operations. Adding an f.barrier in bmk_comm_latency_multiserver.py might indeed solve the problem. However, if a user simply wants to synchronize workers, that requirement is perfectly reasonable. My test only runs this script, which does not involve any collective communication. I'm not sure why, but in my environment worker 0 is slower than the others when executing SetDevice in init(). You could try adding a sleep specifically for worker 0 there to reproduce this, as in the sketch below.
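For example, something along these lines could be dropped into the worker init path to widen the window; the is_worker and instance_id names are assumptions for illustration, not StepMesh APIs:

    import time

    # Hypothetical reproduction aid: delay only worker instance 0 so the other
    # instances reach the fserver.init barrier (and then the benchmark barrier)
    # first, which is what strands worker0's later barrier request.
    if is_worker and instance_id == 0:
        time.sleep(5)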

@niehao100 (Collaborator) commented Aug 7, 2025

The global barrier means a barrier among all servers and workers.
I misunderstood the usage of instance_barrier.
Thanks for your insight, I will test the sleep case.

@niehao100 (Collaborator)

@fengidri Thanks for your test, I've reproduced this bug.

LGTM cc @zhouyu-sunny

@niehao100 niehao100 merged commit bea733d into stepfun-ai:main Aug 7, 2025
1 check passed