Using instance barrier inside fserver.init #12

Merged
niehao100 merged 3 commits into stepfun-ai:main from fengidri:main
Aug 7, 2025

Conversation

@fengidri (Contributor) commented Aug 6, 2025

In my test there are two machines, one worker and one server, with a group
size of 8. A deadlock always occurs on worker instance 0. This happens
because during fserver init both the worker and the server execute a
barrier, but this barrier is not at the instance level. Therefore, in my
scenario, as soon as any two instances (either server or worker) reach the
barrier, those two instances can proceed. I think this behavior is not
reasonable. The cause of the deadlock lies in
tests/benchmark/bmk_comm_latency_multiserver.py, where all workers also
call f.barrier(True, True), which may lead to the following situation.

* worker1-barrier-req
* server1-barrier-req
* worker1-barrier-done
* server1-barrier-done

* worker2-barrier-req
* server2-barrier-req
* worker2-barrier-done
* server2-barrier-done

* worker3-barrier-req
* server3-barrier-req
* worker3-barrier-done
* server3-barrier-done

* worker4-barrier-req
* server4-barrier-req
* worker4-barrier-done
* server4-barrier-done

* worker5-barrier-req
* server5-barrier-req
* worker5-barrier-done
* server5-barrier-done

* worker6-barrier-req
* server6-barrier-req
* worker6-barrier-done
* server6-barrier-done

* worker7-barrier-req
* server7-barrier-req
* worker7-barrier-done
* server7-barrier-done

* server0-barrier-req

worker0 has not sent its barrier req yet, so server0 is still waiting.

Because workers 1-7 have completed the barrier inside fserver.init, they
move on to the f.barrier(True, True) call inside bmk_comm_latency_multiserver.py.

* worker1-barrier-req
* worker2-barrier-req
* worker3-barrier-req
* worker4-barrier-req
* worker5-barrier-req
* worker6-barrier-req
* worker7-barrier-req

So there are now 8 barrier requests spanning the two groups, and all of them
satisfy the barrier condition.

* server0-barrier-done
* worker1-barrier-done
* worker2-barrier-done
* worker3-barrier-done
* worker4-barrier-done
* worker5-barrier-done
* worker6-barrier-done
* worker7-barrier-done

Only now does worker0 send its barrier req from fserver.init(), so worker0
hangs.

* worker0-barrier-req (wait.......)

The immediate trigger is the barrier call in the benchmark, which conflicts
with the existing barrier. However, I believe the barrier in fserver.init
should be at the instance level rather than at the group level, as this
would be safer and more reliable. Therefore, this patch changes the barrier
in fserver.init to be instance-level.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Here we are calling the barrier in the worker's context, so we should not
set include_server to true, because only workers will execute this code.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
@niehao100 (Collaborator) commented Aug 7, 2025

Hi, thanks for your testing.

In our tests, a global barrier is required for the multi A/F test before data transfer. We do not use an instance barrier because there are intra-node collective communications in both A and F.

It seems the barrier in this PR is instance barrier -> worker group barrier, which is not equal to a global barrier. If the global barrier runs into network problems (like a rail-optimized design), running an intra-node barrier on every node -> instance barrier should work.

We can't reproduce this issue on our testbed, but it seems the barrier call on the server side is missing.
Adding another f.barrier(True, True) to the server side at bmk_comm_latency_multiserver.py:205 may help.
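For reference, a hedged sketch of where that call might go; the surrounding server-side code path and the f handle are assumed from the benchmark, not copied from it:

    # Hypothetical placement around bmk_comm_latency_multiserver.py:205 on the
    # server-side path, mirroring the workers' call so both sides join the same
    # global barrier before the latency measurement starts.
    f.barrier(True, True)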

cc @zhouyu-sunny

@fengidri (Contributor, Author) commented Aug 7, 2025

What do you mean by "global barrier"? I feel that the current barrier functionality in StepMesh will have many issues in practice if it does not rely on external collective communication operations. Adding an f.barrier in bmk_comm_latency_multiserver.py might indeed solve the problem. However, if a user simply wants to synchronize workers, that requirement is perfectly reasonable. My test only runs this script, which does not involve any collective communication. I'm not sure why, but in my environment worker 0 is slower than the others when executing SetDevice in init(). You could try adding a sleep specifically for worker 0 there to reproduce this, as in the sketch below.
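For example, something along these lines could be dropped into the worker init path to widen the window; the is_worker and instance_id names are assumptions for illustration, not StepMesh APIs:

    import time

    # Hypothetical reproduction aid: delay only worker instance 0 so the other
    # instances reach the fserver.init barrier (and then the benchmark barrier)
    # first, which is what strands worker0's later barrier request.
    if is_worker and instance_id == 0:
        time.sleep(5)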

@niehao100 (Collaborator) commented Aug 7, 2025

The global barrier means a barrier among all servers and workers.
I misunderstood the usage of instance_barrier.
Thanks for your insight, I will test the sleep case.

@niehao100 (Collaborator)

@fengidri Thanks for your test, I've reproduced this bug.

LGTM cc @zhouyu-sunny

@niehao100 niehao100 merged commit bea733d into stepfun-ai:main Aug 7, 2025
1 check passed