Support barrier in the run section

A request from our user:
> Can sky provide some lightweight alternative to `mpirun` that achieves the equivalent of “running certain command on all nodes of a cluster initiated from head node, and block until this finishes on all nodes before proceeding with the next bash command”?
> Right now I think there is only a barrier between setup and run, but further coordination within run is not supported by sky. I’m hoping that we can get rid of the openmpi dependency. One of the current usage patterns is to do something like this in the sky run section:
```
if [ "${SKY_NODE_RANK}" == "0" ]; then
  for _ in $(seq "${max_attempt}"); do
    mpirun ... kill_dangling_processes_on_all_machines
    mpirun ... build_target_on_all_machines
    mpirun ... run_target_on_all_machines
  done
fi
```

<s>
A possible solution:

```
run: |
  for _ in $(seq "${max_attempt}"); do
    # kill_dangling_processes_on_all_machines
    sky_barrier
    # build_target_on_all_machines
    sky_barrier
    # run_target_on_all_machines
    sky_barrier
  done
```

The `sky_barrier` should be a bash function with the following behavior:
1. Create an indicator file `/tmp/{task_id}-node-{SKYPILOT_NODE_RANK}-stage-{s}` on the head node of the task (the first node in `SKYPILOT_NODE_IPS`), where `s` is a counter of barrier calls within the current task.
2. Poll the head node until it holds one indicator file per node for stage `s`, i.e. `echo "$SKYPILOT_NODE_IPS" | wc -l` files, and only then continue execution.
</s>
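
For reference, the struck-through barrier idea could be sketched roughly as follows. This is a minimal illustration, not SkyPilot code: `BARRIER_TASK_ID`, `SKY_BARRIER_STAGE`, `SKY_BARRIER_LOCAL`, and the helper `_barrier_on_head` are hypothetical names, and it assumes every node can reach the head node over passwordless ssh and that the task id contains no whitespace.

```bash
# Hypothetical sketch of sky_barrier; names below are illustrative only.
# Assumes SKYPILOT_NODE_IPS (one IP per line) and SKYPILOT_NODE_RANK are set.
SKY_BARRIER_STAGE=0  # counter of barrier calls within the current task

# Run a command string on the head node (first IP in SKYPILOT_NODE_IPS).
# Setting SKY_BARRIER_LOCAL (hypothetical) runs it locally, for dry runs.
_barrier_on_head() {
  if [ -n "${SKY_BARRIER_LOCAL:-}" ]; then
    bash -c "$1"
  else
    ssh "$(echo "$SKYPILOT_NODE_IPS" | head -n 1)" "$1"
  fi
}

sky_barrier() {
  local num_nodes marker stage_glob count
  num_nodes=$(( $(echo "$SKYPILOT_NODE_IPS" | wc -l) ))
  marker="/tmp/${BARRIER_TASK_ID}-node-${SKYPILOT_NODE_RANK}-stage-${SKY_BARRIER_STAGE}"
  stage_glob="/tmp/${BARRIER_TASK_ID}-node-*-stage-${SKY_BARRIER_STAGE}"

  # Step 1: announce that this node reached the current stage.
  _barrier_on_head "touch '${marker}'"

  # Step 2: poll until the head node holds one indicator file per node.
  while true; do
    count=$(_barrier_on_head "ls ${stage_glob} 2>/dev/null | wc -l")
    if [ "$(( count ))" -ge "$num_nodes" ]; then
      break
    fi
    sleep 1
  done

  SKY_BARRIER_STAGE=$(( SKY_BARRIER_STAGE + 1 ))
}
```

Because the stage counter lives in a shell variable, `sky_barrier` has to be called in the same shell each time (not in a subshell). The sketch also never times out or cleans up the indicator files, and it carries no exit codes, which is the limitation raised in the follow-up below.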



The user further clarified that the return codes of the commands are needed:
> to further clarify - ideally there can be a command somewhat like mpirun , where you call it from the head node like:
`mpirun some_command arg1 arg2 …` and the effect is that it will run `some_command arg1 arg2 …` on all the machines and wait until it finishes on all machines, and return code 0 only if all of them exited with code 0
I think just providing a sky_barrier won’t be enough since there still needs to be some way to initiate the commands from the head node - where the decision logic of what further commands to execute depends on the return code of previous commands.

A possible solution (confirmed by the user):
```
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    for i in $(seq "${max_retry}"); do
      sky run_on_nodes command arg1 arg2 # exit code is 0 only if the command succeeded on all nodes; otherwise nonzero.
      sky run_on_nodes command arg1 arg2
    done
  fi
```
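
Since the point of the exit code is to drive the retry logic, the loop above could be wrapped in a small helper. The sketch below is only an illustration: `retry` is a hypothetical function name, and it works with any command, with the proposed `sky run_on_nodes` being one candidate.

```bash
# Hypothetical helper: run a command up to MAX_RETRY times and stop at the
# first attempt that exits with code 0.
# Usage: retry MAX_RETRY command [args...]
retry() {
  local max_retry="$1"
  shift
  local attempt
  for attempt in $(seq "$max_retry"); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_retry} failed: $*" >&2
  done
  return 1
}
```

On the head node this could be used as `retry 3 sky run_on_nodes command arg1 arg2`, assuming `run_on_nodes` returns 0 only when every node succeeded.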

Implementation details:
1. Add `run_on_nodes` to our CLI, but do not expose it in `sky --help`.
2. `run_on_nodes` should:
  a. Check the environment variable `$SKYPILOT_NODE_RANK` to make sure it is only executed on the head node.
  b. Execute the command on every node in `SKYPILOT_NODE_IPS` over ssh.
  c. Aggregate the exit codes from all nodes, returning 0 only if every node exited with 0.
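
The steps above could be prototyped in bash roughly as follows. This is a sketch under assumptions, not the actual implementation: the function names, the `RUN_ON_NODES_LOCAL` switch, and the `NODE_IP` variable are hypothetical, and it assumes passwordless ssh from the head node to every IP in `SKYPILOT_NODE_IPS`.

```bash
# Hypothetical sketch of run_on_nodes as a bash function.

# Run a command on a single node. Setting RUN_ON_NODES_LOCAL (hypothetical)
# runs it locally with NODE_IP set, which is handy for dry runs.
_run_on_node() {
  local ip="$1"
  shift
  if [ -n "${RUN_ON_NODES_LOCAL:-}" ]; then
    NODE_IP="$ip" bash -c "$*"
  else
    # Arguments are joined with spaces and re-parsed by the remote shell,
    # so arguments containing spaces need extra quoting by the caller.
    ssh -n "$ip" "$*"
  fi
}

run_on_nodes() {
  # (a) Only the head node may initiate the command.
  if [ "${SKYPILOT_NODE_RANK}" != "0" ]; then
    echo "run_on_nodes must be called from the head node" >&2
    return 1
  fi

  # (b) Start the command on every node in parallel.
  local pids="" ip pid rc=0
  for ip in $SKYPILOT_NODE_IPS; do
    _run_on_node "$ip" "$@" &
    pids="$pids $!"
  done

  # (c) Wait for all nodes; succeed only if every node exited with 0.
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

Collapsing the per-node exit codes into a single 0 or 1 matches the `mpirun`-like behavior requested above. A real implementation might also report which node failed and with which code.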


Support barrier in the run section #1762