Why is SPMD pre-training of Llama 2 so slow in multi-machine training? #6778

SPMD training runs at normal speed with eight GPUs on a single machine, but the communication overhead increases sharply when training across multiple machines.

Devices:
GPU: A100 * 8 * 2 (two machines with eight A100 GPUs each)

The SPMD sharding strategy is:
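```python
import torch_xla.distributed.spmd as xs  # older releases: torch_xla.experimental.xla_sharding

# Shard dimension 0 of every parameter across all devices; the remaining
# mesh axes have size 1, so the other dimensions are not split.
for name, param in model.named_parameters():
    shape = (num_devices,) + (1,) * (len(param.shape) - 1)
    mesh = xs.Mesh(device_ids, shape)
    xs.mark_sharding(param, mesh, tuple(range(len(param.shape))))
```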
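
To make the strategy concrete, here is a rough pure-Python sketch (no torch_xla required) of what the loop above computes for each parameter: the mesh shape, the partition spec, and the resulting per-device shard shape. The helper name `describe_sharding` and the example shape `(32000, 4096)` are hypothetical and only for illustration, and ceil division is assumed for sizes that do not divide evenly.

```python
def describe_sharding(param_shape, num_devices):
    """Mirror the mesh/partition-spec arithmetic of the sharding loop.

    Hypothetical helper for illustration only; it does not call torch_xla.
    """
    rank = len(param_shape)
    # All devices sit on mesh axis 0; every other mesh axis has size 1.
    mesh_shape = (num_devices,) + (1,) * (rank - 1)
    # Tensor dimension i is mapped to mesh axis i.
    partition_spec = tuple(range(rank))
    # Per-device shard size along each dimension (ceil division is an
    # assumption for sizes that are not evenly divisible).
    shard_shape = tuple(-(-dim // axis) for dim, axis in zip(param_shape, mesh_shape))
    return mesh_shape, partition_spec, shard_shape


# Example: a 2-D weight of shape (32000, 4096) on 16 devices (8 GPUs * 2 machines).
mesh_shape, partition_spec, shard_shape = describe_sharding((32000, 4096), 16)
print(mesh_shape, partition_spec, shard_shape)
```

With 16 devices, only the first dimension is split, so every parameter is sharded across all GPUs on both machines, and each parameter must be gathered across the machine boundary whenever it is needed in full.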
The profile result is:

![image](https://github.com/pytorch/xla/assets/62137145/6cba5403-e5ae-44ba-9554-acfa922a2549)

