
[BUG] deepspeed gets re-initialized 4x, causing CPU RAM to blow up and OOM using c10d/flyte #7127

Open

Description

@j93hahn

Hi, I'm using deepspeed (version 0.15.1) with torch==2.3.1 to train models.

When using deepspeed, the expected behavior is to call deepspeed.init_distributed, which prints:

[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)

on one node, i.e. once per GPU process. This block should only be printed once per process!
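For context, here is roughly the call pattern I'm describing, as a minimal sketch (the toy model, LOCAL_RANK handling, and the inline ds_config are simplified stand-ins for our actual training entrypoint):

```python
import os

import torch
import deepspeed

# Toy model just to keep the sketch self-contained; the real training
# code builds a much larger model.
model = torch.nn.Linear(8, 8)

# Each rank calls this exactly once at startup. Under the hood it sets up
# torch.distributed (NCCL here), and the accelerator log above is emitted
# around this point -- once per process, not 4-5 times.
deepspeed.init_distributed(dist_backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Hypothetical minimal DeepSpeed config, only to show where
# deepspeed.initialize sits relative to init_distributed.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```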

Now, when I use flyte training (which uses PyTorch's c10d rendezvous), I see a very strange phenomenon: deepspeed re-initializes an extra 4 times right before the main training loop, printing the accelerator log above 32 additional times per node (4 additional times per GPU). This causes CPU RAM utilization to balloon to over 1.5 TB, at which point all of my pods get OOM-killed.

I only know this happens on multi-node runs. I haven't tried single-node runs yet, so I don't know whether it reproduces there, but that isn't a high priority since almost all of my runs need to be multi-node.

Interestingly, I previously always used etcd and never saw this phenomenon. So there is clearly a code path that triggers when using c10d, but not when using etcd. What could be the problem?

In the logs, I can see that deepspeed starts re-initializing across all the nodes right around our very first torch.cuda.synchronize() call. Should I ever call this function when using deepspeed? I use dist.barrier() frequently, which is absolutely necessary, but I don't think we need to call synchronize ourselves (I believe deepspeed handles that logic internally). We do use CUDA timers to track useful metrics for several parts of our code (data loading, inference, etc.), and those require a synchronization step on the current stream.
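For reference, the timing pattern we use looks roughly like this (a minimal sketch; time_section and the section name are hypothetical stand-ins for our metrics code):

```python
import torch
import torch.distributed as dist

def time_section(fn, name="data_loading"):
    """Time a section of work on the current CUDA stream using events.

    end.synchronize() only blocks on this event, whereas
    torch.cuda.synchronize() -- which is where the re-init storm seems
    to start in our logs -- blocks on the whole device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    out = fn()
    end.record()

    end.synchronize()                      # wait only for this event
    elapsed_ms = start.elapsed_time(end)   # milliseconds between events
    return out, elapsed_ms

# Typical usage around the training loop: barriers for correctness,
# event-based timers for metrics.
if dist.is_initialized():
    dist.barrier()
```

Event-based timing only blocks on the recorded events rather than the whole device, which is why I assumed a full torch.cuda.synchronize() shouldn't be strictly necessary just for the metrics.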

Any help here would be greatly appreciated!

For reference on the hardware: I'm using multiple nodes of NVIDIA H200 GPUs with CUDA 12.8 installed.



Labels: bug, training
