
[BUG] deepspeed gets re-initialized 4x, causing CPU RAM to blow up and OOM using c10d/flyte #7127

Open

Description

@j93hahn

Hi, I'm using deepspeed (version 0.15.1) with torch==2.3.1 to train models.

When using deepspeed, the expected behavior is to call deepspeed.init_distributed, which prints:

[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 20:33:41,161] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)

on one node, i.e. once per GPU process. This block should only be printed once per process!
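For context, here is roughly the call pattern I'm describing, as a minimal sketch (the toy model, LOCAL_RANK handling, and the inline ds_config are simplified stand-ins for our actual training entrypoint):

```python
import os

import torch
import deepspeed

# Toy model just to keep the sketch self-contained; the real training
# code builds a much larger model.
model = torch.nn.Linear(8, 8)

# Each rank calls this exactly once at startup. Under the hood it sets up
# torch.distributed (NCCL here), and the accelerator log above is emitted
# around this point -- once per process, not 4-5 times.
deepspeed.init_distributed(dist_backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Hypothetical minimal DeepSpeed config, only to show where
# deepspeed.initialize sits relative to init_distributed.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```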

Now, when I use flyte training (which uses PyTorch's c10d rendezvous), I see a very strange phenomenon: deepspeed re-initializes an extra 4 times right before the main training loop, printing the accelerator log above 32 additional times per node (4 additional times per GPU). This causes CPU RAM utilization to balloon to over 1.5 TB, at which point all of my pods get OOM-killed.

I only know this happens on multi-node runs. I haven't tried single-node runs yet, so I don't know whether it reproduces there, but that isn't a high priority since almost all of my runs need to be multi-node.

Interestingly, I previously always used etcd and never saw this phenomenon. So there is clearly a code path that triggers when using c10d, but not when using etcd. What could be the problem?

In the logs, I can see that deepspeed starts re-initializing across all the nodes right around our very first torch.cuda.synchronize() call. Should I ever call this function when using deepspeed? I use dist.barrier() frequently, which is absolutely necessary, but I don't think we need to call synchronize ourselves (I believe deepspeed handles that logic internally). We do use CUDA timers to track useful metrics for several parts of our code (data loading, inference, etc.), and those require a synchronization step on the current stream.
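For reference, the timing pattern we use looks roughly like this (a minimal sketch; time_section and the section name are hypothetical stand-ins for our metrics code):

```python
import torch
import torch.distributed as dist

def time_section(fn, name="data_loading"):
    """Time a section of work on the current CUDA stream using events.

    end.synchronize() only blocks on this event, whereas
    torch.cuda.synchronize() -- which is where the re-init storm seems
    to start in our logs -- blocks on the whole device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    out = fn()
    end.record()

    end.synchronize()                      # wait only for this event
    elapsed_ms = start.elapsed_time(end)   # milliseconds between events
    return out, elapsed_ms

# Typical usage around the training loop: barriers for correctness,
# event-based timers for metrics.
if dist.is_initialized():
    dist.barrier()
```

Event-based timing only blocks on the recorded events rather than the whole device, which is why I assumed a full torch.cuda.synchronize() shouldn't be strictly necessary just for the metrics.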

Any help here would be greatly appreciated!

For reference on the hardware: I'm using multiple nodes of NVIDIA H200 GPUs with CUDA 12.8 installed.



Labels: bug, training
