Device failure w/ transformer training using DDP/multiple GPU devices #20519
jacksettles asked this question in DDP / multi-GPU / multi-node
I am working on a project training a transformer language model of about 1.2B parameters on a small-ish dataset of about 80M words. That is small by industry standards, but big for this project at least. The issue I keep running into is device-failure related: in plain PyTorch DDP, one device would always get ahead of the others, causing a hang, and the job would fail. Now I am using PyTorch Lightning, but the same issue seems to be happening.
I was able to train this model on about 30M words with one GPU, but it didn't get very far (maybe 10-15 epochs). Since I am now using more data, I want to use multiple GPUs to speed up training. I have tried plain PyTorch with DDP, and now I am on PyTorch Lightning with ddp as my strategy. I have access to 4 NVIDIA A100 GPUs on my school's computing cluster, and I am submitting the jobs with SLURM. I am trying to figure out the right setup, because as of now the job keeps failing after a few hours. It looks like a memory issue, but I know the GPUs' memory is not fully being used.
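For reference, here is a minimal sketch of how I am launching the Trainer (the module below is just a stand-in for my actual 1.2B-parameter model, and the epoch count is illustrative):

```python
import lightning as L
import torch
from torch import nn


class ToyLM(L.LightningModule):
    """Placeholder standing in for my real 1.2B-parameter transformer LM."""

    def __init__(self):
        super().__init__()
        self.model = nn.Linear(512, 512)  # not the real architecture

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


trainer = L.Trainer(
    accelerator="gpu",
    devices=4,          # one process per A100; matches --ntasks-per-node=4 in my sbatch script
    num_nodes=1,
    strategy="ddp",
    max_epochs=50,      # illustrative
)
# trainer.fit(ToyLM(), train_dataloaders=train_loader)  # dataloader shown below
```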
I am currently using a batch size of 16 sentences, but I wonder if that is too small. When I run nvidia-smi, I can see that the GPUs are not fully utilized in terms of memory, so I wonder whether increasing the batch size might help by reducing I/O overhead from having fewer batches. I also wonder whether using more workers in my dataloader might help. The maximum number of CPUs I can request for a one-node job is 64, so with 4 GPUs I can only have 16 CPUs per task (i.e. per GPU device), and that didn't seem to work. I also tried 2 GPUs with 32 CPUs per task and gave 28 workers to the dataloader, but it still died on me. Roughly what my dataloader looks like is sketched below.
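Here is roughly how I am building the dataloader; the dataset tensors and exact worker count below are placeholders for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for my tokenized sentences.
train_dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))

train_loader = DataLoader(
    train_dataset,
    batch_size=16,            # my current setting; larger (32/64) might improve utilization
    shuffle=True,
    num_workers=8,            # per process/GPU; kept below --cpus-per-task
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive across epochs instead of re-spawning them
)
```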
I can't seem to figure out the right configuration. Is my model size plus data size too big for a one-node job? Are there any hyperparameters or special args I don't know about that might help in this scenario? Any help is greatly appreciated!!