torch.distributed.elastic.multiprocessing.errors.ChildFailedError #1710
Comments
@SagarChandra07 You are asking me to download some random file without any context. I really don't understand your stance here. I request the moderators to delete the comment.
@Vattikondadheeraj yeah, that error message is not too informative; even with distributed training I am used to seeing more. I guess you are getting a SIGKILL signal, but that can be caused by multiple things. As a debugging step, I would suggest checking whether this succeeds if you run on a single device (you can run with batch size 1 to hopefully avoid any OOMs). You can also log the memory stats or try inserting memory-reporting calls around the suspect code. By the way, sorry about the spam comment, I've deleted it.
@ebsmothers, I tried a few experiments as you mentioned. The first one is that I added logging scripts, roughly as sketched below.
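A minimal sketch of this kind of memory logging (the helper name and exact placement are illustrative, not the original snippet from the comment):

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # Report allocated and reserved CUDA memory (in GiB) on the current device.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```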
After adding them, it ran for one epoch and gave the above error.
When I added different print statements, I found out that it is failing at the optimizer.step() call (roughly instrumented as in the sketch below). Any tips for this error?
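An illustrative way to bracket the step, reusing the helper above (the surrounding variable names are assumptions, not torchtune's actual recipe code):

```python
# Assumed context: inside the training loop, after loss.backward().
log_cuda_memory("before optimizer.step")
optimizer.step()
log_cuda_memory("after optimizer.step")
optimizer.zero_grad()
```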
Hey @ebsmothers, I have pinpointed the error in more detail. I have attached the torch AdamW file along with the log file below.
As a shot in the dark, can you set dataset.packed=True and tokenizer.max_seq_len=4096? This will force the batches to that maximum size. If it breaks on the very first one, you know it's an OOM problem. Then lower max_seq_len until it runs without OOMing, and check if it breaks again.
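As a sketch, those overrides can be passed straight to the recipe on the command line (the config name and GPU count below are placeholders for your own setup):

```bash
tune run --nnodes 1 --nproc_per_node 2 full_finetune_distributed \
  --config <your_config> \
  dataset.packed=True tokenizer.max_seq_len=4096
```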
@felipemello1, I am curious whether adding dataset.packed=True will solve the main problem of the multiprocessing failure. As I said, the process is failing at the optimizer.step() line; when I add torch.distributed.breakpoint() and step through it manually, it works fine, but the problem is I need to press "n" every time. And second, how long does packing the dataset take? For 400 datapoints with tokenizer.max_seq_len=5000, my code was running for more than 30 minutes and the packing was still happening. FYI, I changed the packed value to True everywhere for now. Am I stuck in some kind of loop? The packing is not ending despite waiting for a long time. I have attached a small snippet of my code below.
@Vattikondadheeraj if nothing else seems to help, you could try installing from source and repeating the run using my branch. Edit: the branch was merged here, so just install from source if you want to try it :).
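For reference, a from-source install typically looks something like this:

```bash
git clone https://github.com/pytorch/torchtune.git
cd torchtune
pip install -e .
```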
@Vattikondadheeraj agreed, I don't think packing will solve this problem (or at least there's no obvious evidence that it will). I'm not sure why you're stuck in a loop; at least on a dataset with 50k samples, packing usually takes 1-2 minutes. Anyway, I would recommend following @mirceamironenco's suggestion, since it will hopefully help you pinpoint an exact error.
Hey @ebsmothers, a small update from my side. As soon as I changed the optimizer from AdamW to Adam, training ran fine without breaking. I was unable to pinpoint the error, but even a clean version of torchtune (i.e., cloning the repo and running full_finetuning_distributed.py) gave the same error with AdamW, while it worked fine with Adam. So I think there might be a bug in the torch optimizers. Let me know if I am missing something.
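For anyone trying to reproduce the comparison, the switch amounts to changing the optimizer component in the config (a sketch in torchtune's config style; the lr value is a placeholder):

```yaml
optimizer:
  _component_: torch.optim.Adam   # previously: torch.optim.AdamW
  lr: 2e-5
```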
Context: I am trying to run distributed training on 2 A100 GPUs with 40 GB of VRAM each. The batch size is 3 and gradient accumulation is 1. I have attached the config file below for more details, as well as the error. The thing is, I am not able to pinpoint the problem here because the error message itself is unclear. Is this because of a CUDA memory issue? Or am I missing something else?
Config file: