Hi,
I added some code to the training.py file to inject a fault, as shown below, and enabled the `--inprocess-restart` config.
```python
# Injected fault for testing --inprocess-restart: on rank 2, at iteration 5,
# trigger a GPU_SLEEP fault via nvidia_resiliency_ext's fault-injection tool.
if iteration == 5 and torch.distributed.get_rank() == 2:
    from nvidia_resiliency_ext.inprocess.tools.inject_fault import inject_fault, Fault
    inject_fault(Fault.GPU_SLEEP, 4, 1, 30, 1)
```
I saw that when my training job reaches the 5th iteration, the in-process restart makes the training run from the first iteration again, without freeing the GPU memory.
So my questions are:
- Did I use in-process restart correctly?
- Why does the in-process restart not release the previously allocated GPU memory? If the training job then hits an out-of-memory error, it falls into an infinite restart loop (a memory-logging sketch follows this list).
- If the training job restarts from the first iteration without saving a checkpoint, doesn't this waste the previous training progress?
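A minimal sketch of how the memory behaviour could be checked, using only standard PyTorch memory stats. The helper name and its placement at the top of the training loop are assumptions for illustration, not part of the original training.py change:

```python
import torch

def log_gpu_memory(iteration: int) -> None:
    # Assumed to be called at the top of each training iteration to see
    # whether allocations survive the in-process restart and keep growing.
    rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30  # tensors currently allocated (GiB)
    reserved = torch.cuda.memory_reserved() / 2**30    # memory held by the caching allocator (GiB)
    print(f"[rank {rank}] iter {iteration}: "
          f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```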