Hi,
I added some code to the training.py file to inject a fault, as shown below, and enabled the `--inprocess-restart` config.
```python
# Injected fault for testing --inprocess-restart: on rank 2, at iteration 5,
# trigger a GPU_SLEEP fault via nvidia_resiliency_ext's fault-injection tool.
if iteration == 5 and torch.distributed.get_rank() == 2:
    from nvidia_resiliency_ext.inprocess.tools.inject_fault import inject_fault, Fault
    inject_fault(Fault.GPU_SLEEP, 4, 1, 30, 1)
```
I saw that when my training job reaches the 5th iteration, the in-process restart makes the training run from the first iteration again, without freeing the GPU memory.
So my questions are:
- Did I use in-process restart correctly?
- Why does the in-process restart not release the previously allocated GPU memory? If the training job then hits an out-of-memory error, it falls into an infinite restart loop (a memory-logging sketch follows this list).
- If the training job restarts from the first iteration without saving a checkpoint, doesn't this waste the previous training progress?
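A minimal sketch of how the memory behaviour could be checked, using only standard PyTorch memory stats. The helper name and its placement at the top of the training loop are assumptions for illustration, not part of the original training.py change:

```python
import torch

def log_gpu_memory(iteration: int) -> None:
    # Assumed to be called at the top of each training iteration to see
    # whether allocations survive the in-process restart and keep growing.
    rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30  # tensors currently allocated (GiB)
    reserved = torch.cuda.memory_reserved() / 2**30    # memory held by the caching allocator (GiB)
    print(f"[rank {rank}] iter {iteration}: "
          f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```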