
[QUESTION] inprocess restart #2774

@tendar

Description


Hi,
I added the code below to training.py to inject a fault, and enabled the --inprocess-restart flag.

```python
if iteration == 5 and torch.distributed.get_rank() == 2:
    # inject a GPU sleep fault on rank 2 at iteration 5
    from nvidia_resiliency_ext.inprocess.tools.inject_fault import inject_fault, Fault
    inject_fault(Fault.GPU_SLEEP, 4, 1, 30, 1)
```

I saw that when the training job reached the 5th iteration, the in-process restart made it run from the first iteration again, without freeing the GPU memory.
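
For context, here is roughly how I understand the wrapper is meant to be used (a minimal sketch based on my reading of the nvidia_resiliency_ext docs; the exact Wrapper kwargs and any arguments it injects into the wrapped function are omitted here, so this is not a complete setup):

```python
import nvidia_resiliency_ext.inprocess as inprocess

# Conceptual sketch: inprocess.Wrapper re-invokes the wrapped function
# in the same OS process when a fault is detected on any rank, instead
# of tearing the whole job down.
@inprocess.Wrapper()
def train(*args, **kwargs):
    # initialize torch.distributed, build the model/optimizer,
    # and run the training loop
    ...
```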
So my questions are:

  1. Did I use in-process restart correctly?
  2. Why does in-process restart not release the previously allocated GPU memory? If the training job hits an out-of-memory error, it will fall into an infinite restart loop.
  3. If the training job restarts from the first iteration without a saved checkpoint, doesn't this waste the previous training progress?
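
For question 2, what I expected was that each restart would drop the old references and return cached GPU memory before re-entering the training function, roughly like the sketch below (generic PyTorch calls only; release_gpu_memory is a hypothetical helper of mine, not an nvidia_resiliency_ext API):

```python
import gc

import torch

def release_gpu_memory(holders: list) -> None:
    """Hypothetical cleanup I expected between restarts.

    `holders` carries the only references to the model, optimizer,
    and other CUDA-tensor owners from the failed attempt.
    """
    holders.clear()           # drop Python references to CUDA tensors
    gc.collect()              # collect them so the caching allocator can free the blocks
    torch.cuda.empty_cache()  # return the cached blocks to the CUDA driver
```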
