Release Notes
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.
NVIDIA Resiliency Extension v0.4.1
Highlights
This hotfix release includes important bug fixes, performance improvements, and minor updates to enhance stability.
Checkpointing
- PR 104, PR 106, PR 108, PR 111, and PR 116 fix the asynchronous checkpointing module by switching from the temporal worker to the persistent worker, which uses spawn instead of fork.
- This fix is an intermediate milestone toward deprecating fork in favor of spawn for asynchronous checkpointing. The complete transition to spawn still has the following dependencies on fork, which will be eliminated in an upcoming release:
  - Local checkpointing must continue to use the fork-based asynchronous checkpointing, as clarified in the usage guide.
  - File IO operations with multiprocessing can still trigger a fork.
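To illustrate the general idea of a persistent worker started with spawn, here is a minimal sketch built only on the Python standard library. It is not the extension's implementation or API: the names worker_loop and PersistentSaver, the pickle-based file format, and the queue protocol are all assumptions made for illustration.

```python
import multiprocessing as mp
import os
import pickle
import queue


def worker_loop(requests, results):
    """Serve save requests until a None sentinel arrives (runs in the worker)."""
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        path, state = item
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, path)  # atomic rename: readers never see a partial file
        results.put(path)


class PersistentSaver:
    """Hypothetical helper: one long-lived worker reused for every save."""

    def __init__(self):
        # "spawn" starts a fresh interpreter instead of copying the parent via fork.
        ctx = mp.get_context("spawn")
        self._requests = ctx.Queue()
        self._results = ctx.Queue()
        self._proc = ctx.Process(
            target=worker_loop, args=(self._requests, self._results)
        )
        self._proc.start()

    def save_async(self, path, state):
        """Queue a save and return immediately; state must be picklable."""
        self._requests.put((path, state))

    def wait(self, timeout=30.0):
        """Return the next finished path, or None if nothing finished in time."""
        try:
            return self._results.get(timeout=timeout)
        except queue.Empty:
            return None

    def close(self, timeout=30.0):
        """Ask the worker to exit and wait for it."""
        self._requests.put(None)
        self._proc.join(timeout)
```

Because spawn re-imports the main module in the child, a script using this sketch should create PersistentSaver under an `if __name__ == "__main__":` guard. The worker is started once and reused, so the process start-up cost is not paid on every checkpoint.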
In-process restart
- PR 103 fixes a case where extra CUDA contexts were created on local rank 0 after a restart, consuming additional GPU memory on that rank.
- PR 112 fixes workload state leaks across the restart boundary. The fix addresses a case where objects created in the wrapped function could not be garbage collected after a restart, which manifested as a memory leak.
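As a general Python illustration of how state can leak across a restart boundary, the sketch below shows one common mechanism: a stored exception keeps its traceback alive, the traceback keeps the failed function's frame alive, and the frame keeps its local objects alive. This is not necessarily the cause addressed by PR 112, and wrapped_fn, Workload, and run_with_restarts are hypothetical names used only for this example.

```python
import gc
import weakref


class Workload:
    """Stand-in for state created inside a wrapped training function."""


def wrapped_fn(refs):
    state = Workload()
    refs.append(weakref.ref(state))  # a weak reference does not keep state alive
    raise RuntimeError("simulated fault")


def run_with_restarts(restarts, keep_exceptions):
    """Return, per restart, whether that attempt's state was garbage collected."""
    refs, kept = [], []
    for _ in range(restarts):
        try:
            wrapped_fn(refs)
        except RuntimeError as exc:
            if keep_exceptions:
                # exc -> traceback -> wrapped_fn frame -> state stays reachable
                kept.append(exc)
        gc.collect()
    return [ref() is None for ref in refs]


print(run_with_restarts(2, keep_exceptions=False))
print(run_with_restarts(2, keep_exceptions=True))
```

A weakref check of this kind can be used in a test to confirm that objects from a previous attempt are released before the next attempt begins.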
Known Issues & Limitations
- The persistent process is not yet terminated automatically when the main process terminates. A future release will add this behavior.
- Until then, job schedulers must ensure proper termination of the persistent process and its child workers for a graceful shutdown.
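One way a launcher or scheduler wrapper might handle this on POSIX systems is to start the job in its own process group and signal the whole group at shutdown. The sketch below uses only the Python standard library. The helper names are hypothetical, and it assumes the persistent process and its workers stay in the process group they inherit from the job, which this document does not confirm.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session, and therefore a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace_seconds=10.0):
    """Send SIGTERM to the whole group, then SIGKILL if the leader has not exited."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # group id equals the leader's pid
    except ProcessLookupError:
        return proc.wait()  # the group is already gone
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    job = launch_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", terminate_group(job, grace_seconds=5.0))
```

Signaling the group rather than a single pid reaches descendants that the launcher never started directly. Descendants that move themselves into a different session or group would not be reached by this approach.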