v0.4.0
Release Notes
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.
NVIDIA Resiliency Extension v0.4.0
Highlights
-
Checkpointing
- PR 29 - Support for storing checkpoints to cloud object stores
- Leverage cloud storage provider’s multithreaded SDK for rapid loading and saving checkpoints to object stores such as AWS S3, Azure Blob
Storage, Google Cloud Storage and more using NVIDIA Multi-storage Client. - Provide scalable, reliable, cheaper, single source of truth across clouds/regions
- Provide opt-out configuration when creating FileSystemWriterAsync class instance to allow users to passthrough to the filesystem
- Leverage cloud storage provider’s multithreaded SDK for rapid loading and saving checkpoints to object stores such as AWS S3, Azure Blob
- PR 36 - Critical bug fix to enable async checkpoint loading without errors
- PR 29 - Support for storing checkpoints to cloud object stores
-
In-process & In-job restart
- PR 35 - Nested restarter updates for in-process restart to align with in-job
restart, so users have a consistent experience across in-process and in-job restarts - Updates to in-process nested restart functionality provided by Python Wrapper class and existing callback infrastructure with additional
callbacks and logging
- PR 35 - Nested restarter updates for in-process restart to align with in-job
Known Issues & Limitations
- Dependencies: