Release Notes

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.4.0

Highlights

Checkpointing
- PR 29 - Support for storing checkpoints to cloud object stores
  - Leverage the cloud storage provider’s multithreaded SDK, through NVIDIA Multi-storage Client, to load and save checkpoints rapidly
    to object stores such as AWS S3, Azure Blob Storage, Google Cloud Storage and more
  - Provide a scalable, reliable, lower-cost single source of truth across clouds and regions
  - Provide an opt-out configuration when creating a FileSystemWriterAsync class instance so that users can pass through to the filesystem
- PR 36 - Critical bug fix to enable async checkpoint loading without errors

In-process & In-job restart
- PR 35 - Nested restarter updates for in-process restart to align with in-job
  restart, so users have a consistent experience across in-process and in-job restarts
- Updates to the in-process nested restart functionality provided by the Python Wrapper class and the existing callback infrastructure,
  with additional callbacks and logging
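
As a rough picture of a wrapper that restarts a function in-process and notifies callbacks, here is a minimal single-process sketch. It is not the package's Wrapper class: `RestartSketch`, `max_restarts`, and `on_restart` are hypothetical names, and the sketch leaves out everything distributed (rank coordination, nested restarters, communicator teardown).

```python
import functools
import logging
from typing import Any, Callable, List, Optional

log = logging.getLogger("restart_sketch")

# Callback signature assumed for this sketch: (attempt_number, exception).
RestartCallback = Callable[[int, BaseException], None]


class RestartSketch:
    """Hypothetical decorator: re-runs a function after a failure and
    fires callbacks before each restart."""

    def __init__(self, max_restarts: int = 3,
                 on_restart: Optional[List[RestartCallback]] = None) -> None:
        self.max_restarts = max_restarts
        self.on_restart = list(on_restart or [])

    def __call__(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            attempt = 0
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt >= self.max_restarts:
                        raise  # restart budget exhausted, surface the fault
                    attempt += 1
                    log.warning("restart %d after %r", attempt, exc)
                    for callback in self.on_restart:
                        callback(attempt, exc)
        return wrapped
```

A caller would decorate the training entry point with `@RestartSketch(max_restarts=2, on_restart=[...])`; each callback receives the restart count and the exception that triggered it.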

Known Issues & Limitations

Dependencies:
- In-process requires a PyTorch version that includes the changes in PR 150690, which avoid a
  deadlock in NCCL P2P communications (used in pipeline parallelism)
- In-process requires a Transformer Engine version that includes at least PR 1715 (merged) and PR
  1812 (not yet merged) to reduce cross-restart memory leaks
