Skip to content

v0.4.0

Choose a tag to compare

@hexinw-nvidia hexinw-nvidia released this 28 May 06:22
· 464 commits to main since this release

Release Notes

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.4.0

Highlights

  • Checkpointing

    • PR 29 - Support for storing checkpoints to cloud object stores
      • Leverage cloud storage provider’s multithreaded SDK for rapid loading and saving checkpoints to object stores such as AWS S3, Azure Blob
        Storage, Google Cloud Storage and more using NVIDIA Multi-storage Client.
      • Provide scalable, reliable, cheaper, single source of truth across clouds/regions
      • Provide opt-out configuration when creating FileSystemWriterAsync class instance to allow users to passthrough to the filesystem
    • PR 36 - Critical bug fix to enable async checkpoint loading without errors
  • In-process & In-job restart

    • PR 35 - Nested restarter updates for in-process restart to align with in-job
      restart, so users have a consistent experience across in-process and in-job restarts
    • Updates to in-process nested restart functionality provided by Python Wrapper class and existing callback infrastructure with additional
      callbacks and logging

Known Issues & Limitations

  • Dependencies:
    • In-process requires Pytorch, at least version, that includes changes in PR 150690 to avoid
      deadlock in NCCL P2P communications (used in pipeline parallel)
    • In-process requires Transformer Engine including at least PR 1715 (merged) and PR
      1812
      (not yet merged) to reduce cross-restart memory leaks