v0.3.0

@srogawski-nvidia released this 18 Mar 05:46

Release Notes

NVIDIA Resiliency Extension (NVRx) is a Python package that framework developers and users can use to implement fault-tolerant features. It improves effective training time by minimizing downtime caused by failures and interruptions.

NVIDIA Resiliency Extension v0.3

Highlights

  • Support for Blackwell GPU

  • ARM based host CPU support

  • In-process & In-job restart

    • Hierarchical in-process and in-job restart support
    • Warm spare support
  • Health checks

    • GPU health check based on NVML
    • NIC health check
  • Checkpointing

    • Checkpointing capabilities that used to be part of Megatron Core have been refactored into NVRx. The checkpointing feature will be maintained as part of NVRx, and Megatron Core and NeMo will use the NVRx code in the future.

Known Issues & Limitations

  • GPU health check requires driver >= 570
  • Checkpointing: persistent queue with replication is not supported
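
The driver requirement above can be checked before enabling the GPU health check. The following is a minimal sketch, not part of NVRx: the helper names `parse_driver_major` and `driver_supports_gpu_health_check` are hypothetical, and it assumes that "driver >= 570" refers to the major component of the NVIDIA driver version string (for example, the value reported by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`).

```python
def parse_driver_major(version: str) -> int:
    """Return the major component of a driver version string such as '570.86.15'."""
    # Strip surrounding whitespace/newlines, then take the text before the first dot.
    return int(version.strip().split(".")[0])


def driver_supports_gpu_health_check(version: str, minimum_major: int = 570) -> bool:
    """Hypothetical helper: True if the driver major version meets the minimum.

    The default of 570 comes from the known-issues list; comparing only the
    major component is an assumption made for this sketch.
    """
    return parse_driver_major(version) >= minimum_major


if __name__ == "__main__":
    # Example version strings chosen for illustration only.
    for v in ("550.54.14", "570.86.15"):
        print(v, driver_supports_gpu_health_check(v))
```

A caller could use the result to skip the GPU health check, rather than fail at runtime, on hosts with an older driver.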

Contributors

@apaithankar @grzegorz-k-karch @hexinw-nvidia @jbieniusiewi @j-szulc @mikolajblaz @sbak5 @skierat @srogawski-nvidia @szmigacz @yzhautouskay