NVIDIA Resiliency Extension v0.2.1

@srogawski-nvidia released this 22 Feb 00:49

Release Notes

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

Highlights

This release includes important bug fixes, performance improvements, and minor updates to enhance stability:

Build Fixes & Code Improvements

  • Fixed a missing #include to ensure proper compilation in the pytorch:24.12-py3 container.
  • Enabled lazy loading of cupti_module.
  • Fixed ForkingPickler to ensure proper installation in the pytorch:25.01-py3 container.
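
For readers unfamiliar with the lazy-loading change listed above, the general idea can be sketched as follows. This is only an illustration of the technique, not the package's actual code: `LazyModule` is a hypothetical helper, and the standard-library `json` module stands in for `cupti_module`, whose import path and API are not described in these notes.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Hypothetical wrapper that defers an import until first attribute use."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    @property
    def loaded(self) -> bool:
        # True once the wrapped module has actually been imported.
        return self._module is not None

    def __getattr__(self, attr: str) -> Any:
        # Python calls __getattr__ only when normal lookup fails, so this
        # runs for names that belong to the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Stand-in usage: "json" takes the place of the real extension module.
lazy_json = LazyModule("json")
print(lazy_json.loaded)   # import has not happened yet
text = lazy_json.dumps({"a": 1})
print(lazy_json.loaded)   # first attribute access triggered the import
```

Deferring the import this way means that environments which never use the wrapped module do not pay its import cost or fail when it is unavailable; the trade-off is that an import error surfaces at first use rather than at startup.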

Contributors

@maanug-nv @jbieniusiewi @srogawski-nvidia