NVIDIA Resiliency Extension v0.2.1
Release Notes
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.
Highlights
This release includes important bug fixes, performance improvements, and minor updates to enhance stability:
Build Fixes & Code Improvements
- Fixed a missing #include to ensure proper compilation in the pytorch:24.12-py3 container.
- Added lazy loading of cupti_module.
- Fixed ForkingPickler to ensure proper installation in the pytorch:25.01-py3 container.
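
For readers unfamiliar with the lazy-loading item above, the general Python pattern can be sketched as follows. This is a minimal illustration, not the package's actual implementation: LazyModule is a hypothetical helper, and the standard-library json module stands in for cupti_module, which is assumed to be unavailable outside a CUDA-enabled build.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Hypothetical helper: defer importing a module until first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        # Import on first use only; later calls reuse the cached module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    @property
    def loaded(self) -> bool:
        return self._module is not None

    def __getattr__(self, attr: str) -> Any:
        # Invoked only when normal lookup fails, i.e. for attributes
        # that belong to the wrapped module rather than to this proxy.
        return getattr(self._load(), attr)


# Stand-in for cupti_module (assumption): nothing is imported at this point.
lazy_json = LazyModule("json")
print(lazy_json.loaded)          # import has not happened yet
print(lazy_json.dumps([1, 2]))   # first attribute access triggers the import
print(lazy_json.loaded)
```

Deferring the import this way means that code paths which never touch the module do not pay its load cost and do not fail at import time if the module is unavailable.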