Commit 5f7b16f

Merge pull request #95 from anjalibshah/main
Release notes v0.4.0
1 parent 4673577 commit 5f7b16f

1 file changed, +26 -1 lines changed

docs/source/release-notes.md

Lines changed: 26 additions & 1 deletion
@@ -2,6 +2,31 @@

 NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

+## NVIDIA Resiliency Extension v0.4.0
+
+### Highlights
+
+- Checkpointing
+  - [PR 29](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/29) - Support for storing checkpoints to cloud object stores
+    - Leverage the cloud storage provider’s multithreaded SDK for rapid loading and saving of checkpoints to object stores such as AWS S3, Azure Blob
+      Storage, Google Cloud Storage, and more, using NVIDIA Multi-storage Client.
+    - Provide a scalable, reliable, and cheaper single source of truth across clouds and regions
+    - Provide an opt-out configuration when creating a FileSystemWriterAsync class instance, so users can pass through to the filesystem
+  - [PR 36](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/36) - Critical bug fix to enable async checkpoint loading without errors
+
+- In-process & In-job restart
+  - [PR 35](https://github.com/NVIDIA/nvidia-resiliency-ext/pull/35) - Nested restarter updates for in-process restart to align with in-job
+    restart, so users have a consistent experience across in-process and in-job restarts
+    - Updates to the in-process nested restart functionality provided by the Python Wrapper class and the existing callback infrastructure, with additional
+      callbacks and logging
+
+### Known Issues & Limitations
+- Dependencies:
+  - In-process requires PyTorch, at least the [2.8.0.dev20250418 nightly](https://github.com/orgs/pytorch/packages/container/pytorch-nightly/398218496?tag=2.8.0.dev20250418-cuda12.6-cudnn9-devel), which includes the changes in [PR 150690](https://github.com/pytorch/pytorch/pull/150690) to avoid
+    deadlock in NCCL P2P communications (used in pipeline parallel)
+  - In-process requires a Transformer Engine build that includes at least [PR 1715](https://github.com/NVIDIA/TransformerEngine/pull/1715) (merged) and [PR
+    1812](https://github.com/NVIDIA/TransformerEngine/pull/1812) (not yet merged) to reduce cross-restart memory leaks
+
 ## NVIDIA Resiliency Extension v0.3.0

 ### Highlights
@@ -16,7 +41,7 @@ NVIDIA Resiliency Extension is a Python package for framework developers and use
 - NIC
 - Checkpointing
 - Existing capabilities that used to be part of Megatron Core are refactored to be part of NVRx. The checkpointing feature will be maintained as part of NVRx, and Megatron Core and NeMo will use the code from NVRx in the future.
-- Added support for checkpoint metadata caching to improve performance for subsequent checkpoints
+- Added support for checkpoint metadata caching to improve performance for subsequent checkpoints.

 ### Known Issues & Limitations
