
hexinw-nvidia and others added 30 commits August 30, 2025 12:51
2) Fixed logger format for regular stderr/stdout handler.
3) world_local_tmp can be None if it is not specified in the ENV.
`TORCH_NCCL_BUFFER_SIZE` (torch < 2.8), `TORCH_FR_BUFFER_SIZE` (torch >= 2.8) additionally
Fixes error propagation during checkpoint saving by using torch's DistWrapper to send the exception to the coordinator instead of killing the process early. Also fixes the call path to ensure necessary operations happen in finally blocks.

Adds a test and slightly refactors the multiprocessing invocation to allow overriding the open() function with one that raises an exception, which enables the test. In the future we will factor out the filesystem operations into a delegate so it can be more easily mocked.
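The injection technique the commit describes can be sketched as follows. This is a minimal single-process illustration, not the PR's actual code: `save_checkpoint`, `CheckpointWriteError`, and `failing_open` are hypothetical names, and the real change threads the override through the multiprocessing invocation.

```python
import builtins

class CheckpointWriteError(RuntimeError):
    """Raised when a checkpoint write fails, instead of killing the worker."""

def save_checkpoint(path, data, open_fn=builtins.open):
    # open_fn is injectable so a test can substitute a failing implementation,
    # mirroring the commit's override of open() with one that raises.
    try:
        with open_fn(path, "w") as f:
            f.write(repr(data))
    except OSError as exc:
        # Surface a structured error for propagation to the coordinator
        # rather than letting the process die early.
        raise CheckpointWriteError(f"failed to write checkpoint to {path}") from exc

def failing_open(*args, **kwargs):
    # Stand-in for the test's exception-raising open().
    raise OSError("simulated storage failure")
```

A test then passes `failing_open` and asserts that the structured error is raised, exercising the error-propagation path without needing a real storage fault.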
Consider torch version for FR env variables
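Per the commit messages above, the flight-recorder buffer-size variable is `TORCH_NCCL_BUFFER_SIZE` before torch 2.8 and `TORCH_FR_BUFFER_SIZE` from 2.8 on. A version check along these lines could select the right name (`fr_buffer_env_var` is a hypothetical helper, not from the PR):

```python
import re

def fr_buffer_env_var(torch_version: str) -> str:
    """Pick the flight-recorder buffer-size env var name for a torch version."""
    major, minor = map(int, re.match(r"(\d+)\.(\d+)", torch_version).groups())
    # The variable name changed at torch 2.8, per the commit message above.
    if (major, minor) >= (2, 8):
        return "TORCH_FR_BUFFER_SIZE"
    return "TORCH_NCCL_BUFFER_SIZE"
```

In practice this would be driven by `torch.__version__`; the string parameter keeps the sketch testable without torch installed.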
- Update async_ckpt.py example with improved functionality
- Update async_writer.py with MSC support and bug fixes
- Update local_ckpt.py example
- Update usage guide documentation
Make logging exhaustive UT optional
checkpointing: fix error propagation and add test
examples: update to add MSC support, fix multi-server support, and update docs
Use previous logging file when available
hexinw-nvidia and others added 22 commits October 7, 2025 16:15
fix: require explicit rdzv-endpoint for c10d backend
feat: Add infrastructure rank support and optimize section monitoring
fix(async_ckpt): prevent cross-call state pollution in AsyncRequest
feat: Flight recorder attribution module
…er_abort

Training exit after an Inprocess abort should ensure a clean shutdown of the PersistentAsync worker
@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Oct 14, 2025