
hexinw-nvidia and others added 30 commits August 30, 2025 12:51
2) Fixed logger format for regular stderr/stdout handler.
3) world_local_tmp can be None if it is not specified in the ENV.
`TORCH_NCCL_BUFFER_SIZE` (torch < 2.8), `TORCH_FR_BUFFER_SIZE` (torch >= 2.8) additionally
Fixes error propagation during checkpoint saving by using torch's DistWrapper to send the exception to the coordinator instead of killing the process early. Also fixes the call path to ensure necessary operations happen in finally blocks.

Adds a test and slightly refactors the multiprocessing invocation to allow overriding the open() function with one that raises an exception, which enables the test. In the future we will factor out the filesystem operations into a delegate so it can be more easily mocked.
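The injection technique the commit describes can be sketched as follows. This is a minimal single-process illustration, not the PR's actual code: `save_checkpoint`, `CheckpointWriteError`, and `failing_open` are hypothetical names, and the real change threads the override through the multiprocessing invocation.

```python
import builtins

class CheckpointWriteError(RuntimeError):
    """Raised when a checkpoint write fails, instead of killing the worker."""

def save_checkpoint(path, data, open_fn=builtins.open):
    # open_fn is injectable so a test can substitute a failing implementation,
    # mirroring the commit's override of open() with one that raises.
    try:
        with open_fn(path, "w") as f:
            f.write(repr(data))
    except OSError as exc:
        # Surface a structured error for propagation to the coordinator
        # rather than letting the process die early.
        raise CheckpointWriteError(f"failed to write checkpoint to {path}") from exc

def failing_open(*args, **kwargs):
    # Stand-in for the test's exception-raising open().
    raise OSError("simulated storage failure")
```

A test then passes `failing_open` and asserts that the structured error is raised, exercising the error-propagation path without needing a real storage fault.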
Consider torch version for FR env variables
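Per the commit messages above, the flight-recorder buffer-size variable is `TORCH_NCCL_BUFFER_SIZE` before torch 2.8 and `TORCH_FR_BUFFER_SIZE` from 2.8 on. A version check along these lines could select the right name (`fr_buffer_env_var` is a hypothetical helper, not from the PR):

```python
import re

def fr_buffer_env_var(torch_version: str) -> str:
    """Pick the flight-recorder buffer-size env var name for a torch version."""
    major, minor = map(int, re.match(r"(\d+)\.(\d+)", torch_version).groups())
    # The variable name changed at torch 2.8, per the commit message above.
    if (major, minor) >= (2, 8):
        return "TORCH_FR_BUFFER_SIZE"
    return "TORCH_NCCL_BUFFER_SIZE"
```

In practice this would be driven by `torch.__version__`; the string parameter keeps the sketch testable without torch installed.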
- Update async_ckpt.py example with improved functionality
- Update async_writer.py with MSC support and bug fixes
- Update local_ckpt.py example
- Update usage guide documentation
Make logging exhaustive UT optional
checkpointing: fix error propagation and add test
examples: update to add MSC support, fix multi-server support, and update docs
Use previous logging file when available
hexinw-nvidia and others added 22 commits October 7, 2025 16:15
fix: require explicit rdzv-endpoint for c10d backend
feat: Add infrastructure rank support and optimize section monitoring
fix(async_ckpt): prevent cross-call state pollution in AsyncRequest
feat: Flight recorder attribution module
…er_abort

Training exit after an Inprocess abort should ensure a clean shutdown of the PersistentAsync worker
@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Oct 14, 2025