PersistentAsyncCaller.__del__ crashes during shutdown: get_rank() after process group destroyed #3775

@psycherun-kaiko

Bug Description

PersistentAsyncCaller.__del__() in megatron/core/dist_checkpointing/strategies/async_utils.py calls close(), which in turn calls torch.distributed.get_rank() (line 429). During Python shutdown or garbage collection the distributed process group has already been destroyed, so get_rank() raises a ValueError. Because the exception is raised inside __del__, Python reports it as "Exception ignored in ..." (see the traceback below).

This happens on every rank at the end of every training run when async checkpointing is enabled.

Traceback

Exception ignored in: <function PersistentAsyncCaller.__del__ at 0x746f0cbc8d60>
Traceback (most recent call last):
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 446, in __del__
    self.close()
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 429, in close
    f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../torch/distributed/distributed_c10d.py", line 2354, in get_rank
    default_pg = _get_default_group()
  File ".../torch/distributed/distributed_c10d.py", line 1306, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Root Cause

async_utils.py:429 unconditionally calls torch.distributed.get_rank() inside a debug log message in close(). When close() is called from __del__ during garbage collection, the process group is already torn down.

Suggested Fix

Guard the get_rank() call in close():

def close(self):
    if torch.distributed.is_initialized():
        logger.debug(
            f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
        )
    # ... rest of cleanup
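
A possible variant, in case the log line should still be emitted after teardown: look up the rank through a small helper that falls back to a placeholder. This is only a sketch under assumptions. rank_label, TornDownDist, LiveDist and the close() shown here are hypothetical names for illustration; the helper takes the module as a parameter so it can be exercised without a real process group. The only torch.distributed calls it relies on are is_available(), is_initialized() and get_rank().

```python
import logging

logger = logging.getLogger(__name__)


def rank_label(dist) -> str:
    """Return the current rank as text, or "unknown" if it cannot be queried.

    `dist` is expected to expose is_available(), is_initialized() and
    get_rank() the way torch.distributed does.
    """
    try:
        if dist.is_available() and dist.is_initialized():
            return str(dist.get_rank())
    except (ValueError, RuntimeError):
        # The group may be torn down between the check and the call.
        pass
    return "unknown"


class TornDownDist:
    """Hypothetical stand-in for torch.distributed after the group is destroyed."""

    @staticmethod
    def is_available() -> bool:
        return True

    @staticmethod
    def is_initialized() -> bool:
        return False

    @staticmethod
    def get_rank() -> int:
        raise ValueError("Default process group has not been initialized")


class LiveDist(TornDownDist):
    """Hypothetical stand-in for an initialized process group on rank 3."""

    @staticmethod
    def is_initialized() -> bool:
        return True

    @staticmethod
    def get_rank() -> int:
        return 3


def close(dist) -> None:
    # Hypothetical close(): always logs, and the rank lookup cannot raise.
    logger.debug("PersistentAsyncCaller: %s, Destroying Async Caller", rank_label(dist))
    # ... rest of cleanup
```

In real code the argument would be torch.distributed itself, e.g. rank_label(torch.distributed). The try/except is there on top of the is_initialized() check because the group could in principle be destroyed between the check and the get_rank() call.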

Environment

  • Megatron-Core (via Megatron-Bridge 0.3.0, NGC nemo:base-26.02 image)
  • PyTorch 2.x
  • Python 3.12
  • Async checkpoint saving enabled (config.checkpoint.async_save = True)
