Bug Description
PersistentAsyncCaller.__del__() in megatron/core/dist_checkpointing/strategies/async_utils.py invokes close(), which calls torch.distributed.get_rank() in a debug log message (line 429). During Python shutdown/garbage collection the default process group has already been destroyed, so the call raises a ValueError.
This happens on every rank at the end of every training run when async checkpointing is enabled.
Traceback
```
Exception ignored in: <function PersistentAsyncCaller.__del__ at 0x746f0cbc8d60>
Traceback (most recent call last):
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 446, in __del__
    self.close()
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 429, in close
    f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../torch/distributed/distributed_c10d.py", line 2354, in get_rank
    default_pg = _get_default_group()
  File ".../torch/distributed/distributed_c10d.py", line 1306, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```
Root Cause
async_utils.py:429 unconditionally calls torch.distributed.get_rank() inside a debug log message in close(). When close() is called from __del__ during garbage collection, the process group is already torn down, and because the f-string argument is evaluated eagerly, get_rank() raises even when debug logging is disabled.
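The teardown ordering is easy to reproduce without torch. The sketch below uses hypothetical stand-ins for torch.distributed's `is_initialized()`/`get_rank()` to mimic a process group that disappears before `__del__` runs:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for torch.distributed's default process group,
# which the interpreter tears down before __del__ runs.
_default_pg = object()

def is_initialized():
    return _default_pg is not None

def get_rank():
    if _default_pg is None:
        raise ValueError("Default process group has not been initialized")
    return 0

class PersistentAsyncCaller:
    def close(self):
        # Unguarded call, mirroring async_utils.py:429: the f-string
        # evaluates get_rank() even if debug logging is disabled.
        logger.debug(f"PersistentAsyncCaller: {get_rank()}, Destroying Async Caller")

    def __del__(self):
        self.close()

caller = PersistentAsyncCaller()
_default_pg = None           # simulate process-group teardown at shutdown
try:
    caller.close()           # same path __del__ takes
    failed = False
except ValueError:
    failed = True
print("unguarded close() raised ValueError:", failed)

# The guard from the suggested fix below takes the safe path instead:
if is_initialized():
    logger.debug(f"PersistentAsyncCaller: {get_rank()}, Destroying Async Caller")
print("guarded path completed without error")

del PersistentAsyncCaller.__del__   # avoid a second (ignored) failure at exit
```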
Suggested Fix
Guard the get_rank() call in close():
```python
def close(self):
    if torch.distributed.is_initialized():
        logger.debug(
            f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
        )
    # ... rest of cleanup
```

Environment
- Megatron-Core (via Megatron-Bridge 0.3.0, NGC nemo:base-26.02 image)
- PyTorch 2.x
- Python 3.12
- Async checkpoint saving enabled (config.checkpoint.async_save = True)