Bug Description
PersistentAsyncCaller.__del__() in megatron/core/dist_checkpointing/strategies/async_utils.py invokes close(), which calls torch.distributed.get_rank() in a debug log message (line 429). During Python shutdown/garbage collection the default process group has already been destroyed, so the call raises a ValueError.
This happens on every rank at the end of every training run when async checkpointing is enabled.
Traceback
```
Exception ignored in: <function PersistentAsyncCaller.__del__ at 0x746f0cbc8d60>
Traceback (most recent call last):
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 446, in __del__
    self.close()
  File ".../megatron/core/dist_checkpointing/strategies/async_utils.py", line 429, in close
    f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../torch/distributed/distributed_c10d.py", line 2354, in get_rank
    default_pg = _get_default_group()
  File ".../torch/distributed/distributed_c10d.py", line 1306, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```
Root Cause
async_utils.py:429 unconditionally calls torch.distributed.get_rank() inside a debug log message in close(). When close() is called from __del__ during garbage collection, the process group is already torn down, and because the f-string argument is evaluated eagerly, get_rank() raises even when debug logging is disabled.
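The teardown ordering is easy to reproduce without torch. The sketch below uses hypothetical stand-ins for torch.distributed's `is_initialized()`/`get_rank()` to mimic a process group that disappears before `__del__` runs:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for torch.distributed's default process group,
# which the interpreter tears down before __del__ runs.
_default_pg = object()

def is_initialized():
    return _default_pg is not None

def get_rank():
    if _default_pg is None:
        raise ValueError("Default process group has not been initialized")
    return 0

class PersistentAsyncCaller:
    def close(self):
        # Unguarded call, mirroring async_utils.py:429: the f-string
        # evaluates get_rank() even if debug logging is disabled.
        logger.debug(f"PersistentAsyncCaller: {get_rank()}, Destroying Async Caller")

    def __del__(self):
        self.close()

caller = PersistentAsyncCaller()
_default_pg = None           # simulate process-group teardown at shutdown
try:
    caller.close()           # same path __del__ takes
    failed = False
except ValueError:
    failed = True
print("unguarded close() raised ValueError:", failed)

# The guard from the suggested fix below takes the safe path instead:
if is_initialized():
    logger.debug(f"PersistentAsyncCaller: {get_rank()}, Destroying Async Caller")
print("guarded path completed without error")

del PersistentAsyncCaller.__del__   # avoid a second (ignored) failure at exit
```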
Suggested Fix
Guard the get_rank() call in close():
```python
def close(self):
    if torch.distributed.is_initialized():
        logger.debug(
            f"PersistentAsyncCaller: {torch.distributed.get_rank()}, Destroying Async Caller"
        )
    # ... rest of cleanup
```

Environment
- Megatron-Core (via Megatron-Bridge 0.3.0, NGC nemo:base-26.02 image)
- PyTorch 2.x
- Python 3.12
- Async checkpoint saving enabled (config.checkpoint.async_save = True)