Summary
NeMo RL hardcodes several Megatron infrastructure parameters (I/O, timeouts, diagnostics) that the underlying framework already exposes as configurable options. To tune performance on varied storage backends (e.g. NFS), we'd like these exposed in the user config.
Parameters to expose
All in nemo_rl/models/megatron/setup.py:
| Parameter | Line | Current | Issue |
| --- | --- | --- | --- |
| async_save | 904 | False | Checkpoints block training; async plumbing exists but is disabled |
| init_process_group timeout | 188 | 600s (default) | Too short for 120B weight conversion; env vars don't override it |
| check_for_nan_in_grad | 608 | True | Unnecessary overhead in production |
| logging_level | 600 | 0 | Not overridable |
| fully_parallel_save/load, load_rng | 904 | Hardcoded | No user override for debugging or reproducibility |
Proposed config
checkpointing:
  async_save: true
  fully_parallel_save: true
  fully_parallel_load: true
  load_rng: false
megatron_cfg:
  distributed_timeout_minutes: 120
  check_for_nan_in_grad: false
  logging_level: 0
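For illustration, here is a minimal sketch of how setup.py could resolve these values while falling back to the current hardcoded ones when a key is absent. `resolve_infra_settings` and `CURRENT_DEFAULTS` are hypothetical names, not existing NeMo RL APIs, and treating `checkpointing` and `megatron_cfg` as top-level keys of a plain dict is an assumption about the config layout.

```python
from datetime import timedelta

# Current hardcoded values, taken from the table above.
# 600s is expressed in minutes to match the proposed config key.
CURRENT_DEFAULTS = {
    "async_save": False,
    "distributed_timeout_minutes": 10,
    "check_for_nan_in_grad": True,
    "logging_level": 0,
}


def resolve_infra_settings(config: dict) -> dict:
    """Hypothetical helper: merge user overrides with today's defaults."""
    ckpt = config.get("checkpointing") or {}
    mcfg = config.get("megatron_cfg") or {}

    timeout_minutes = mcfg.get(
        "distributed_timeout_minutes",
        CURRENT_DEFAULTS["distributed_timeout_minutes"],
    )
    resolved = {
        "async_save": ckpt.get("async_save", CURRENT_DEFAULTS["async_save"]),
        # Value that would be passed as the init_process_group timeout.
        "timeout": timedelta(minutes=timeout_minutes),
        "check_for_nan_in_grad": mcfg.get(
            "check_for_nan_in_grad", CURRENT_DEFAULTS["check_for_nan_in_grad"]
        ),
        "logging_level": mcfg.get(
            "logging_level", CURRENT_DEFAULTS["logging_level"]
        ),
    }

    # The issue does not state the hardcoded values of these options,
    # so they are only passed through when the user sets them.
    for key in ("fully_parallel_save", "fully_parallel_load", "load_rng"):
        if key in ckpt:
            resolved[key] = ckpt[key]
    return resolved
```

With an empty config this returns the present behaviour (async save off, 600s timeout, NaN check on), so existing recipes would be unaffected unless they opt in.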
Environment
- NeMo RL v0.5.0rc0, Megatron-LM (3rdparty/Megatron-LM, branch yifu/superv3)
- 6x DGX B200, Kubernetes 1.31, KubeRay 1.6.0, NFS storage
- Model: Nemotron 3 Super 120B, GRPO with DAPO-Math-17k
Workarounds
- async_save: patch line 904 (async_save=True)
- Timeout: patch line 188 (timeout=timedelta(minutes=120))
- NaN check: patch line 608 (check_for_nan_in_grad=False)