Expose hardcoded Megatron infrastructure parameters to user config #2229

@nic-nvidia

Summary

NeMo RL hardcodes several Megatron infrastructure parameters (I/O, timeouts, diagnostics) that the underlying framework already supports as configurable. To get good performance on varied storage backends (e.g. NFS), we'd like these exposed in user config.

Parameters to expose

All in nemo_rl/models/megatron/setup.py:

| Parameter | Line | Current | Issue |
|---|---|---|---|
| async_save | 904 | False | Checkpoints block training; async plumbing exists but is disabled |
| init_process_group timeout | 188 | 600s (default) | Too short for 120B weight conversion; env vars don't override it |
| check_for_nan_in_grad | 608 | True | Unnecessary overhead in production |
| logging_level | 600 | 0 | Not overridable |
| fully_parallel_save/load, load_rng | 904 | Hardcoded | No user override for debugging or reproducibility |

Proposed config

checkpointing:
  async_save: true
  fully_parallel_save: true
  fully_parallel_load: true
  load_rng: false

megatron_cfg:
  distributed_timeout_minutes: 120
  check_for_nan_in_grad: false
  logging_level: 0
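
One way the proposed keys could be resolved against today's hardcoded values is sketched below. This is an illustration only: `InfraOverrides` and `resolve_overrides` are hypothetical names, not existing NeMo RL APIs, and the sketch assumes the parsed user config is a plain nested mapping. Defaults mirror the "Current" column above; the three flags whose current values are only listed as "Hardcoded" default to `None`, meaning "leave the existing value untouched".

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class InfraOverrides:
    """Hypothetical container for the proposed settings (not a NeMo RL class)."""

    async_save: bool = False               # current value at line 904
    check_for_nan_in_grad: bool = True     # current value at line 608
    logging_level: int = 0                 # current value at line 600
    distributed_timeout_minutes: int = 10  # 600s default at line 188
    # Current values are not stated in this issue, so None means
    # "keep whatever setup.py hardcodes today".
    fully_parallel_save: Optional[bool] = None
    fully_parallel_load: Optional[bool] = None
    load_rng: Optional[bool] = None


CHECKPOINT_KEYS = ("async_save", "fully_parallel_save", "fully_parallel_load", "load_rng")
MEGATRON_KEYS = ("distributed_timeout_minutes", "check_for_nan_in_grad", "logging_level")


def resolve_overrides(config: Mapping[str, Any]) -> InfraOverrides:
    """Pick up only the keys the user set; everything else keeps its default."""
    ckpt = config.get("checkpointing") or {}
    mcfg = config.get("megatron_cfg") or {}
    picked = {k: ckpt[k] for k in CHECKPOINT_KEYS if k in ckpt}
    picked.update({k: mcfg[k] for k in MEGATRON_KEYS if k in mcfg})
    return InfraOverrides(**picked)


if __name__ == "__main__":
    user_config = {
        "checkpointing": {"async_save": True, "load_rng": False},
        "megatron_cfg": {"distributed_timeout_minutes": 120},
    }
    print(resolve_overrides(user_config))
```

Keeping the defaults equal to the current hardcoded values would make the change backward compatible: a config that sets none of the new keys behaves exactly as it does today.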

Environment

  • NeMo RL v0.5.0rc0, Megatron-LM (3rdparty/Megatron-LM, branch yifu/superv3)
  • 6x DGX B200, Kubernetes 1.31, KubeRay 1.6.0, NFS storage
  • Model: Nemotron 3 Super 120B, GRPO with DAPO-Math-17k

Workarounds

  • async_save: patch line 904 (async_save=True)
  • Timeout: patch line 188 (timeout=timedelta(minutes=120))
  • NaN check: patch line 608 (check_for_nan_in_grad=False)
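
Instead of patching line 188 by hand, the timeout could be derived from config. The sketch below is hedged: `process_group_timeout` is a hypothetical helper, the `distributed_timeout_minutes` key is the one proposed above, and the fallback of 600 seconds is taken from the table in this issue. How setup.py actually wires the value into process-group initialization is not shown here.

```python
from datetime import timedelta
from typing import Any, Mapping

# Fallback matching the 600s default reported in this issue.
DEFAULT_TIMEOUT = timedelta(seconds=600)


def process_group_timeout(megatron_cfg: Mapping[str, Any]) -> timedelta:
    """Return the process-group timeout, preferring the user-supplied minutes."""
    minutes = megatron_cfg.get("distributed_timeout_minutes")
    if minutes is None:
        return DEFAULT_TIMEOUT
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(minutes, bool) or not isinstance(minutes, (int, float)):
        raise ValueError("distributed_timeout_minutes must be a number")
    if minutes <= 0:
        raise ValueError("distributed_timeout_minutes must be positive")
    return timedelta(minutes=minutes)


if __name__ == "__main__":
    timeout = process_group_timeout({"distributed_timeout_minutes": 120})
    print(timeout)
    # The result could then be passed as the `timeout` argument, e.g.
    # torch.distributed.init_process_group(..., timeout=timeout)
```

Validating the value up front surfaces a bad config at startup rather than as a confusing collective timeout hours into a run.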
