Expose hardcoded Megatron infrastructure parameters to user config

## Summary

NeMo RL hardcodes several Megatron infrastructure parameters (I/O, timeouts, diagnostics) that the underlying framework already accepts as configuration. We'd like these exposed in user config so they can be tuned per deployment, for example to avoid blocking checkpoint saves on slower storage backends such as NFS.

## Parameters to expose

All in `nemo_rl/models/megatron/setup.py`:

| Parameter | Line | Current | Issue |
|-----------|------|---------|-------|
| `async_save` | 904 | `False` | Checkpoints block training; async plumbing exists but is disabled |
| `init_process_group` timeout | 188 | 600s (default) | Too short for 120B weight conversion; env vars don't override it |
| `check_for_nan_in_grad` | 608 | `True` | Unnecessary overhead in production |
| `logging_level` | 600 | `0` | Not overridable |
| `fully_parallel_save`, `fully_parallel_load`, `load_rng` | 904 | Hardcoded | No user override for debugging or reproducibility |

## Proposed config

```yaml
checkpointing:
  async_save: true
  fully_parallel_save: true
  fully_parallel_load: true
  load_rng: false

megatron_cfg:
  distributed_timeout_minutes: 120
  check_for_nan_in_grad: false
  logging_level: 0
```
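One way the proposed keys could be resolved against today's behaviour is sketched below. This is only an illustration: `InfraOverrides` and `resolve_overrides` are hypothetical names, not existing NeMo RL APIs, and the defaults are taken from the "Current" column of the table above. The current values of `fully_parallel_save`, `fully_parallel_load`, and `load_rng` are not stated in this issue, so the sketch uses `None` to mean "keep whatever `setup.py` sets today".

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Keys proposed in this issue, grouped by the config section they would live in.
CHECKPOINT_KEYS = ("async_save", "fully_parallel_save", "fully_parallel_load", "load_rng")
MEGATRON_KEYS = ("distributed_timeout_minutes", "check_for_nan_in_grad", "logging_level")


@dataclass(frozen=True)
class InfraOverrides:
    """Hypothetical container for the exposed parameters (not a NeMo RL class)."""

    async_save: bool = False               # current hardcoded value per the table
    check_for_nan_in_grad: bool = True     # current hardcoded value per the table
    logging_level: int = 0                 # current hardcoded value per the table
    distributed_timeout_minutes: int = 10  # 600s default per the table
    # Current values are not stated in the issue; None means "do not override".
    fully_parallel_save: Optional[bool] = None
    fully_parallel_load: Optional[bool] = None
    load_rng: Optional[bool] = None


def resolve_overrides(config: Mapping[str, Any]) -> InfraOverrides:
    """Pick the proposed keys out of a parsed user config, keeping defaults otherwise."""
    ckpt = config.get("checkpointing") or {}
    mcfg = config.get("megatron_cfg") or {}
    picked = {k: ckpt[k] for k in CHECKPOINT_KEYS if k in ckpt}
    picked.update({k: mcfg[k] for k in MEGATRON_KEYS if k in mcfg})
    return InfraOverrides(**picked)


# The proposed config above, as it would look after YAML parsing.
proposed = {
    "checkpointing": {
        "async_save": True,
        "fully_parallel_save": True,
        "fully_parallel_load": True,
        "load_rng": False,
    },
    "megatron_cfg": {
        "distributed_timeout_minutes": 120,
        "check_for_nan_in_grad": False,
        "logging_level": 0,
    },
}
overrides = resolve_overrides(proposed)
```

With an empty config, `resolve_overrides({})` returns the defaults, so existing runs would behave exactly as they do now unless a key is set explicitly.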

## Environment

- NeMo RL v0.5.0rc0, Megatron-LM (`3rdparty/Megatron-LM`, branch `yifu/superv3`)
- 6x DGX B200, Kubernetes 1.31, KubeRay 1.6.0, NFS storage
- Model: Nemotron 3 Super 120B, GRPO with DAPO-Math-17k

## Workarounds

- `async_save`: patch line 904 (`async_save=True`)
- Timeout: patch line 188 (`timeout=timedelta(minutes=120)`)
- NaN check: patch line 608 (`check_for_nan_in_grad=False`)
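For the timeout workaround, the value passed at the patched call site could come from a small helper like the one below. This is a sketch only: `process_group_timeout_kwargs` is a hypothetical name, and the only library fact it relies on is that `torch.distributed.init_process_group` accepts a `timeout` argument of type `datetime.timedelta`.

```python
from datetime import timedelta


def process_group_timeout_kwargs(timeout_minutes: int) -> dict:
    """Build the `timeout` keyword argument for `init_process_group`."""
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return {"timeout": timedelta(minutes=timeout_minutes)}


# Intended use at the patched call site, alongside its existing arguments
# (requires torch, so it is left commented out here):
# torch.distributed.init_process_group(**process_group_timeout_kwargs(120))
kwargs = process_group_timeout_kwargs(120)
```

Reading the minutes from `megatron_cfg.distributed_timeout_minutes` instead of a literal would turn this patch into the config option proposed above.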
