Context
When _apply_performance_config() (and similar _apply_* functions in nemo_rl/models/megatron/setup.py) set fields on model_cfg, they do so via plain attribute assignment after the TransformerConfig dataclass has already been constructed and __post_init__ has run. This means MCore's built-in validation (e.g., checking recompute_modules against the allowed set) is bypassed for user-supplied values.
Call chain
MegatronPolicyWorker.__init__ → validate_and_set_config()
validate_and_set_config() → setup_model_config()
setup_model_config() calls ConfigContainer.from_yaml() — __post_init__ runs here
_apply_performance_config() mutates model_cfg via attribute assignment — __post_init__ does not re-run
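The effect of the last step can be reproduced with a toy dataclass. This is a minimal sketch, not MCore code: `ToyConfig` and `ALLOWED_MODULES` are hypothetical stand-ins, and the allowed set shown here is invented for illustration rather than copied from `TransformerConfig`.

```python
from dataclasses import dataclass, field

# Hypothetical allowed set, for illustration only (not MCore's actual list).
ALLOWED_MODULES = {"core_attn", "mlp", "moe"}


@dataclass
class ToyConfig:
    """Stand-in for a config dataclass that validates in __post_init__."""

    recompute_modules: list = field(default_factory=list)

    def __post_init__(self):
        unknown = set(self.recompute_modules) - ALLOWED_MODULES
        if unknown:
            raise ValueError(f"unknown recompute modules: {sorted(unknown)}")


def construct_with_typo_is_rejected() -> bool:
    """Passing the typo to the constructor triggers __post_init__ validation."""
    try:
        ToyConfig(recompute_modules=["mopee"])
    except ValueError:
        return True
    return False


def mutate_with_typo() -> ToyConfig:
    """Assigning the typo after construction skips validation entirely."""
    cfg = ToyConfig()
    cfg.recompute_modules = ["mopee"]  # no error: __post_init__ is not re-run
    return cfg
```

The same value is rejected at construction time and accepted at assignment time, which is the gap described above.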
Problem
A typo in a user config (e.g., recompute_modules: ["mopee"]) is silently accepted. MCore's runtime checks match module names via string comparison, so an unrecognized name is a silent no-op — the user thinks they're saving memory but nothing is actually recomputed.
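To illustrate why the typo is a no-op rather than an error, here is a small sketch of name matching by string comparison. The function names and layer names are hypothetical and only mimic the behavior described above; they are not taken from MCore.

```python
def should_recompute(module_name: str, recompute_modules: list) -> bool:
    """Hypothetical runtime check: a module is recomputed only on an exact name match."""
    return module_name in recompute_modules


def recomputed_layers(layer_names: list, recompute_modules: list) -> list:
    """Return the layers that would actually be recomputed for a given config."""
    return [name for name in layer_names if should_recompute(name, recompute_modules)]
```

With `recompute_modules=["mopee"]` no layer name ever matches, so the result is an empty list and no exception is raised. With `["moe"]` the `moe` layer is selected as intended.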
Proposal
Consider refactoring the config initialization flow so that MCore's __post_init__ validation runs after all _apply_* mutations are complete. Options include:
- Constructing the TransformerConfig from a merged dict (user overrides applied first, then construct once)
- Re-running __post_init__() after all mutations (need to verify safety, since it also sets defaults)
- Extracting MCore's validation into a callable utility and invoking it post-mutation
This would eliminate the need to duplicate validation logic in NeMo-RL and ensure any future MCore validation additions are automatically picked up.
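Two of the options above (construct once from a merged dict, and a callable validation utility invoked after mutation) could look roughly like the sketch below. Everything here is an assumption for illustration: `ToyConfig`, `ALLOWED_MODULES`, `validate_recompute_modules`, `build_from_merged`, and `apply_then_validate` are hypothetical names, and the real `TransformerConfig` has many more fields and checks.

```python
from dataclasses import dataclass, field

# Hypothetical allowed set, for illustration only (not MCore's actual list).
ALLOWED_MODULES = {"core_attn", "mlp", "moe"}


def validate_recompute_modules(modules: list) -> None:
    """Validation extracted into a callable so it can run outside __post_init__."""
    unknown = set(modules) - ALLOWED_MODULES
    if unknown:
        raise ValueError(f"unknown recompute modules: {sorted(unknown)}")


@dataclass
class ToyConfig:
    recompute_modules: list = field(default_factory=list)

    def __post_init__(self):
        validate_recompute_modules(self.recompute_modules)


def build_from_merged(base: dict, overrides: dict) -> ToyConfig:
    """Option: merge user overrides first, then construct (and validate) once."""
    merged = {**base, **overrides}
    return ToyConfig(**merged)


def apply_then_validate(cfg: ToyConfig, overrides: dict) -> ToyConfig:
    """Option: keep attribute assignment, but call the validator afterwards."""
    for name, value in overrides.items():
        setattr(cfg, name, value)
    validate_recompute_modules(cfg.recompute_modules)
    return cfg


def rejects(fn) -> bool:
    """Small helper: True if calling fn raises ValueError."""
    try:
        fn()
    except ValueError:
        return True
    return False
```

In both variants the typo `["mopee"]` raises `ValueError` instead of being silently accepted. The first variant reuses `__post_init__` unchanged. The second avoids re-running the default-setting logic but depends on the validation being separable from it.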
Related
Surfaced during review of PR #2280 (selective activation checkpointing).