What
Add a parameter to wrapper_util.py, e.g. use_fully_parallel_wrapper, defaulting to False. When True, wrap the save_strategy and load_strategy with FullyParallelSaveStrategyWrapper and FullyParallelLoadStrategyWrapper respectively (from megatron.core.dist_checkpointing.strategies.fully_parallel).
Add thorough tests accordingly.
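A minimal sketch of the proposed change. The function name build_strategies and the stand-in wrapper classes are assumptions for illustration only; the real implementation would import the wrappers from megatron.core.dist_checkpointing.strategies.fully_parallel and fit the flag into whatever entry point wrapper_util.py already exposes.

```python
# Stand-ins that mimic the Megatron-Core fully-parallel wrappers
# (assumption for illustration; real code imports the actual classes).
class FullyParallelSaveStrategyWrapper:
    def __init__(self, strategy):
        self.base_strategy = strategy


class FullyParallelLoadStrategyWrapper:
    def __init__(self, strategy):
        self.base_strategy = strategy


def build_strategies(save_strategy, load_strategy,
                     use_fully_parallel_wrapper=False):
    """Hypothetical wrapper_util.py helper: optionally wrap both
    strategies for fully parallel checkpoint save/load."""
    if use_fully_parallel_wrapper:
        save_strategy = FullyParallelSaveStrategyWrapper(save_strategy)
        load_strategy = FullyParallelLoadStrategyWrapper(load_strategy)
    return save_strategy, load_strategy


# Default (False): strategies pass through unchanged.
save, load = build_strategies("base_save", "base_load")
print(save, load)  # base_save base_load

# Opt-in (True): strategies are wrapped.
save, load = build_strategies("base_save", "base_load",
                              use_fully_parallel_wrapper=True)
print(type(save).__name__)  # FullyParallelSaveStrategyWrapper
```

Tests would cover both branches: the default path must return the strategies untouched, and the opt-in path must return wrappers whose base strategy is the original object.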
Why
Gives users the option of distributing checkpoint save/load work evenly across ranks, without extra reliance on rank 0, similar to the recommended settings in NeMo 2.0. Without this, overall training time may be the same or better at smaller cluster sizes, but can degrade at larger sizes where rank 0 becomes a bottleneck.