@limou102 limou102 commented Nov 4, 2025

  1. Adapt to the latest Megatron-LM to support the torch_dist checkpoint format.
    Add the two checkpoint arguments below, which exist in the latest Megatron-LM:
    dist_ckpt_save_pre_mcore_014
    dist_ckpt_optim_fully_reshardable
  2. Update the async checkpoint patch logic.
    In ROCm >= 7.1, the HIP runtime fixes a bug in which child processes hit a segmentation fault when they accessed pinned memory that the main process had allocated before forking them.
    See: libhsakmt: Don't use MADV_DONTFORK for paged memory ROCm/rocm-systems#356

The previous patch for FileSystemWriterAsync is therefore no longer needed on ROCm >= 7.1 and can be removed in that case.
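
The version gate described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `parse_rocm_version` and `needs_async_writer_patch` are hypothetical, and keeping the workaround when the version is unknown is an assumption made here to stay conservative.

```python
import re
from typing import Optional, Tuple

# Per the description above, ROCm 7.1 is the first release with the fixed HIP runtime.
FIXED_ROCM_VERSION = (7, 1)


def parse_rocm_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a string such as '7.1.0-12345'.

    Returns None when the string is missing or does not start with 'N.N'.
    """
    if not version:
        return None
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_async_writer_patch(version: Optional[str]) -> bool:
    """Return True when the FileSystemWriterAsync workaround should still be applied."""
    parsed = parse_rocm_version(version)
    if parsed is None:
        # Assumption: an unknown version keeps the workaround in place.
        return True
    # Tuple comparison orders (major, minor) pairs correctly, e.g. (7, 0) < (7, 1) < (10, 0).
    return parsed < FIXED_ROCM_VERSION
```

On a ROCm build of PyTorch, `torch.version.hip` is one possible source for the version string; whether this PR reads the version from there is not stated in the description.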

…mat, and update the async checkpoint patch logic
@wenxie-amd wenxie-amd merged commit 0426a42 into main Nov 4, 2025
3 checks passed

4 participants