Hello,
and using a smaller model, because I noticed this in the NeMo code: NeMo/nemo/lightning/_strategy_lib.py, line 92 at dc08edd. I am not very familiar with training or HPC applications, but why does this feature require MPI rather than NCCL? I read this blog post to try to understand what tensor parallel communication overlap is, but I can't figure out why MPI is needed for it. Thanks!
Answered by ashors1 on May 2, 2025
Replies: 1 comment
Hi, MPI is used by default to bootstrap the user buffers (see the TransformerEngine documentation here). However, NCCL bootstrap should also be supported now. You can try setting `tp_comm_bootstrap_backend="nccl"` here.
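For reference, a minimal sketch of where that setting could go in a NeMo 2.0-style model config. The `Llama3Config8B` class and the surrounding field names are assumptions based on Megatron-core's `TransformerConfig` (which NeMo model configs inherit), so they may differ between releases; treat this as an illustration rather than the exact location the link above points to.

```python
# A minimal sketch, not a verified recipe: it assumes a NeMo 2.0-style model
# config that inherits Megatron-core's TransformerConfig, where tp_comm_overlap
# and tp_comm_bootstrap_backend are plain dataclass fields.
from nemo.collections import llm

config = llm.Llama3Config8B(
    tensor_model_parallel_size=2,      # TP overlap only matters with TP > 1
    sequence_parallel=True,            # userbuffer overlap is typically used with sequence parallelism
    tp_comm_overlap=True,              # enable tensor-parallel communication overlap
    tp_comm_bootstrap_backend="nccl",  # bootstrap the userbuffers with NCCL instead of MPI
)
```

With the NCCL backend, the userbuffer setup should no longer require launching the job under an MPI launcher such as mpirun, which is usually why MPI shows up as a dependency for this feature.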
Answer selected by ashors1