Hello,
and using a smaller model, because I noticed this in the NeMo code: NeMo/nemo/lightning/_strategy_lib.py, line 92 at dc08edd. I am not very familiar with training or HPC applications, but why does this feature require MPI rather than NCCL? I read this blog post to try to understand what tensor parallel communication overlap is, but I can't figure out why MPI is needed for it. Thanks!
Answered by ashors1 on May 2, 2025
Replies: 1 comment
Hi, MPI is used by default to bootstrap the user buffers (see the TransformerEngine documentation here). However, NCCL bootstrap should also be supported now. You can try setting `tp_comm_bootstrap_backend="nccl"` here.
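For reference, a minimal sketch of where that setting could go in a NeMo 2.0-style model config. The `Llama3Config8B` class and the surrounding field names are assumptions based on Megatron-core's `TransformerConfig` (which NeMo model configs inherit), so they may differ between releases; treat this as an illustration rather than the exact location the link above points to.

```python
# A minimal sketch, not a verified recipe: it assumes a NeMo 2.0-style model
# config that inherits Megatron-core's TransformerConfig, where tp_comm_overlap
# and tp_comm_bootstrap_backend are plain dataclass fields.
from nemo.collections import llm

config = llm.Llama3Config8B(
    tensor_model_parallel_size=2,      # TP overlap only matters with TP > 1
    sequence_parallel=True,            # userbuffer overlap is typically used with sequence parallelism
    tp_comm_overlap=True,              # enable tensor-parallel communication overlap
    tp_comm_bootstrap_backend="nccl",  # bootstrap the userbuffers with NCCL instead of MPI
)
```

With the NCCL backend, the userbuffer setup should no longer require launching the job under an MPI launcher such as mpirun, which is usually why MPI shows up as a dependency for this feature.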
Answer selected by ashors1