NVIDIA Megatron-LM Discussions
[QUESTION] What is the internal difference in training when setting only "fp8-format" versus setting "fp8-format" + "bf16"?
(stale: no activity in 60 days on issue or PR)
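For context on the question, a minimal sketch of the two launch configurations being compared. This assumes a typical Megatron-LM `pretrain_gpt.py` launch via `torchrun`; the specific model and data arguments are elided, and the interpretation of the flags (FP8 GEMMs via Transformer Engine, with the remaining tensors in the default dtype unless `--bf16` switches them to bfloat16) is a reading of the documented options, not a definitive statement of the internals.

```shell
# Sketch of the two configurations in question (assumed typical launch):
#
# 1) FP8 GEMMs only; non-FP8 tensors stay in the default dtype.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --fp8-format hybrid \
    ...  # remaining model/data arguments elided

# 2) FP8 GEMMs plus bfloat16 for the non-FP8 activations/gradients.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --fp8-format hybrid \
    --bf16 \
    ...  # remaining model/data arguments elided
```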
[QUESTION] Why does Megatron-LM use the gloo backend when creating parallel groups?
(stale: no activity in 60 days on issue or PR)
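A minimal sketch of the property the question is likely about: unlike NCCL, the gloo backend supports collectives on CPU tensors, so a framework can keep a gloo group alongside its NCCL groups for CPU-side communication (the single-process world size below is for illustration only; the stated motivation is an assumption, not a confirmed description of Megatron-LM's internals).

```python
import torch
import torch.distributed as dist

# gloo works on CPU tensors, which NCCL does not; this is the usual
# reason a training framework keeps a gloo group next to NCCL groups.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29617",  # assumed free local port
    rank=0,
    world_size=1,
)

t = torch.tensor([1.0, 2.0])  # a CPU tensor: fine under gloo
dist.all_reduce(t)            # sums across ranks (trivial with 1 rank)
print(t.tolist())             # unchanged, since world_size == 1
dist.destroy_process_group()
```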
[QUESTION] Which torch version works with ring_exchange?
(stale: no activity in 60 days on issue or PR)
[QUESTION] Training Mixtral 8x7B on 16 x H100 achieves only 130 TFLOPS of throughput
(stale: no activity in 60 days on issue or PR)
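To put the reported number in perspective, a back-of-envelope model FLOP utilization (MFU) calculation. The 989 TFLOPS peak is an assumption (H100 SXM dense BF16; PCIe parts are lower), and the "well-tuned" range cited in the comment is a rough rule of thumb, not a benchmark result.

```python
# Back-of-envelope MFU for the reported per-GPU throughput.
achieved_tflops_per_gpu = 130
h100_peak_bf16_tflops = 989   # assumed: H100 SXM dense BF16 peak

mfu = achieved_tflops_per_gpu / h100_peak_bf16_tflops
print(f"MFU ~ {mfu:.1%}")     # roughly 13%, well below the ~35-45%
                              # often seen for well-tuned dense training
```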