
[QUESTION] Training Llama3 70B on 16 x A100 achieves only a low throughput of 20 TFLOPS #1000

@ZeroAGI

Description

Your question

Machine: 2 nodes × 8 A100 (16 GPUs total)

TP=8
PP=2
DP=1
CP=1
seq_length=4096
micro_batch_size=1
global_batch_size=1

Enabled: activation recomputation, FlashAttention, and the distributed optimizer.

Megatron version: core_v0.7.0
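
For reproducibility, here is roughly the same setup expressed as Megatron-LM launch flags. This is a sketch, not my exact command: the script name (pretrain_gpt.py), the torchrun rendezvous flags, and the omitted model/data/checkpoint arguments are placeholders; the flag spellings follow the core_v0.7.0-era argument parser.

```python
# Sketch: the reported parallelism setup as Megatron-LM launch flags.
# pretrain_gpt.py and the torchrun node flags are placeholders; model,
# data, and checkpoint args are omitted.
megatron_args = [
    "--tensor-model-parallel-size", "8",    # TP=8
    "--pipeline-model-parallel-size", "2",  # PP=2
    "--context-parallel-size", "1",         # CP=1 (with 16 GPUs, DP=1 follows)
    "--seq-length", "4096",
    "--micro-batch-size", "1",
    "--global-batch-size", "1",             # => exactly one microbatch per step
    "--recompute-activations",              # activation recomputation
    "--use-flash-attn",                     # FlashAttention
    "--use-distributed-optimizer",          # distributed optimizer
]

launch = (
    ["torchrun", "--nnodes", "2", "--nproc_per_node", "8", "pretrain_gpt.py"]
    + megatron_args
)
print(" ".join(launch))
```

Note that with global_batch_size=1, micro_batch_size=1, and DP=1, the number of microbatches per step is 1 / (1 × 1) = 1, so the PP=2 pipeline runs with a single microbatch per step.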

Thanks for your help!
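
For context on why the 20 TFLOPS figure in the title reads as low, here is a minimal sketch of the model-FLOPS-utilization (MFU) arithmetic. It assumes the figure is per GPU (Megatron logs throughput as TFLOP/s per GPU) and uses the A100's dense BF16 peak of 312 TFLOPS:

```python
# Sketch: MFU implied by the reported throughput.
# Assumptions: 20 TFLOPS is a per-GPU figure; A100 dense BF16 peak = 312 TFLOPS.
A100_BF16_PEAK_TFLOPS = 312.0
achieved_tflops_per_gpu = 20.0  # reported in the title

mfu = achieved_tflops_per_gpu / A100_BF16_PEAK_TFLOPS
print(f"MFU ≈ {mfu:.1%}")  # ≈ 6.4%; tuned Megatron runs commonly report 40%+
```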
