**Your question**

Machine: 2 nodes × 8 A100 GPUs

- TP=8
- PP=2
- DP=1
- CP=1
- seq_length=4096
- micro_batch_size=1
- global_batch_size=1
- Activation recomputation, flash attention, and the distributed optimizer are enabled.

Megatron version: core_v0.7.0

Thanks for your help!
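
For reference, the layout above can be sanity-checked with a short sketch. The helper `check_layout` is hypothetical and not part of Megatron. It only encodes two relations that are assumed to hold here: the world size equals TP × PP × DP × CP, and the number of microbatches per step equals global_batch_size / (micro_batch_size × DP).

```python
def check_layout(num_gpus, tp, pp, dp, cp, micro_batch_size, global_batch_size):
    """Hypothetical helper: validate a parallel layout and return microbatches per step."""
    # Every GPU must belong to exactly one (TP, PP, DP, CP) coordinate.
    if tp * pp * dp * cp != num_gpus:
        raise ValueError(
            f"TP*PP*DP*CP = {tp * pp * dp * cp} does not match {num_gpus} GPUs"
        )
    # Samples consumed by one forward/backward pass across all data-parallel ranks.
    samples_per_pass = micro_batch_size * dp
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * DP"
        )
    return global_batch_size // samples_per_pass


# Values taken from the configuration above: 2 nodes x 8 GPUs.
num_microbatches = check_layout(
    num_gpus=2 * 8,
    tp=8,
    pp=2,
    dp=1,
    cp=1,
    micro_batch_size=1,
    global_batch_size=1,
)
print(num_microbatches)
```

Under these assumptions the layout is consistent with 16 GPUs and yields one microbatch per step, so the two pipeline stages have no other microbatch to overlap within a step.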