Stage 2 training with CP>1

Hi, my understanding is that `calculate_per_token_loss: false` is required for the sample-level loss used in Stage 2 of the paper. However, this setting seems to require `CP==1`. Does this mean that Nemotron at 512k was trained without CP?
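
To make clear which distinction I mean, here is a minimal sketch of the two normalizations as I understand them. It is plain Python over toy per-token loss values, not the framework's actual code; the function names and the batch are made up for illustration.

```python
def per_token_loss(samples):
    """Average over every token in the batch, so long samples weigh more."""
    total = sum(sum(s) for s in samples)
    count = sum(len(s) for s in samples)
    return total / count


def per_sample_loss(samples):
    """Average each sample over its own tokens, then average over samples."""
    return sum(sum(s) / len(s) for s in samples) / len(samples)


# Hypothetical per-token losses: one short sample and one long sample.
batch = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]

print(per_token_loss(batch))   # the long sample dominates
print(per_sample_loss(batch))  # both samples count equally
```

My guess is that the difficulty with CP>1 is that each sample's tokens are split across ranks, so the per-sample average would need loss sums and token counts combined across the CP group, but I may be wrong about that.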