Hi, it is my understanding that calculate_per_token_loss: false is required for sample-level loss (Stage 2 in the paper). However, it seems that this setting requires CP==1. Does this mean that Nemotron at 512k was trained without CP?