- [ ] advantage normalization (batch level)
- [ ] zero-variance filtering
- [ ] FP32 lm head (train and gen)
- [ ] CISPO
- [ ] no-positive resampling (data curriculum: skip >=0.9 pass rate)

cc/ @parthchadha
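
For discussion, a rough sketch of what the first two items could look like. This is only one possible reading, not the planned implementation: it assumes rewards arrive as a `(num_prompts, samples_per_prompt)` array, that "batch level" means centering on the per-prompt mean and scaling by the standard deviation of the whole batch, and that zero-variance filtering drops prompts whose samples all got the same reward. The function names and `eps` are hypothetical.

```python
import numpy as np


def filter_zero_variance(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Boolean mask over prompts whose sampled rewards are not all identical.

    rewards: shape (num_prompts, samples_per_prompt). Assumed layout.
    """
    return rewards.std(axis=1) > eps


def batch_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center on the per-prompt mean, then scale by the batch-wide std.

    Assumption: one interpretation of "batch level" normalization.
    """
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (centered.std() + eps)


rewards = np.array(
    [
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 1.0, 1.0, 1.0],  # every sample correct: no learning signal
        [0.0, 0.0, 0.0, 1.0],
    ]
)

keep = filter_zero_variance(rewards)
advantages = batch_normalized_advantages(rewards[keep])
```

Filtering before normalizing keeps the all-equal groups from shrinking the batch-wide standard deviation; whether that ordering is what we want is open.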