We have a gap in our CI for the DPO recipes which we should address. To do this, we should:
- Verify the algorithm's correctness against a reference implementation (this may be overkill, but it has been a while since the recipe was originally contributed). This is also an opportunity to check that it works well on other datasets. cc @RdoubleA RE Anthropic HH
- Run the recipe with mock inputs/models to obtain reference loss values.
- Write a test which ensures that when the recipe is launched with those inputs/models, it produces the expected loss values. This should be enough to guard the recipe against future changes that would otherwise require re-verifying correctness.
See the similar recipe tests for PPO, LoRA fine-tuning, etc.
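For the reference-implementation check, the standard DPO loss could serve as the comparison point. The sketch below is purely illustrative: the log-probabilities and `beta` value are made up, and this is not the recipe's actual interface, just a minimal pure-Python reference to sanity-check loss values against.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Reference DPO loss for a single (chosen, rejected) pair:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    logits = beta * (
        (policy_chosen_logp - policy_rejected_logp)
        - (ref_chosen_logp - ref_rejected_logp)
    )
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Regression-style check: when the policy log-ratio equals the reference
# log-ratio, the loss is exactly log(2). All inputs here are illustrative.
assert abs(dpo_loss(-1.0, -2.0, -1.5, -2.5) - math.log(2)) < 1e-9

# A larger policy margin over the reference should strictly lower the loss.
assert dpo_loss(-1.0, -3.0, -1.5, -2.5) < math.log(2)
```

A recipe-level test would do the analogous thing end to end: launch the recipe on the mocked inputs/models and compare the logged losses to pinned expected values within a tolerance.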