Checkpointing is critical for bringing fault tolerance to large-scale, long-running training jobs.
As of today, Titan’s checkpointer lets us save and resume model weights, the optimizer, and the LR scheduler.
However, this solution does not cover other components of the RL loop (e.g. dataset state), and it is not generalizable to other trainers.
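As a rough illustration of what a more general interface could look like, here is a minimal sketch using PyTorch’s `torch.distributed.checkpoint` (DCP) and its `Stateful` protocol, under which any component that implements `state_dict()` / `load_state_dict()` can be checkpointed uniformly. The `DataloaderState` wrapper and the `save_checkpoint` / `load_checkpoint` helpers below are hypothetical names for this sketch, not Titan’s actual API.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.stateful import Stateful


class DataloaderState(Stateful):
    """Hypothetical wrapper that makes a dataset's read position resumable."""

    def __init__(self) -> None:
        self.samples_consumed = 0  # e.g. number of samples already served

    def state_dict(self) -> dict:
        return {"samples_consumed": self.samples_consumed}

    def load_state_dict(self, state_dict: dict) -> None:
        self.samples_consumed = state_dict["samples_consumed"]


def save_checkpoint(path, model, optimizer, lr_scheduler, dataloader_state):
    # DCP calls state_dict() on Stateful values at save time, so the
    # dataset state is captured alongside the usual training components.
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "lr_scheduler": lr_scheduler.state_dict(),
        "dataloader": dataloader_state,
    }
    dcp.save(state, checkpoint_id=path)


def load_checkpoint(path, model, optimizer, lr_scheduler, dataloader_state):
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "lr_scheduler": lr_scheduler.state_dict(),
        "dataloader": dataloader_state,
    }
    # DCP loads plain dicts in place and calls load_state_dict() on
    # Stateful values; push the loaded dicts back into the live objects.
    dcp.load(state, checkpoint_id=path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    lr_scheduler.load_state_dict(state["lr_scheduler"])
```

The design point of the sketch is that nothing in `save_checkpoint` / `load_checkpoint` is trainer-specific: any trainer whose components expose the `Stateful` contract could reuse the same save/resume path.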