[Main] Comprehensive Checkpointing #619

@JenniferWang

Checkpointing is critical for bringing fault tolerance to large, long-running training jobs.
As of today, we can save and resume model weights, optimizer state, and the LR scheduler via Titan’s checkpointer.
However, this solution does not cover other components in the RL loop (e.g. the dataset) and does not generalize to other trainers.
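
To make the gap concrete, here is a minimal sketch of one possible direction, not a proposed design and not Titan's API. It assumes every component in the loop exposes a `state_dict()` / `load_state_dict()` pair, so a single registry can save and restore all of them regardless of the trainer. The names `Stateful`, `CountingDataset`, and `CheckpointRegistry` are hypothetical and exist only for illustration; JSON is used purely to keep the example self-contained.

```python
import json
import os
import tempfile
from typing import Any, Dict, Protocol


class Stateful(Protocol):
    """Hypothetical interface: anything that can export and restore its state."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state: Dict[str, Any]) -> None: ...


class CountingDataset:
    """Toy dataset (illustration only) that tracks how many samples it has served."""

    def __init__(self, samples):
        self.samples = list(samples)
        self.cursor = 0

    def next_sample(self):
        sample = self.samples[self.cursor % len(self.samples)]
        self.cursor += 1
        return sample

    def state_dict(self) -> Dict[str, Any]:
        return {"cursor": self.cursor}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.cursor = state["cursor"]


class CheckpointRegistry:
    """Hypothetical registry that checkpoints every registered component together."""

    def __init__(self) -> None:
        self._components: Dict[str, Stateful] = {}

    def register(self, name: str, component: Stateful) -> None:
        self._components[name] = component

    def save(self, path: str) -> None:
        payload = {name: c.state_dict() for name, c in self._components.items()}
        # Write to a temp file in the same directory, then rename over the target,
        # so a crash mid-write does not leave a truncated checkpoint at `path`.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_path, path)

    def load(self, path: str) -> None:
        with open(path) as f:
            payload = json.load(f)
        for name, component in self._components.items():
            component.load_state_dict(payload[name])
```

Under these assumptions, a trainer would register its dataset alongside the model, optimizer, and LR scheduler, call `save` periodically, and call `load` on restart so the dataset resumes from the same position instead of replaying from the beginning.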
