
Add Checkpointing / Resume Functionality for Experiments #3

@AWbosman

Background

Currently, if an experiment run is interrupted (e.g., by manual cancellation, a node going down, or another unexpected failure), it cannot be resumed: the only option is to restart everything from scratch.
This wastes time and compute, especially for large-scale experiments.

Proposal

Implement a checkpointing / resume system that allows VERONA to continue experiments from where they left off.

Key ideas:

  • Track which experiment instances have already been completed (e.g., store this in a results CSV or a lightweight database).
  • On restart, check which instances are missing or incomplete.
  • Continue execution only for the incomplete instances.
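
The key ideas above could be sketched roughly as follows, using the results-CSV option. This is only an illustration under assumptions, not VERONA's actual API: the names `load_completed`, `run_with_resume`, `run_instance`, and the CSV columns `instance_id`, `status`, and `result` are all hypothetical.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical column layout for the results file (not VERONA's real schema).
FIELDS = ["instance_id", "status", "result"]


def load_completed(results_csv: Path) -> set:
    """Return the IDs of instances already recorded as done."""
    if not results_csv.exists():
        return set()
    with results_csv.open(newline="") as f:
        return {
            row["instance_id"]
            for row in csv.DictReader(f)
            if row.get("status") == "done"
        }


def run_with_resume(
    instances: Iterable[str],
    run_instance: Callable[[str], float],
    results_csv: Path,
) -> list:
    """Run only the instances that are missing from the results file.

    Returns the IDs that were actually executed in this call.
    """
    completed = load_completed(results_csv)
    # Write the header only when starting a fresh (or empty) results file.
    write_header = not results_csv.exists() or results_csv.stat().st_size == 0
    executed = []
    with results_csv.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for instance_id in instances:
            if instance_id in completed:
                continue  # finished in an earlier run, skip it
            result = run_instance(instance_id)
            # Record the instance only after it has finished.
            writer.writerow(
                {"instance_id": instance_id, "status": "done", "result": result}
            )
            f.flush()  # push the row out so an interruption loses at most one instance
            executed.append(instance_id)
    return executed
```

Calling `run_with_resume` a second time with the same instance list and results file would then skip everything already marked as done. A lightweight database (e.g., SQLite) could replace the CSV if concurrent writers or partially written rows become a concern.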

Benefits

  • Saves time by avoiding re-running completed instances.
  • Makes VERONA more robust to interruptions and cluster instability.
  • Improves user experience for long-running experiments.


Labels: enhancement (New feature or request)
