Loosening the branching constraint #3

@christopher-beckham

Hi,

I have a question about how branching works in kleio, and whether it is possible to loosen this constraint. I'll give an example.

Suppose I have an experiment, experiment1, with the hash aaabbb. When I run this experiment, it periodically dumps checkpoints and output files to, say, $ROOT/aaabbb/results. Suppose I have already trained this experiment for 50 epochs (so the results folder contains a checkpoint file for epoch 50).

Now suppose that over the next few days I commit various changes to parts of the code. These changes might have little or no impact on experiment aaabbb (e.g. I add new files irrelevant to that experiment, or change the way metrics are logged). If I try to resume experiment aaabbb and train it for another 50 epochs, I won't be allowed to, since the environment has changed. If I resume aaabbb using something like --allow-any-change, it will simply branch off that experiment (suppose the branched experiment is called cccddd) and start over. This means the experiment is actually run again from scratch, since cccddd will not inherit the checkpoint file from aaabbb. This leaves us with a few options:

  • (1) Allow the user to sin and let them resume aaabbb without branching (but maybe for peace of mind, somehow let the user know that these are 'dirty' experiments, so that they can separate them from the truly reproducible experiments ;) )
  • (2) Force the user to resume aaabbb outside of using kleio. But this means you'll have to account for the fact that kleio sets the working directory to be $ROOT/<id> instead of $ROOT. (I don't really like this one)
  • (3) Have the branching code copy the contents of $ROOT/<id> to $ROOT/<id_branched>, though this might use a lot of disk space if $ROOT/<id> already has a ton of stuff in it (like big checkpoint files).
  • (4) Modify the command line for the branched experiment and add something like --resume_from=$ROOT/aaabbb/epoch_50.pkl. Maybe this is a reasonable one? I assume this is already possible. Though (1) would definitely be the most convenient for me (I may not always want to spread my experiment over multiple branches).
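
To make (4) concrete, here is a rough sketch of what I have in mind. It assumes checkpoints are named epoch_<N>.pkl as in the example above, and that the training script accepts a --resume_from flag. The helper names (latest_checkpoint, with_resume_flag) are hypothetical and are not part of kleio.

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence

# Assumed naming scheme, taken from the example above: epoch_<N>.pkl
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pkl$")


def latest_checkpoint(experiment_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if absent."""
    if not experiment_dir.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in experiment_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


def with_resume_flag(argv: Sequence[str], parent_dir: Path) -> List[str]:
    """Append --resume_from=<checkpoint> (hypothetical flag) when the parent
    experiment directory contains a checkpoint; otherwise leave argv as is."""
    checkpoint = latest_checkpoint(parent_dir)
    if checkpoint is None:
        return list(argv)
    return list(argv) + [f"--resume_from={checkpoint}"]
```

The branched experiment cccddd would then be launched with the extended command line, so nothing has to be copied out of the directory for aaabbb.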

Thanks!
