Hi,
I have a question about how branching works here, and whether it is possible to relax this constraint. I'll give an example.
Suppose I have some experiment `experiment1`, with the hash `aaabbb`. When I run this experiment, it will periodically dump checkpoints and output files to, say, `$ROOT/aaabbb/results`. Suppose I have trained this experiment for 50 epochs already (and therefore the results folder contains a checkpoint file for epoch 50).
Now suppose in the next few days I decide to commit various changes to parts of the code. These code changes could be things which impact very little / nothing in experiment `aaabbb` (e.g. suppose I add new files irrelevant to that experiment, or change the way metrics are logged). If I decide to resume experiment `aaabbb` and train it for another 50 epochs, it won't allow me to, since the environment has changed. If I decide to resume `aaabbb` using something like `--allow-any-change`, then it will simply branch off that experiment and start over again (suppose the branched experiment is called `cccddd`). This means that the experiment will actually be run again from scratch, since `cccddd` is definitely not going to inherit the checkpoint file from `aaabbb`. This leaves us with a few options:
- (1) Allow the user to sin and let them resume `aaabbb` without branching (but maybe for peace of mind, somehow let the user know that these are 'dirty' experiments, so that they can separate them from the truly reproducible experiments ;) )
- (2) Force the user to resume `aaabbb` outside of using `kleio`. But this means you'll have to account for the fact that `kleio` sets the working directory to be `$ROOT/<id>` instead of `$ROOT`. (I don't really like this one)
- (3) Have the branching code copy the contents of `$ROOT/<id>` to `$ROOT/<id_branched>`, though this might use a lot of disk space if `$ROOT/<id>` already has a ton of stuff in it (like big checkpoint files).
- (4) Modify the command line for the branched experiment and add something like `--resume_from=$ROOT/aaabbb/epoch_50.pkl`. Maybe this is a reasonable one? I assume this is already possible.

Though (1) would definitely be the most convenient for me (I may not always want to spread my experiment over multiple branches).
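To make option (4) a bit more concrete, here is a rough sketch of what I imagine on the training-script side. Everything in it is an assumption on my part: `--resume_from` is a flag of my own script (not something `kleio` provides, as far as I know), the `epoch_<N>.pkl` naming and the pickle format are just my convention, and `latest_checkpoint`, `build_parser` and `load_start_state` are hypothetical helper names.

```python
import argparse
import pickle
import re
from pathlib import Path

# Assumed naming convention for checkpoints: epoch_<N>.pkl
CHECKPOINT_PATTERN = re.compile(r"epoch_(\d+)\.pkl$")


def latest_checkpoint(results_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    found = []
    for path in Path(results_dir).glob("epoch_*.pkl"):
        match = CHECKPOINT_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Compare by epoch number, so epoch_50 beats epoch_9.
    return max(found, key=lambda item: item[0]) if found else None


def build_parser():
    """Command line of the (hypothetical) training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100)
    # Hypothetical flag: path to a checkpoint written by a parent experiment.
    parser.add_argument("--resume_from", type=Path, default=None)
    return parser


def load_start_state(resume_from):
    """Return (start_epoch, state); state is None when starting from scratch."""
    if resume_from is None:
        return 0, None
    match = CHECKPOINT_PATTERN.search(resume_from.name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {resume_from.name}")
    with open(resume_from, "rb") as f:
        state = pickle.load(f)
    return int(match.group(1)), state


if __name__ == "__main__":
    args = build_parser().parse_args()
    start_epoch, state = load_start_state(args.resume_from)
    for epoch in range(start_epoch + 1, args.epochs + 1):
        pass  # train one epoch, then dump epoch_<epoch>.pkl
```

With something like this, the branched experiment `cccddd` would be launched with `--resume_from=$ROOT/aaabbb/epoch_50.pkl` and would continue at epoch 51, writing its own checkpoints under its own directory without copying anything from `aaabbb`.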
Thanks!