Description
Context
The current model, trained with the configuration below, produces very promising results.
Overall configuration (not exhaustive):
- region-based training with a minimal number of classes
- multichannel input with phase + magnitude (a sketch of the channel assembly follows this list)
- initial LR = 0.001
- LR scheduler = cosine
- optimizer = AdamW
- magnitude preprocessing: light CLAHE
- phase preprocessing: heavy contrast enhancement
- augmentation: light spatial augmentation
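For context, a minimal sketch of how the two input channels could be assembled, assuming skimage for the CLAHE step and a simple percentile stretch for the phase contrast enhancement (the actual preprocessing functions in the pipeline may differ):

import numpy as np
from skimage import exposure

def build_input_channels(mag_slice, phase_slice):
    """Stack preprocessed magnitude and phase into a (2, H, W) input array."""
    # Rescale the magnitude to [0, 1] first, then apply light CLAHE
    mag = exposure.rescale_intensity(mag_slice.astype(np.float64), out_range=(0.0, 1.0))
    mag = exposure.equalize_adapthist(mag, clip_limit=0.01)
    # Heavy contrast enhancement on the phase via percentile stretching
    p_low, p_high = np.percentile(phase_slice, (1, 99))
    phase = exposure.rescale_intensity(phase_slice.astype(np.float64),
                                       in_range=(p_low, p_high), out_range=(0.0, 1.0))
    return np.stack([mag, phase], axis=0)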
Most of these hyperparameters were chosen based on intuition and personal experience. We'd like to run ablation studies to understand which hyperparameters really improve the results. We would also like to investigate possible new sources of improvement, such as (not exhaustive):
- multichannel with mag & phase and with adjacent slices (see the sketch after this list)
- stronger augmentations
- stronger/weaker/no preprocessing
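For the adjacent-slices idea, the stacking could look roughly like this 2.5D sketch (slice offsets and border handling here are assumptions, not what is implemented):

import numpy as np

def stack_adjacent_slices(volume, z, offsets=(-1, 0, 1)):
    """Build a (len(offsets), H, W) input from neighbouring slices of a (Z, H, W) volume."""
    # Clamp out-of-range neighbours to the first/last slice (edge replication)
    indices = [min(max(z + o, 0), volume.shape[0] - 1) for o in offsets]
    return np.stack([volume[i] for i in indices], axis=0)

Doing this for both the magnitude and phase volumes and concatenating along the channel axis would give a 6-channel input.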
Goal
To make it possible to draw conclusions from experiments with different hyperparameters, we would like to make the results as reproducible as possible. The best way to do that is to enable deterministic training.
Issue
nnU-Net does not currently support deterministic training (see issue 1423 of the nnunet repo), and no change is planned regarding that matter.
Current investigation
I added the basic seeding that should make training deterministic, but it isn't yet:
import os
import random

import numpy as np
import torch

seed = 42
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
# I'm on MPS, so no CUDA-specific seeding is needed
torch.use_deterministic_algorithms(True)
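For completeness, a couple of extra switches are usually needed for full determinism on CUDA machines. They don't apply on MPS, but I'm noting them here in case the ablations run on a GPU server (standard PyTorch settings, not yet tested on this codebase):

import os
import torch

# cuDNN may otherwise select non-deterministic convolution kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Required by torch.use_deterministic_algorithms(True) for some cuBLAS ops
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'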
I then noticed that even the dataloaders were not deterministic, so I changed the augmenter construction to pass explicit per-worker seeds:
if allowed_num_processes == 0:
    mt_gen_train = SingleThreadedAugmenter(dl_tr, None)
    mt_gen_val = SingleThreadedAugmenter(dl_val, None)
else:
    train_seeds = [MASTER_SEED + i for i in range(allowed_num_processes)]
    num_val_processes = max(1, allowed_num_processes // 2)
    val_seeds = [MASTER_SEED + 1000 + i for i in range(num_val_processes)]  # use an offset to avoid overlap with train seeds
    mt_gen_train = MultiThreadedAugmenter(data_loader=dl_tr, transform=None,
                                          num_processes=allowed_num_processes,
                                          num_cached_per_queue=max(6, allowed_num_processes // 2),
                                          seeds=train_seeds,
                                          pin_memory=self.device.type == 'cuda',
                                          wait_time=0.002)
    mt_gen_val = MultiThreadedAugmenter(data_loader=dl_val, transform=None,
                                        num_processes=num_val_processes,
                                        num_cached_per_queue=max(3, allowed_num_processes // 4),
                                        seeds=val_seeds,
                                        pin_memory=self.device.type == 'cuda',
                                        wait_time=0.002)
Now the dataloaders are deterministic, but the training still isn't.
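One way to back this claim up is to fingerprint the first few batches of two freshly built generators and compare the hashes; roughly like this (a sketch, assuming the generators yield dicts containing a 'data' array as nnU-Net's dataloaders do):

import hashlib
import numpy as np

def batch_fingerprint(generator, n_batches=5):
    """Hash the 'data' arrays of the first n batches so two runs can be compared."""
    h = hashlib.sha256()
    for _ in range(n_batches):
        batch = next(generator)
        h.update(np.ascontiguousarray(batch['data']).tobytes())
    return h.hexdigest()

# Two generators built with the same seeds should give identical fingerprints
# assert batch_fingerprint(mt_gen_train_a) == batch_fingerprint(mt_gen_train_b)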
Next step
- Remove augmentation, as it may be the source of non-determinism (a helper to compare two runs is sketched below)
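To localize where two supposedly identical runs diverge, comparing checkpoints parameter by parameter can help. The sketch below assumes plain state_dicts saved with torch.save; nnU-Net's checkpoint format wraps the weights, so the loading step would need adapting:

import torch

def first_divergence(state_dict_a, state_dict_b):
    """Return the name of the first parameter whose values differ between two runs."""
    for name, tensor_a in state_dict_a.items():
        if not torch.equal(tensor_a, state_dict_b[name]):
            return name
    return None

# Example: after two short runs with identical seeds
# diff = first_divergence(torch.load('run_a.pth'), torch.load('run_b.pth'))
# print('first diverging parameter:', diff or 'none - runs match')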
Possible workaround
I'm currently training 10 models (6/10 done right now) for 40 epochs each to get an estimate of the variability of the loss and metrics during training. If we cannot make nnU-Net deterministic, we can use this as a baseline to better understand the impact of each hyperparameter. Note: this would involve training several times (for 40 epochs) for each hyperparameter modification, which is very time-consuming and not ideal.
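Once those runs are finished, the run-to-run noise can be summarized as a per-epoch mean and standard deviation, and a hyperparameter change would only be called an improvement if it moves the metric well outside that band. A minimal aggregation sketch (log parsing omitted; the metric values are assumed to be collected already):

import numpy as np

def variability_band(runs):
    """runs: one list of per-epoch metric values (e.g. pseudo Dice) per training run."""
    values = np.asarray(runs)            # shape (n_runs, n_epochs)
    mean = values.mean(axis=0)
    std = values.std(axis=0, ddof=1)     # unbiased std across runs
    return mean, std

# A candidate configuration would then be compared against mean[-1] +/- 2 * std[-1]
# at the final epoch before concluding that the hyperparameter actually helps.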