Bug: shape mismatch in init_mask when assigning template to X[cmask]
Summary
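During training, init_mask intermittently raises a RuntimeError (shape mismatch) when assigning template to X[cmask]: the number of rows in template does not match the number of positions selected by cmask.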
Command
python train.py --task struct_prediction --gpus 0 1 2 3 --wandb_offline 1 --flexible 1 --model_type dyAb --ex_name dyAb_struct_pred_2 --module_type 1 --batch_size 64
Error (exact)
RuntimeError: shape mismatch: value tensor of shape [4411, 14, 3] cannot be broadcast to indexing result of shape [4413, 14, 3]
[rank: 3] Child process with PID 2759890 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
Killed
Where
.../models/dyMEAN/dyAb_model.py, in init_mask:
X[cmask] = template
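For reference, a minimal sketch of how this statement can fail, using toy sizes. It assumes X has shape [N, 14, 3] and cmask is a 1-D boolean mask over N, which matches the shapes in the error but is not confirmed from the model code. The helper name assign_template is hypothetical.

```python
import torch


def assign_template(X, cmask, template):
    """Mimic the failing statement on a copy of X (hypothetical helper)."""
    X = X.clone()
    # Boolean-mask assignment: template must supply one row per True in cmask.
    X[cmask] = template
    return X


# Toy sizes (assumption): 6 positions, 14 atoms each, 3 coordinates.
X = torch.zeros(6, 14, 3)
cmask = torch.tensor([True, True, False, True, True, False])  # selects 4 rows

# Row counts match (4 and 4): the assignment succeeds.
ok = assign_template(X, cmask, torch.ones(4, 14, 3))

# Row counts differ (4 selected, 3 supplied): the assignment raises.
try:
    assign_template(X, cmask, torch.ones(3, 14, 3))
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)
```

The assignment only works when template.size(0) equals the number of True entries in cmask, so any step that builds the two from different sources can trigger the error.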
Context
- 4×GPU DDP
- struct_only=True
- Occasionally cmask.sum() ≠ template.size(0); the two differ by a small number (2 in the error above: 4413 vs. 4411).
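To narrow down which batches trigger the mismatch, a guard along the following lines could be called just before the assignment. This is only a diagnostic sketch: check_mask_matches_template and its rank argument are hypothetical names, not part of the dyAb code, and it assumes cmask is a boolean tensor and template has one row per masked position.

```python
import torch


def check_mask_matches_template(cmask, template, rank=None):
    """Raise a descriptive error if cmask and template disagree (hypothetical helper)."""
    n_masked = int(cmask.sum().item())  # rows selected by X[cmask]
    n_template = template.size(0)       # rows supplied by template
    if n_masked != n_template:
        raise ValueError(
            f"[rank {rank}] cmask selects {n_masked} rows but template has "
            f"{n_template} rows (diff {n_masked - n_template}); "
            f"template shape {tuple(template.shape)}"
        )
    return n_masked


# Toy check (assumed shapes): 4 masked positions and a 4-row template.
cmask = torch.tensor([True, True, False, True, True, False])
n_rows = check_mask_matches_template(cmask, torch.ones(4, 14, 3), rank=0)
```

Calling it as check_mask_matches_template(cmask, template, rank) right before X[cmask] = template would report the offending rank and counts instead of the generic broadcast error, which may help identify the samples involved.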