Description
🐛 Bug
I am running two training jobs that use the exact same code but different h5py files (same keys) for the data. One of the HDF5 files yields a training set with 650,000,000 data points; the other has 150,000,000. The job with the larger dataset seems to just stop doing anything after several hours: CPU activity dies down, neither job is out of memory (system or GPU), but the larger job simply stops making progress (without crashing).
Any ideas what might be going on? Both jobs are still running, and I can provide any information that might be helpful.
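For context, here is a minimal sketch of the kind of setup described above (file names, keys, shapes, and hyperparameters are placeholders, not the actual training code): an h5py-backed Dataset fed to a LightningModule under DDP on 8 GPUs, with the HDF5 file opened lazily inside each DataLoader worker rather than in the main process.

```python
import h5py
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class H5Dataset(Dataset):
    """Reads samples from an HDF5 file, opening the file lazily in each worker."""

    def __init__(self, path, x_key="x", y_key="y"):
        self.path = path
        self.x_key = x_key
        self.y_key = y_key
        self._file = None
        # Read the length once and close immediately, so no open handle is
        # inherited by forked DataLoader worker processes.
        with h5py.File(path, "r") as f:
            self._len = len(f[x_key])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        # Open the file on first access inside the worker process.
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        x = torch.as_tensor(self._file[self.x_key][idx], dtype=torch.float32)
        y = torch.as_tensor(self._file[self.y_key][idx], dtype=torch.float32)
        return x, y


class LitModel(pl.LightningModule):
    def __init__(self, in_dim=128):  # placeholder feature size
        super().__init__()
        self.net = torch.nn.Linear(in_dim, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.net(x).squeeze(-1), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    loader = DataLoader(
        H5Dataset("train.h5"),  # placeholder path
        batch_size=256,
        num_workers=8,
        pin_memory=True,
    )
    trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=10)
    trainer.fit(LitModel(), loader)
```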
PyTorch Lightning Version: 1.6.3
PyTorch Version: 1.11
Python version: 3.8
OS: Ubuntu 20.04
CUDA/cuDNN version: 11.3
GPU models and configuration: 8x A100 using DDP
How you installed PyTorch (conda, pip, source): pip
If compiling from source, the output of torch.__config__.show(): N/A (installed via pip)
Any other relevant information: