Description
🐛 Bug
I am running two training jobs that use the exact same code but different h5py files (same keys) for the data. One of the HDF5 files yields a training set with 650,000,000 data points; the other has 150,000,000. The job with the larger dataset seems to just stop doing anything after several hours: CPU activity dies down, neither job is out of memory (system or GPU), but the larger job simply stops making progress (without crashing).
Any ideas what might be going on? Both jobs are still running, and I can provide any information that might be helpful.
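For context, here is a minimal sketch of the kind of setup described above (file names, keys, shapes, and hyperparameters are placeholders, not the actual training code): an h5py-backed Dataset fed to a LightningModule under DDP on 8 GPUs, with the HDF5 file opened lazily inside each DataLoader worker rather than in the main process.

```python
import h5py
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class H5Dataset(Dataset):
    """Reads samples from an HDF5 file, opening the file lazily in each worker."""

    def __init__(self, path, x_key="x", y_key="y"):
        self.path = path
        self.x_key = x_key
        self.y_key = y_key
        self._file = None
        # Read the length once and close immediately, so no open handle is
        # inherited by forked DataLoader worker processes.
        with h5py.File(path, "r") as f:
            self._len = len(f[x_key])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        # Open the file on first access inside the worker process.
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        x = torch.as_tensor(self._file[self.x_key][idx], dtype=torch.float32)
        y = torch.as_tensor(self._file[self.y_key][idx], dtype=torch.float32)
        return x, y


class LitModel(pl.LightningModule):
    def __init__(self, in_dim=128):  # placeholder feature size
        super().__init__()
        self.net = torch.nn.Linear(in_dim, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.net(x).squeeze(-1), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    loader = DataLoader(
        H5Dataset("train.h5"),  # placeholder path
        batch_size=256,
        num_workers=8,
        pin_memory=True,
    )
    trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=10)
    trainer.fit(LitModel(), loader)
```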
PyTorch Lightning Version: 1.6.3
PyTorch Version: 1.11
Python version: 3.8
OS: Ubuntu 20.04
CUDA/cuDNN version: 11.3
GPU models and configuration: 8x A100 using DDP
How you installed PyTorch (conda, pip, source): pip
If compiling from source, the output of torch.__config__.show(): N/A (installed via pip)
Any other relevant information: