
Training with Large Dataset Causes Infinite Stall #13126

Open
@fishbotics

Description

🐛 Bug

I am training two jobs that use the exact same code but different h5py files (with the same keys) to back the dataset. One of the HDF5 files yields a training set of 650,000,000 data points, the other 150,000,000. The job with the larger dataset simply stops making progress after a number of hours: CPU activity dies down, neither system nor GPU memory is exhausted, and the process does not crash; it just stops doing anything.

Any ideas what might be going on? Both are still running and I can provide any info that might be helpful.
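
For context, the datasets follow roughly this h5py-backed pattern (a minimal sketch with illustrative names, not the actual code), where the file is opened lazily in each DataLoader worker:

```python
import h5py
import torch
from torch.utils.data import Dataset


class H5Dataset(Dataset):
    """Illustrative h5py-backed dataset; the real code differs but has this shape."""

    def __init__(self, path, key="data"):
        self.path = path
        self.key = key
        self._file = None
        # Read the length once, without keeping a handle open across fork.
        with h5py.File(path, "r") as f:
            self._len = len(f[key])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        # Open the file lazily in each worker process so no h5py handle is
        # shared across forked DataLoader workers.
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        sample = self._file[self.key][idx]
        return torch.as_tensor(sample)
```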

PyTorch Lightning Version: 1.6.3
PyTorch Version: 1.11
Python version: 3.8
OS: Ubuntu 20.04
CUDA/cuDNN version: 11.3
GPU models and configuration: 8x A100, DDP
How you installed PyTorch: pip

Additional context
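
If it helps, I can grab Python stack traces from the stalled ranks, e.g. with py-spy (`py-spy dump --pid <pid>`) or by registering a faulthandler signal handler in the training script (minimal sketch below; the currently running jobs do not have this in place):

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1,
# so a stalled rank can be inspected with `kill -USR1 <pid>` without
# restarting the job.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```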

cc @Borda @akihironitta


Labels: bug, performance, waiting on author
