[Bug] OSError in DataLoader worker process 0 #602

@austinmw

Description

Branch

1.x branch (1.x version, such as v1.0.0rc2, or dev-1.x branch)

Environment

sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMEngine: 0.3.2
MMCV: 2.0.0rc3
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMSelfSup: 1.0.0rc3+6db0433

Describe the bug

I'm attempting to train MAE with MPI across 4 nodes (AWS P3dn.24xlarge instances, i.e. 8× 32 GB V100s each). Training runs successfully for a while but errors out after about 10 epochs. I've tried resuming a few times in a row, and it always crashes again after another 10-12 epochs.

Reproduces the problem - code sample

Run the command below.

Reproduces the problem - command or script

mpirun --host algo-3:8,algo-1:8,algo-2:8,algo-4:8 -np 32 --allow-run-as-root --tag-output --oversubscribe -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -mca plm_rsh_num_concurrent 4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x SMDATAPARALLEL_USE_HOMOGENEOUS=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -verbose -x NCCL_DEBUG=VERSION -x FI_EFA_USE_DEVICE_RDMA=1 -x NCCL_PROTO=simple -x SMDATAPARALLEL_SERVER_ADDR=algo-3 -x SMDATAPARALLEL_SERVER_PORT=7592 -x SAGEMAKER_INSTANCE_TYPE=ml.p3dn.24xlarge smddprun /opt/conda/bin/python3.8 -m mpi4py sagemaker_train.py --amp True --cfg-options seed=0 log_level='INFO' visualizer.save_dir='/opt/ml/checkpoints' default_hooks.logger.interval=50 train_dataloader.dataset.data_root='/opt/ml/input/data/dataset' train_dataloader.batch_size=256 auto_scale_lr.base_batch_size=512 auto_scale_lr.enable=True --config configs/selfsup/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
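
For reference, the --cfg-options overrides above correspond roughly to the following config fragment. This is a sketch based on the standard MMEngine config layout, not the verbatim contents of mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py (the seed and visualizer.save_dir overrides are omitted here).

# Sketch: assumed MMEngine config layout for the command-line overrides above.
log_level = 'INFO'

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=50),  # default_hooks.logger.interval=50
)

train_dataloader = dict(
    batch_size=256,  # per-GPU batch size
    dataset=dict(data_root='/opt/ml/input/data/dataset'),
)

auto_scale_lr = dict(
    enable=True,          # auto_scale_lr.enable=True
    base_batch_size=512,  # auto_scale_lr.base_batch_size=512
)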

Reproduces the problem - error message

[1,mpirank:1,algo-1]<stdout>:  File "sagemaker_train.py", line 172, in main
[1,mpirank:1,algo-1]<stdout>:    runner.train()
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1684, in train
[1,mpirank:1,algo-1]<stdout>:    model = self.train_loop.run()  # type: ignore
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 90, in run
[1,mpirank:1,algo-1]<stdout>:    self.run_epoch()
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 105, in run_epoch
[1,mpirank:1,algo-1]<stdout>:    for idx, data_batch in enumerate(self.dataloader):
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 688, in __next__
[1,mpirank:1,algo-1]<stdout>:    (data, worker_id) = self._next_data()
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1449, in _next_data
[1,mpirank:1,algo-1]<stdout>:    return (self._process_data(data), w_id)
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1475, in _process_data
[1,mpirank:1,algo-1]<stdout>:    data.reraise()
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
[1,mpirank:1,algo-1]<stdout>:    raise exception
[1,mpirank:1,algo-1]<stdout>:OSError: Caught OSError in DataLoader worker process 0.
[1,mpirank:1,algo-1]<stdout>:Original Traceback (most recent call last):
[1,mpirank:1,algo-1]<stdout>:  File "/opt/conda/lib/python3.8/site-packages/mmengine/fileio/backends/local_backend.py", line 34, in get
[1,mpirank:1,algo-1]<stdout>:    value = f.read()
[1,mpirank:1,algo-1]<stdout>:ConnectionAbortedError: [Errno 103] Software caused connection abort
[1,mpirank:1,algo-1]<stdout>:
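
The frame that raises is a plain local file read (per the traceback, LocalBackend.get boils down to open(filepath, 'rb') followed by f.read()), so my assumption is that the Errno 103 comes from the storage backing the /opt/ml/input/data mount rather than from the DataLoader itself. A minimal sketch for checking that outside of training, assuming an ImageNet-style layout under the data_root used above:

# Hypothetical check: read every image file directly, the same way
# LocalBackend.get does, to see whether Errno 103 reproduces without the
# DataLoader. The path and the *.JPEG pattern are assumptions about the
# dataset layout.
import pathlib

data_root = pathlib.Path('/opt/ml/input/data/dataset')

for i, path in enumerate(sorted(data_root.rglob('*.JPEG'))):
    try:
        with open(path, 'rb') as f:
            f.read()
    except OSError as err:  # ConnectionAbortedError is a subclass of OSError
        print(f'Read failed after {i} files at {path}: {err}')
        raise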

Additional information

No response
