
[Bug] A100 80GB PCIe gives CUDA illegal memory access if samples_per_gpu > x such that VRAM > 30 GB #9325


Description

@octavflorescu

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch, https://github.com/open-mmlab/mmdetection (v2.25.2)

Environment

sys.platform: linux
Python: 3.8.13 (default, May 26 2022, 00:40:00) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.6
MMDetection: 2.25.2+

Reproduces the problem - code sample

data = dict(
    samples_per_gpu=6,
    workers_per_gpu=1,
    train=dict(
        type='ClassBalancedDataset',
        oversample_thr=0.012,
        dataset=dict(
            type='oct_2022',
            ann_file=
            '/data/10_2022/coco_train.json',
            img_prefix=
            '/data/10_2022/v2/train_imgs/',
            filter_empty_gt=True,
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='Resize',
                    img_scale=[(1920, 1920)],
                    keep_ratio=True),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(
                    type='Albu',
                    transforms=[
                        dict(
                            type='RandomBrightnessContrast',
                            brightness_limit=[-0.2, 0.2],
                            contrast_limit=[-0.2, 0.2],
                            p=0.5),
                        dict(
                            type='OneOf',
                            transforms=[
                                dict(
                                    type='RGBShift',
                                    r_shift_limit=10,
                                    g_shift_limit=10,
                                    b_shift_limit=10,
                                    p=1.0),
                                dict(
                                    type='HueSaturationValue',
                                    hue_shift_limit=20,
                                    sat_shift_limit=30,
                                    val_shift_limit=20,
                                    p=1.0)
                            ],
                            p=0.1),
                        dict(type='ChannelShuffle', p=0.1),
                        dict(
                            type='OneOf',
                            transforms=[
                                dict(type='Blur', blur_limit=3, p=1.0),
                                dict(type='MedianBlur', blur_limit=3, p=1.0)
                            ],
                            p=0.1)
                    ],
                    bbox_params=dict(
                        type='BboxParams',
                        format='coco',
                        label_fields=['gt_labels'],
                        min_visibility=0.0,
                        filter_lost_elements=True),
                    keymap=dict(img='image', gt_bboxes='bboxes'),
                    update_pad_shape=False,
                    skip_img_without_anno=True),
                dict(type='DefaultFormatBundle'),
                dict(
                    type='Collect',
                    keys=['img', 'gt_bboxes', 'gt_labels'],
                    meta_keys=('filename', 'ori_shape', 'img_shape',
                               'img_norm_cfg', 'pad_shape', 'scale_factor'))
            ])),
)
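
As a side note, the pipeline above can be exercised on its own (CPU only) to rule out data/augmentation problems before the CUDA op is ever involved. A minimal sketch, assuming the config path from the reproduction command below and that the custom `oct_2022` dataset class is importable and registered (both are assumptions):

# Hypothetical sanity check: build the dataset from the config and pull one
# sample on CPU, to separate pipeline/annotation issues from the CUDA failure.
from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('mmdet/datasets/model_oct_2022.py')  # hypothetical path
dataset = build_dataset(cfg.data.train)

sample = dataset[0]
# After DefaultFormatBundle the values are DataContainers; unwrap with .data
print('img:', sample['img'].data.shape)
print('gt_bboxes:', sample['gt_bboxes'].data.shape)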

Reproduces the problem - command or script

CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=./:$PYTHONPATH python ./train.py /data/local/data/octavf/imagerecognition/python_modules/libraries/mmdetection_v2_25/mmdet/datasets/model_oct_2022.py --gpu-id 0 --seed 42

Reproduces the problem - error message

 --------------------
after_run:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
2022-11-15 09:03:05,255 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-11-15 09:03:05,256 - mmdet - INFO - Checkpoints will be saved to /data/coco_workdir by HardDiskBackend.
Traceback (most recent call last):
  File "./train.py", line 248, in <module>
    main()
  File "./train.py", line 237, in main
    train_detector(
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/usr/local/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/base_dense_head.py", line 330, in forward_train
    outs = self(x)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 247, in forward
    return multi_apply(self.forward_single, feats)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/ga_rpn_head.py", line 46, in forward_single
    loc_pred) = super(GARPNHead, self).forward_single(x)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 236, in forward_single
    x = self.feature_adaption(x, shape_pred)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 56, in forward
    x = self.relu(self.conv_adaption(x, offset))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 310, in forward
    out = deform_conv2d(x, offset, self.weight, self.stride, self.padding,
  File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 92, in forward
    ext_module.deform_conv_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f6f91fe61ee in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x26e61 (0x7f6f92060e61 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f6f92065db7 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x466858 (0x7f6f8abd1858 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f6f91fcd7a5 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x362735 (0x7f6f8aacd735 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x67c6c8 (0x7f6f8ade76c8 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f6f8ade7a95 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7f6f97916c87 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
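
The traceback ends in mmcv's deformable conv (ext_module.deform_conv_forward). A minimal sketch to check whether the bare op reproduces the illegal access at comparable tensor sizes, independent of the detector; the channel count, feature-map size and offset layout below are assumptions, not values read from the model:

# Hypothetical standalone repro for the deformable conv op only.
# Shapes are assumptions: ~stride-4 feature map of a 1920x1920 input,
# 256 channels, 3x3 kernel, deform_groups=1.
import torch
from mmcv.ops import DeformConv2d

device = torch.device('cuda:0')
batch, channels, h, w = 6, 256, 480, 480   # batch=6 is the failing samples_per_gpu

conv = DeformConv2d(channels, channels, kernel_size=3, padding=1).to(device)
x = torch.randn(batch, channels, h, w, device=device)
offset = torch.randn(batch, 2 * 3 * 3, h, w, device=device)  # 2*kH*kW offset channels

out = conv(x, offset)      # forward: where the reported crash happens
out.sum().backward()       # backward: also reported to crash sometimes
torch.cuda.synchronize()
print('bare DeformConv2d ran fine, output shape:', tuple(out.shape))

If this fails with batch 6 but passes with batch 4, the problem is in the op itself rather than in the training loop.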

Additional information

  1. I have tried several combinations: CUDA 11.3/11.6; torch 1.12.0/1.12.1/1.13.0; mmcv-full 1.7.0 and self-compiled (master/1.7.0).
     Self-compilation of mmcv: MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .
  2. Sometimes it crashes in the forward pass, other times in the backward pass (ext_module.deform_conv_forward / ext_module.deform_conv_backward_input).
  3. With samples_per_gpu set to 4, it does not crash.
  4. Once roughly 33 GB of VRAM is in use, the process core dumps (see the rough check after this list).
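
Crashing above a certain batch size, rather than failing with an out-of-memory error, could be consistent with a 32-bit index overflow inside the deformable-conv im2col buffer. A rough back-of-the-envelope check, assuming a 256-channel stride-4 feature map of a 1920x1920 input and a per-call buffer of batch * C_in * kH * kW * H_out * W_out elements (both the shapes and the buffer formula are assumptions):

# Hypothetical arithmetic only: does the assumed im2col buffer exceed 2**31 - 1
# elements (the int32 limit) at the failing vs. the working batch size?
INT32_MAX = 2**31 - 1

def im2col_elements(batch, c_in=256, k=3, h=480, w=480):  # assumed shapes
    return batch * c_in * k * k * h * w

for batch in (4, 6):
    n = im2col_elements(batch)
    print(f'samples_per_gpu={batch}: {n:,} elements -> '
          f'{"over" if n > INT32_MAX else "under"} int32 max ({INT32_MAX:,})')
# samples_per_gpu=4: 2,123,366,400 elements -> under int32 max
# samples_per_gpu=6: 3,185,049,600 elements -> over int32 max

Under these assumptions, samples_per_gpu=4 stays just below the limit and 6 goes over it, which would match observations 3 and 4. If this hypothesis holds, the fix would be on the mmcv kernel side (64-bit indexing or smaller per-call batches) rather than in the training config.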
