
[Bug] A100 80GB PCIe gives CUDA illegal memory access if samples_per_gpu > x such that VRAM > 30 GB #9325


Description

@octavflorescu

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch, https://github.com/open-mmlab/mmdetection (v2.25.2)

Environment

sys.platform: linux
Python: 3.8.13 (default, May 26 2022, 00:40:00) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.6
MMDetection: 2.25.2+

Reproduces the problem - code sample

data = dict(
    samples_per_gpu=6,
    workers_per_gpu=1,
    train=dict(
        type='ClassBalancedDataset',
        oversample_thr=0.012,
        dataset=dict(
            type='oct_2022',
            ann_file=
            '/data/10_2022/coco_train.json',
            img_prefix=
            '/data/10_2022/v2/train_imgs/',
            filter_empty_gt=True,
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='Resize',
                    img_scale=[(1920, 1920)],
                    keep_ratio=True),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(
                    type='Albu',
                    transforms=[
                        dict(
                            type='RandomBrightnessContrast',
                            brightness_limit=[-0.2, 0.2],
                            contrast_limit=[-0.2, 0.2],
                            p=0.5),
                        dict(
                            type='OneOf',
                            transforms=[
                                dict(
                                    type='RGBShift',
                                    r_shift_limit=10,
                                    g_shift_limit=10,
                                    b_shift_limit=10,
                                    p=1.0),
                                dict(
                                    type='HueSaturationValue',
                                    hue_shift_limit=20,
                                    sat_shift_limit=30,
                                    val_shift_limit=20,
                                    p=1.0)
                            ],
                            p=0.1),
                        dict(type='ChannelShuffle', p=0.1),
                        dict(
                            type='OneOf',
                            transforms=[
                                dict(type='Blur', blur_limit=3, p=1.0),
                                dict(type='MedianBlur', blur_limit=3, p=1.0)
                            ],
                            p=0.1)
                    ],
                    bbox_params=dict(
                        type='BboxParams',
                        format='coco',
                        label_fields=['gt_labels'],
                        min_visibility=0.0,
                        filter_lost_elements=True),
                    keymap=dict(img='image', gt_bboxes='bboxes'),
                    update_pad_shape=False,
                    skip_img_without_anno=True),
                dict(type='DefaultFormatBundle'),
                dict(
                    type='Collect',
                    keys=['img', 'gt_bboxes', 'gt_labels'],
                    meta_keys=('filename', 'ori_shape', 'img_shape',
                               'img_norm_cfg', 'pad_shape', 'scale_factor'))
            ])),
)
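
As a side note, the pipeline above can be exercised on its own (CPU only) to rule out data/augmentation problems before the CUDA op is ever involved. A minimal sketch, assuming the config path from the reproduction command below and that the custom `oct_2022` dataset class is importable and registered (both are assumptions):

# Hypothetical sanity check: build the dataset from the config and pull one
# sample on CPU, to separate pipeline/annotation issues from the CUDA failure.
from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('mmdet/datasets/model_oct_2022.py')  # hypothetical path
dataset = build_dataset(cfg.data.train)

sample = dataset[0]
# After DefaultFormatBundle the values are DataContainers; unwrap with .data
print('img:', sample['img'].data.shape)
print('gt_bboxes:', sample['gt_bboxes'].data.shape)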

Reproduces the problem - command or script

CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=./:$PYTHONPATH python ./train.py /data/local/data/octavf/imagerecognition/python_modules/libraries/mmdetection_v2_25/mmdet/datasets/model_oct_2022.py --gpu-id 0 --seed 42

Reproduces the problem - error message

 --------------------
after_run:
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) TensorboardLoggerHook
 --------------------
2022-11-15 09:03:05,255 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-11-15 09:03:05,256 - mmdet - INFO - Checkpoints will be saved to /data/coco_workdir by HardDiskBackend.
Traceback (most recent call last):
  File "./train.py", line 248, in <module>
    main()
  File "./train.py", line 237, in main
    train_detector(
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/usr/local/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/base_dense_head.py", line 330, in forward_train
    outs = self(x)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 247, in forward
    return multi_apply(self.forward_single, feats)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/ga_rpn_head.py", line 46, in forward_single
    loc_pred) = super(GARPNHead, self).forward_single(x)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 236, in forward_single
    x = self.feature_adaption(x, shape_pred)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 56, in forward
    x = self.relu(self.conv_adaption(x, offset))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 310, in forward
    out = deform_conv2d(x, offset, self.weight, self.stride, self.padding,
  File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 92, in forward
    ext_module.deform_conv_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f6f91fe61ee in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x26e61 (0x7f6f92060e61 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f6f92065db7 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x466858 (0x7f6f8abd1858 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f6f91fcd7a5 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x362735 (0x7f6f8aacd735 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x67c6c8 (0x7f6f8ade76c8 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f6f8ade7a95 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7f6f97916c87 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
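
The traceback ends in mmcv's deformable conv (ext_module.deform_conv_forward). A minimal sketch to check whether the bare op reproduces the illegal access at comparable tensor sizes, independent of the detector; the channel count, feature-map size and offset layout below are assumptions, not values read from the model:

# Hypothetical standalone repro for the deformable conv op only.
# Shapes are assumptions: ~stride-4 feature map of a 1920x1920 input,
# 256 channels, 3x3 kernel, deform_groups=1.
import torch
from mmcv.ops import DeformConv2d

device = torch.device('cuda:0')
batch, channels, h, w = 6, 256, 480, 480   # batch=6 is the failing samples_per_gpu

conv = DeformConv2d(channels, channels, kernel_size=3, padding=1).to(device)
x = torch.randn(batch, channels, h, w, device=device)
offset = torch.randn(batch, 2 * 3 * 3, h, w, device=device)  # 2*kH*kW offset channels

out = conv(x, offset)      # forward: where the reported crash happens
out.sum().backward()       # backward: also reported to crash sometimes
torch.cuda.synchronize()
print('bare DeformConv2d ran fine, output shape:', tuple(out.shape))

If this fails with batch 6 but passes with batch 4, the problem is in the op itself rather than in the training loop.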

Additional information

  1. I have tried several combinations: CUDA 11.3/11.6; torch 1.12.0/1.12.1/1.13.0; mmcv-full 1.7.0 and self-compiled (master/1.7.0).
     Self-compilation of mmcv: MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .
  2. Sometimes it crashes in the forward pass, other times in the backward pass (ext_module.deform_conv_forward / ext_module.deform_conv_backward_input).
  3. With samples_per_gpu set to 4, it does not crash.
  4. Once roughly 33 GB of VRAM is in use, the process core dumps (see the rough check after this list).
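
Crashing above a certain batch size, rather than failing with an out-of-memory error, could be consistent with a 32-bit index overflow inside the deformable-conv im2col buffer. A rough back-of-the-envelope check, assuming a 256-channel stride-4 feature map of a 1920x1920 input and a per-call buffer of batch * C_in * kH * kW * H_out * W_out elements (both the shapes and the buffer formula are assumptions):

# Hypothetical arithmetic only: does the assumed im2col buffer exceed 2**31 - 1
# elements (the int32 limit) at the failing vs. the working batch size?
INT32_MAX = 2**31 - 1

def im2col_elements(batch, c_in=256, k=3, h=480, w=480):  # assumed shapes
    return batch * c_in * k * k * h * w

for batch in (4, 6):
    n = im2col_elements(batch)
    print(f'samples_per_gpu={batch}: {n:,} elements -> '
          f'{"over" if n > INT32_MAX else "under"} int32 max ({INT32_MAX:,})')
# samples_per_gpu=4: 2,123,366,400 elements -> under int32 max
# samples_per_gpu=6: 3,185,049,600 elements -> over int32 max

Under these assumptions, samples_per_gpu=4 stays just below the limit and 6 goes over it, which would match observations 3 and 4. If this hypothesis holds, the fix would be on the mmcv kernel side (64-bit indexing or smaller per-call batches) rather than in the training config.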
