Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version (master) or latest version (3.x).
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
master branch, https://github.com/open-mmlab/mmdetection (v2.25.2)
Environment
sys.platform: linux
Python: 3.8.13 (default, May 26 2022, 00:40:00) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.6
MMDetection: 2.25.2+
Reproduces the problem - code sample
data = dict(
samples_per_gpu=6,
workers_per_gpu=1,
train=dict(
type='ClassBalancedDataset',
oversample_thr=0.012,
dataset=dict(
type='oct_2022',
ann_file=
'/data/10_2022/coco_train.json',
img_prefix=
'/data/10_2022/v2/train_imgs/',
filter_empty_gt=True,
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='Resize',
img_scale=[(1920, 1920)],
keep_ratio=True),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(
type='Albu',
transforms=[
dict(
type='RandomBrightnessContrast',
brightness_limit=[-0.2, 0.2],
contrast_limit=[-0.2, 0.2],
p=0.5),
dict(
type='OneOf',
transforms=[
dict(
type='RGBShift',
r_shift_limit=10,
g_shift_limit=10,
b_shift_limit=10,
p=1.0),
dict(
type='HueSaturationValue',
hue_shift_limit=20,
sat_shift_limit=30,
val_shift_limit=20,
p=1.0)
],
p=0.1),
dict(type='ChannelShuffle', p=0.1),
dict(
type='OneOf',
transforms=[
dict(type='Blur', blur_limit=3, p=1.0),
dict(type='MedianBlur', blur_limit=3, p=1.0)
],
p=0.1)
],
bbox_params=dict(
type='BboxParams',
format='coco',
label_fields=['gt_labels'],
min_visibility=0.0,
filter_lost_elements=True),
keymap=dict(img='image', gt_bboxes='bboxes'),
update_pad_shape=False,
skip_img_without_anno=True),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=['img', 'gt_bboxes', 'gt_labels'],
meta_keys=('filename', 'ori_shape', 'img_shape',
'img_norm_cfg', 'pad_shape', 'scale_factor'))
])),
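One way to rule out the data side is to build the train dataset on the CPU and pull a few samples through the pipeline before the GPU is involved; a minimal sketch, assuming the same config file that is passed to train.py below:

from mmcv import Config
from mmdet.datasets import build_dataset

# Path taken from the reproduction command; adjust if the config has moved.
cfg = Config.fromfile(
    '/data/local/data/octavf/imagerecognition/python_modules/libraries/'
    'mmdetection_v2_25/mmdet/datasets/model_oct_2022.py')
dataset = build_dataset(cfg.data.train)

# Run a handful of samples through the Resize/Normalize/Pad/Albu pipeline and
# print the resulting shapes; DefaultFormatBundle wraps fields in DataContainers.
for i in range(10):
    sample = dataset[i]
    img = sample['img'].data
    boxes = sample['gt_bboxes'].data
    print(i, tuple(img.shape), tuple(boxes.shape))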
Reproduces the problem - command or script
CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=./:$PYTHONPATH python ./train.py /data/local/data/octavf/imagerecognition/python_modules/libraries/mmdetection_v2_25/mmdet/datasets/model_oct_2022.py --gpu-id 0 --seed 42
Reproduces the problem - error message
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
--------------------
2022-11-15 09:03:05,255 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-11-15 09:03:05,256 - mmdet - INFO - Checkpoints will be saved to /data/coco_workdir by HardDiskBackend.
Traceback (most recent call last):
File "./train.py", line 248, in <module>
main()
File "./train.py", line 237, in main
train_detector(
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/apis/train.py", line 244, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/usr/local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/usr/local/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/detectors/two_stage.py", line 135, in forward_train
rpn_losses, proposal_list = self.rpn_head.forward_train(
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/base_dense_head.py", line 330, in forward_train
outs = self(x)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 247, in forward
return multi_apply(self.forward_single, feats)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/ga_rpn_head.py", line 46, in forward_single
loc_pred) = super(GARPNHead, self).forward_single(x)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 236, in forward_single
x = self.feature_adaption(x, shape_pred)
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/python_modules/libraries/mmdetection_v2_25/mmdet/models/dense_heads/guided_anchor_head.py", line 56, in forward
x = self.relu(self.conv_adaption(x, offset))
File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 310, in forward
out = deform_conv2d(x, offset, self.weight, self.stride, self.padding,
File "/usr/local/lib/python3.8/site-packages/mmcv/ops/deform_conv.py", line 92, in forward
ext_module.deform_conv_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f6f91fe61ee in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x26e61 (0x7f6f92060e61 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f6f92065db7 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x466858 (0x7f6f8abd1858 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f6f91fcd7a5 in /usr/local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x362735 (0x7f6f8aacd735 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x67c6c8 (0x7f6f8ade76c8 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f6f8ade7a95 in /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xe7 (0x7f6f97916c87 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
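The failure always surfaces inside mmcv's deform_conv extension (feature_adaption in guided_anchor_head.py), so a standalone call to the same op can help separate a kernel/architecture problem from a data problem. A rough sketch; the channel counts and spatial size are guesses meant to mimic a large FPN level at img_scale=(1920, 1920), not values read from the running model:

import torch
from mmcv.ops import DeformConv2d

# Standalone exercise of the op from the traceback (forward + backward).
conv = DeformConv2d(256, 256, kernel_size=3, padding=1).cuda()
x = torch.randn(6, 256, 480, 480, device='cuda')       # samples_per_gpu=6
offset = torch.randn(6, 18, 480, 480, device='cuda')    # 2 * 3 * 3 offset channels

out = conv(x, offset)
out.sum().backward()
torch.cuda.synchronize()
print('deform_conv ok, peak memory: %.1f GB'
      % (torch.cuda.max_memory_allocated() / 1024 ** 3))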
Additional information
- I have tried several combinations: CUDA 11.3/11.6; torch 1.12.0/1.12.1/1.13.0; mmcv-full 1.7.0 and self-compiled (master and 1.7.0).
  Self-compilation of mmcv: MMCV_WITH_OPS=1 MMCV_CUDA_ARGS='-gencode=arch=compute_80,code=sm_80' pip install -e .
- Sometimes it crashes in the forward pass, other times in the backward pass (ext_module.deform_conv_forward / ext_module.deform_conv_backward_input).
- With samples_per_gpu set to 4 it does not crash.
- Once GPU memory use passes roughly 33 GB, the process core dumps.
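Since the crash appears tied to memory pressure rather than a particular image, logging the allocator state right before the failing call may help; a small helper sketch (the function name is made up, torch's counters are standard):

import torch

def log_cuda_mem(tag=''):
    # Hypothetical helper, not in the repo: prints allocator state so a real
    # out-of-memory situation can be told apart from a kernel/indexing bug.
    to_gb = 1024 ** 3
    print(f'{tag} allocated={torch.cuda.memory_allocated() / to_gb:.1f} GB '
          f'reserved={torch.cuda.memory_reserved() / to_gb:.1f} GB '
          f'peak={torch.cuda.max_memory_allocated() / to_gb:.1f} GB')

# Example: call it just before the line that crashes in guided_anchor_head.py
#   log_cuda_mem('before feature_adaption')
#   x = self.feature_adaption(x, shape_pred)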