[Bug] Training loss is NaN after I changed the anchor ratios and scales #10041

Open
@alexHxun

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmdetection

Environment

packages in environment at E:\miniconda3\envs\mmdet:

Name Version Build Channel

addict 2.4.0 pypi_0 pypi
bzip2 1.0.8 h8ffe710_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates 2022.12.7 h5b45459_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi 2022.12.7 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
click 8.1.3 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
coloredlogs 15.0.1 pypi_0 pypi
contourpy 1.0.7 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
flatbuffers 23.3.3 pypi_0 pypi
fonttools 4.39.0 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 6.0.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
libffi 3.4.2 h8ffe710_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libsqlite 3.40.0 hcfcfb64_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib 1.2.13 hcfcfb64_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
markdown 3.4.1 pypi_0 pypi
markdown-it-py 2.2.0 pypi_0 pypi
matplotlib 3.7.1 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mmcv-full 1.7.1 pypi_0 pypi
mmdet 2.28.2 pypi_0 pypi
model-index 0.1.11 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
numpy 1.20.0 pypi_0 pypi
onnx 1.8.1 pypi_0 pypi
onnxoptimizer 0.3.9 pypi_0 pypi
onnxruntime 1.14.1 pypi_0 pypi
opencv-python 4.7.0.72 pypi_0 pypi
openmim 0.3.6 pypi_0 pypi
openssl 3.0.8 hcfcfb64_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ordered-set 4.1.0 pypi_0 pypi
packaging 23.0 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
pillow 9.4.0 pypi_0 pypi
pip 23.0.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
protobuf 3.20.0 pypi_0 pypi
pycocotools 2.0.6 pypi_0 pypi
pygments 2.14.0 pypi_0 pypi
pymupdf 1.21.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pyreadline3 3.4.1 pypi_0 pypi
python 3.8.16 h4de0772_1_cpython https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil 2.8.2 pypi_0 pypi
pytz 2022.7.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
regex 2022.10.31 pypi_0 pypi
requests 2.28.2 pypi_0 pypi
rich 13.3.2 pypi_0 pypi
scikit-learn 1.2.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
setuptools 67.6.0 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
six 1.16.0 pypi_0 pypi
sklearn 0.0.post1 pypi_0 pypi
sympy 1.11.1 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
terminaltables 3.1.10 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.12 h8ffe710_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
torch 1.10.0+cu113 pypi_0 pypi
torchaudio 0.10.0+cu113 pypi_0 pypi
torchvision 0.11.0+cu113 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
typing-extensions 4.5.0 pypi_0 pypi
ucrt 10.0.22621.0 h57928b3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3 1.26.15 pypi_0 pypi
vc 14.3 hb6edc58_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
vs2015_runtime 14.34.31931 h4c5c07a_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
wheel 0.38.4 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz 5.2.6 h8d14728_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
yapf 0.32.0 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi

Reproduces the problem - code sample

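rpn_head=dict(
    type='RPNHead',
    in_channels=256,
    feat_channels=256,
    anchor_generator=dict(
        type='AnchorGenerator',
        # changed from scales=[8]
        scales=[8, 16, 32],
        # changed from ratios=[0.5, 1.0, 2.0]
        ratios=[0.2, 0.5, 1.0, 2.0, 5.0],
        strides=[4, 8, 16, 32, 64]),
    # remaining rpn_head fields are omitted from this report
),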

Reproduces the problem - command or script

python train.py

Reproduces the problem - error message

2023-03-28 18:53:29,331 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2023-03-28 18:53:29,331 - mmdet - INFO - Checkpoints will be saved to D:\dev\output by HardDiskBackend.
E:\miniconda3\envs\mmdet\lib\site-packages\mmcv\__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
E:\miniconda3\envs\mmdet\lib\site-packages\mmcv\__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
2023-03-28 18:56:24,462 - mmdet - INFO - Epoch [1][50/167822] lr: 1.978e-03, eta: 81 days, 14:14:02, time: 3.501, data_time: 1.926, memory: 7544, loss_rpn_cls: 0.6461, loss_rpn_bbox: 0.3523, loss_cls: 0.3924, acc: 92.3589, loss_bbox: 0.0764, loss: 1.4672
2023-03-28 18:57:00,375 - mmdet - INFO - Epoch [1][100/167822] lr: 3.976e-03, eta: 49 days, 4:27:43, time: 0.720, data_time: 0.003, memory: 7544, loss_rpn_cls: 0.4091, loss_rpn_bbox: 0.3576, loss_cls: 0.2562, acc: 93.8356, loss_bbox: 0.1806, loss: 1.2035
2023-03-28 18:57:36,116 - mmdet - INFO - Epoch [1][150/167822] lr: 5.974e-03, eta: 38 days, 8:13:45, time: 0.715, data_time: 0.001, memory: 7544, loss_rpn_cls: 0.3796, loss_rpn_bbox: 0.4333, loss_cls: 0.3310, acc: 93.2238, loss_bbox: 0.2134, loss: 1.3574
2023-03-28 18:58:08,945 - mmdet - INFO - Epoch [1][200/167822] lr: 7.972e-03, eta: 32 days, 13:59:55, time: 0.657, data_time: 0.001, memory: 7544, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 76.2948, loss_bbox: nan, loss: nan
2023-03-28 18:58:41,679 - mmdet - INFO - Epoch [1][250/167822] lr: 9.970e-03, eta: 29 days, 2:49:15, time: 0.655, data_time: 0.001, memory: 7544, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 76.1290, loss_bbox: nan, loss: nan
2023-03-28 18:59:14,312 - mmdet - INFO - Epoch [1][300/167822] lr: 1.197e-02, eta: 26 days, 19:10:01, time: 0.653, data_time: 0.001, memory: 7544, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 76.1878, loss_bbox: nan, loss: nan
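
To pin down where the losses first diverge, the log can be scanned for the first entry whose total loss is NaN. This is a minimal sketch: `first_nan_iteration` and `LOG_LINE` are hypothetical helpers written for this report, not part of mmdet, and the pattern assumes the log line layout shown above.

```python
import re

# Matches the mmdet 2.x text log layout shown above (an assumption about the format).
LOG_LINE = re.compile(
    r"Epoch \[(\d+)\]\[(\d+)/\d+\]\s+lr: ([\d.e+-]+),.*, loss: (\S+)\s*$"
)


def first_nan_iteration(lines):
    """Return (epoch, iteration, lr) of the first entry whose total loss is NaN."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group(4).lower() == "nan":
            return int(match.group(1)), int(match.group(2)), float(match.group(3))
    return None  # no NaN entry found


# Two entries copied verbatim from the log above.
sample_lines = [
    "2023-03-28 18:57:36,116 - mmdet - INFO - Epoch [1][150/167822] lr: 5.974e-03, eta: 38 days, 8:13:45, time: 0.715, data_time: 0.001, memory: 7544, loss_rpn_cls: 0.3796, loss_rpn_bbox: 0.4333, loss_cls: 0.3310, acc: 93.2238, loss_bbox: 0.2134, loss: 1.3574",
    "2023-03-28 18:58:08,945 - mmdet - INFO - Epoch [1][200/167822] lr: 7.972e-03, eta: 32 days, 13:59:55, time: 0.657, data_time: 0.001, memory: 7544, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 76.2948, loss_bbox: nan, loss: nan",
]

print(first_nan_iteration(sample_lines))
```

In the log above, the first NaN entry is iteration 200 of epoch 1, logged while the learning rate was still ramping up (lr 7.972e-03). Since entries are written every 50 iterations, the divergence itself happened somewhere between iterations 150 and 200.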

Additional information

What I did was extend the scales from [8] to [8, 16, 32] and change the ratios from [0.5, 1.0, 2.0] to [0.2, 0.5, 1.0, 2.0, 5.0].

I didn't find a clear guide on how to tune the anchor sizes and the number of anchors, so I just tried the change above, and the losses became NaN.

If I don't change the anchor settings, training runs normally.
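
For reference, here is a rough sketch of what the change does to the anchor set. It assumes the common convention that each anchor has area (stride * scale)^2 and that ratio means height / width. `anchor_sizes` is a hypothetical helper written for this report, not the mmdet `AnchorGenerator` API, and it only looks at anchor shapes, not positions.

```python
import math


def anchor_sizes(stride, scales, ratios):
    """Return (width, height) in pixels for every anchor shape at one feature level.

    Assumption: ratio = height / width, and the anchor area is (stride * scale) ** 2.
    """
    sizes = []
    for ratio in ratios:
        for scale in scales:
            side = stride * scale  # side length of the square (ratio 1.0) anchor
            width = side / math.sqrt(ratio)
            height = side * math.sqrt(ratio)
            sizes.append((width, height))
    return sizes


strides = [4, 8, 16, 32, 64]
configs = {
    "original": dict(scales=[8], ratios=[0.5, 1.0, 2.0]),
    "modified": dict(scales=[8, 16, 32], ratios=[0.2, 0.5, 1.0, 2.0, 5.0]),
}

for name, cfg in configs.items():
    per_location = len(cfg["scales"]) * len(cfg["ratios"])
    print(f"{name}: {per_location} anchors per location")
    for stride in strides:
        sizes = anchor_sizes(stride, cfg["scales"], cfg["ratios"])
        longest = max(max(w, h) for w, h in sizes)
        print(f"  stride {stride:>2}: longest anchor side {longest:.0f} px")
```

Under these assumptions the modified config goes from 3 to 15 anchors per location, and the largest anchor at stride 64 has an area of 2048 x 2048 px, which is likely far larger than the input images. Whether that is what drives the losses to NaN is not established here.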
