Description
My Environment
- docker: nvcr.io/nvidia/pytorch:24.01-py3
- Python 3.10.12
- cuda 12.3
- torch 2.2.0a0+81ea7a4
After sourcing the configuration file for my system and running the benchmark, training stops partway through the first epoch with a CUDA error.
Could it be an issue with the dataset?
The number of downloaded train images is 1,170,301, and the number of validation images is 24,781.
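For reference, here is a minimal sketch I used to sanity-check the annotations (not part of the benchmark code; the annotation path and the 264-class count are assumptions based on the MLPerf OpenImages setup). It checks whether any label id falls outside the expected class range, since an out-of-range label could cause the index-out-of-bounds assert shown below:

```python
# Sketch: verify that every annotation label id lies inside the expected class range.
# ANN_FILE and NUM_CLASSES are assumptions; adjust them to your local setup.
import json

ANN_FILE = "/datasets/open-images-v6/annotations/openimages-mlperf_train.json"  # assumed path
NUM_CLASSES = 264  # assumed: MLPerf RetinaNet uses 264 OpenImages classes

with open(ANN_FILE) as f:
    ann = json.load(f)

cat_ids = sorted({c["id"] for c in ann["categories"]})
label_ids = {a["category_id"] for a in ann["annotations"]}

print(f"images: {len(ann['images'])}, annotations: {len(ann['annotations'])}")
print(f"category ids: min={cat_ids[0]}, max={cat_ids[-1]}, count={len(cat_ids)}")
print(f"labels outside [1, {NUM_CLASSES}]: "
      f"{sorted(l for l in label_ids if not 1 <= l <= NUM_CLASSES)[:20]}")
```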
Errors
Creating data loaders
Loading annotations into memory...
Done (t=42.38s)
Creating index...
index created!
Loading annotations into memory...
Done (t=1.04s)
Creating index...
index created!
:::MLLOG {"namespace": "", "time_ms": 1740528034501, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 36571, "metadata": {"file": "train.py", "lineno": 220}}
:::MLLOG {"namespace": "", "time_ms": 1740528034502, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 775, "metadata": {"file": "train.py", "lineno": 221}}
Running ...
:::MLLOG {"namespace": "", "time_ms": 1740528034503, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch: [0] [ 0/36571] eta: 1 day, 22:50:54 lr: 0.000000 loss: 2.2681 (2.2681) classification: 1.5570 (1.5570) bbox_regression: 0.7112 (0.7112) time: 4.6117 data: 2.3579 max mem: 51676
Epoch: [0] [ 20/36571] eta: 6:43:28 lr: 0.000000 loss: 2.1965 (2.2525) classification: 1.4899 (1.5364) bbox_regression: 0.7040 (0.7162) time: 0.4649 data: 0.0004 max mem: 52126
Epoch: [0] [ 40/36571] eta: 5:44:49 lr: 0.000000 loss: 2.1948 (2.2442) classification: 1.4947 (1.5284) bbox_regression: 0.6966 (0.7158) time: 0.4656 data: 0.0004 max mem: 52126
Epoch: [0] [ 60/36571] eta: 5:24:05 lr: 0.000000 loss: 2.2333 (2.2632) classification: 1.5102 (1.5471) bbox_regression: 0.7020 (0.7160) time: 0.4634 data: 0.0004 max mem: 52126
Epoch: [0] [ 80/36571] eta: 5:13:44 lr: 0.000000 loss: 2.1976 (2.2609) classification: 1.4952 (1.5471) bbox_regression: 0.7035 (0.7138) time: 0.4649 data: 0.0005 max mem: 52126
Epoch: [0] [ 100/36571] eta: 5:07:41 lr: 0.000000 loss: 2.2347 (2.2632) classification: 1.5412 (1.5491) bbox_regression: 0.7127 (0.7141) time: 0.4670 data: 0.0005 max mem: 52126
Epoch: [0] [ 120/36571] eta: 5:03:12 lr: 0.000000 loss: 2.2351 (2.2656) classification: 1.5331 (1.5520) bbox_regression: 0.6994 (0.7136) time: 0.4632 data: 0.0005 max mem: 52126
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [40,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
File "/workspace/ssd/train.py", line 266, in <module>
main(args)
File "/workspace/ssd/train.py", line 235, in main
train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
File "/workspace/ssd/engine.py", line 35, in train_one_epoch
loss_dict = model(images, targets)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1509, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1345, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/ssd/model/retinanet.py", line 552, in forward
losses = self.compute_loss(targets, head_outputs, anchors)
File "/workspace/ssd/model/retinanet.py", line 413, in compute_loss
return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
File "/workspace/ssd/model/retinanet.py", line 57, in compute_loss
'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
File "/workspace/ssd/model/retinanet.py", line 122, in compute_loss
gt_classes_target[
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
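Since the assert fires while indexing gt_classes_target in the classification loss, a label id outside the expected class range would produce exactly this failure. Below is a minimal sketch (my own, not part of the benchmark; the check_targets helper name and the 264-class count are assumptions) that validates each batch's labels on the CPU just before the loss_dict = model(images, targets) call in engine.py, so a bad annotation fails with a readable Python error instead of a deferred device-side assert. Running with CUDA_LAUNCH_BLOCKING=1 should also pinpoint the failing kernel.

```python
import torch

def check_targets(targets, num_classes):
    """Raise a readable error if any label id falls outside [0, num_classes)."""
    for i, t in enumerate(targets):
        labels = t["labels"]
        if labels.numel() == 0:
            continue
        lo, hi = labels.min().item(), labels.max().item()
        if lo < 0 or hi >= num_classes:
            raise ValueError(
                f"target {i}: label range [{lo}, {hi}] outside expected [0, {num_classes})"
            )

# Example with a deliberately bad label (the 264-class count is an assumption):
targets = [{"labels": torch.tensor([3, 500])}]
check_targets(targets, num_classes=264)  # raises ValueError pointing at label 500
```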