Description
My Environment
- docker: nvcr.io/nvidia/pytorch:24.01-py3
- Python 3.10.12
- cuda 12.3
- torch 2.2.0a0+81ea7a4
After sourcing the configuration file for my system and running the benchmark, training stops partway through the first epoch with a CUDA error.
Could it be an issue with the dataset?
The number of downloaded train images is 1,170,301, and the number of validation images is 24,781.
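For reference, here is a minimal sketch I used to sanity-check the annotations (not part of the benchmark code; the annotation path and the 264-class count are assumptions based on the MLPerf OpenImages setup). It checks whether any label id falls outside the expected class range, since an out-of-range label could cause the index-out-of-bounds assert shown below:

```python
# Sketch: verify that every annotation label id lies inside the expected class range.
# ANN_FILE and NUM_CLASSES are assumptions; adjust them to your local setup.
import json

ANN_FILE = "/datasets/open-images-v6/annotations/openimages-mlperf_train.json"  # assumed path
NUM_CLASSES = 264  # assumed: MLPerf RetinaNet uses 264 OpenImages classes

with open(ANN_FILE) as f:
    ann = json.load(f)

cat_ids = sorted({c["id"] for c in ann["categories"]})
label_ids = {a["category_id"] for a in ann["annotations"]}

print(f"images: {len(ann['images'])}, annotations: {len(ann['annotations'])}")
print(f"category ids: min={cat_ids[0]}, max={cat_ids[-1]}, count={len(cat_ids)}")
print(f"labels outside [1, {NUM_CLASSES}]: "
      f"{sorted(l for l in label_ids if not 1 <= l <= NUM_CLASSES)[:20]}")
```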
Errors
Creating data loaders
Loading annotations into memory...
Done (t=42.38s)
Creating index...
index created!
Loading annotations into memory...
Done (t=1.04s)
Creating index...
index created!
:::MLLOG {"namespace": "", "time_ms": 1740528034501, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 36571, "metadata": {"file": "train.py", "lineno": 220}}
:::MLLOG {"namespace": "", "time_ms": 1740528034502, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 775, "metadata": {"file": "train.py", "lineno": 221}}
Running ...
:::MLLOG {"namespace": "", "time_ms": 1740528034503, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch: [0] [ 0/36571] eta: 1 day, 22:50:54 lr: 0.000000 loss: 2.2681 (2.2681) classification: 1.5570 (1.5570) bbox_regression: 0.7112 (0.7112) time: 4.6117 data: 2.3579 max mem: 51676
Epoch: [0] [ 20/36571] eta: 6:43:28 lr: 0.000000 loss: 2.1965 (2.2525) classification: 1.4899 (1.5364) bbox_regression: 0.7040 (0.7162) time: 0.4649 data: 0.0004 max mem: 52126
Epoch: [0] [ 40/36571] eta: 5:44:49 lr: 0.000000 loss: 2.1948 (2.2442) classification: 1.4947 (1.5284) bbox_regression: 0.6966 (0.7158) time: 0.4656 data: 0.0004 max mem: 52126
Epoch: [0] [ 60/36571] eta: 5:24:05 lr: 0.000000 loss: 2.2333 (2.2632) classification: 1.5102 (1.5471) bbox_regression: 0.7020 (0.7160) time: 0.4634 data: 0.0004 max mem: 52126
Epoch: [0] [ 80/36571] eta: 5:13:44 lr: 0.000000 loss: 2.1976 (2.2609) classification: 1.4952 (1.5471) bbox_regression: 0.7035 (0.7138) time: 0.4649 data: 0.0005 max mem: 52126
Epoch: [0] [ 100/36571] eta: 5:07:41 lr: 0.000000 loss: 2.2347 (2.2632) classification: 1.5412 (1.5491) bbox_regression: 0.7127 (0.7141) time: 0.4670 data: 0.0005 max mem: 52126
Epoch: [0] [ 120/36571] eta: 5:03:12 lr: 0.000000 loss: 2.2351 (2.2656) classification: 1.5331 (1.5520) bbox_regression: 0.6994 (0.7136) time: 0.4632 data: 0.0005 max mem: 52126
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [40,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
File "/workspace/ssd/train.py", line 266, in <module>
main(args)
File "/workspace/ssd/train.py", line 235, in main
train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
File "/workspace/ssd/engine.py", line 35, in train_one_epoch
loss_dict = model(images, targets)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1509, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1345, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/ssd/model/retinanet.py", line 552, in forward
losses = self.compute_loss(targets, head_outputs, anchors)
File "/workspace/ssd/model/retinanet.py", line 413, in compute_loss
return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
File "/workspace/ssd/model/retinanet.py", line 57, in compute_loss
'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
File "/workspace/ssd/model/retinanet.py", line 122, in compute_loss
gt_classes_target[
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
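Since the assert fires while indexing gt_classes_target in the classification loss, a label id outside the expected class range would produce exactly this failure. Below is a minimal sketch (my own, not part of the benchmark; the check_targets helper name and the 264-class count are assumptions) that validates each batch's labels on the CPU just before the loss_dict = model(images, targets) call in engine.py, so a bad annotation fails with a readable Python error instead of a deferred device-side assert. Running with CUDA_LAUNCH_BLOCKING=1 should also pinpoint the failing kernel.

```python
import torch

def check_targets(targets, num_classes):
    """Raise a readable error if any label id falls outside [0, num_classes)."""
    for i, t in enumerate(targets):
        labels = t["labels"]
        if labels.numel() == 0:
            continue
        lo, hi = labels.min().item(), labels.max().item()
        if lo < 0 or hi >= num_classes:
            raise ValueError(
                f"target {i}: label range [{lo}, {hi}] outside expected [0, {num_classes})"
            )

# Example with a deliberately bad label (the 264-class count is an assumption):
targets = [{"labels": torch.tensor([3, 500])}]
check_targets(targets, num_classes=264)  # raises ValueError pointing at label 500
```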