Description
I am trying to fine-tune RF-DETR Medium (COCO pretrained) on the BDD100K detection dataset. Training starts normally but eventually crashes with a CUDA out-of-memory (OOM) error, even though GPU memory usage initially stabilizes.
The crash happens at a different iteration in each run (e.g. step ~400 in one run, ~1782 in another).
Environment:
GPU: NVIDIA L40S (48GB)
CUDA: 12.x
PyTorch: 2.x
RF-DETR: 1.5.2 (`pip install rfdetr`); also tried 1.5.1
Python: 3.10
OS: Ubuntu
Dataset:
Dataset: BDD100K detection
Train images: ~70k
Val images: ~10k
Classes: 10
Model
Model: rf_detr_medium
Pretrained: COCO
Queries: 300
Train Config
```yaml
training:
  epochs: 40
  batch_size: 16
  grad_accum_steps: 1
  num_workers: 8
  output_dir: outputs/rf_detr_bdd100k
optimizer:
  lr: 1.0e-4
  lr_encoder: 1.0e-5
  weight_decay: 1.0e-4
```
Observed Behavior:
Training starts normally and GPU memory usage increases gradually.
The GPU memory graph shows usage plateauing near 98% of the 48 GB before the crash.
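Since the crash step varies from run to run while memory sits near the ceiling, allocator fragmentation may be a factor. A generic PyTorch 2.x mitigation (not an RF-DETR option, and I have not confirmed it resolves this case) is to enable expandable segments before launching training:

```shell
# Generic PyTorch CUDA allocator setting (not RF-DETR-specific); can reduce
# fragmentation-related OOMs on some workloads. Set before launching training.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```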
Things I tried:
RFDETRMedium - batch size 32, grad accum 1 - crashed OOM after epoch 1 (mid-epoch)
RFDETRMedium - batch size 16, grad accum 2 - crashed OOM after epoch 8 (mid-epoch)
RFDETRSmall - batch size 16, grad accum 1 - crashed OOM after epoch 3 (mid-epoch)
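As a sanity check on the runs above (plain arithmetic, nothing RF-DETR-specific): the first two runs have the same effective batch size and the third halves it, yet all three eventually OOM, which points at gradual memory growth rather than a static footprint that is simply too large:

```python
# The attempted runs, as (model, batch_size, grad_accum_steps).
# Effective batch size = batch_size * grad_accum_steps.
runs = [
    ("RFDETRMedium", 32, 1),
    ("RFDETRMedium", 16, 2),
    ("RFDETRSmall", 16, 1),
]
for model, bs, accum in runs:
    print(f"{model}: per-step batch {bs}, effective batch {bs * accum}")
```

Runs 1 and 2 both train with an effective batch of 32 (run 2 with half the per-step activation memory), and run 3 drops to 16 on a smaller model; the later crash epochs at lower per-step memory are consistent with memory creeping up over time.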