RF-DETR v1.6.4 Regression: 2x Slower Training, Logging Issues, and Missing GPU Memory Bar #974

@Wavelet303

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug: RF-DETR v1.6.4: 2x Slower Training, Logging Issues, and Missing GPU Memory Bar

Hi,

I'm training an instance segmentation model with RF-DETR v1.6.4 and am experiencing several issues that were not present in version 1.5.0.

Dataset

  • ~18,000 images
  • Resolution: 512x512

Issues

1. GPU Memory Bar Not Showing

The GPU memory usage bar that was visible in previous versions is no longer displayed during training.

2. Logging Issues

  • A log.txt file is generated, but it only contains information for a single epoch.
  • hparams.yaml is created but remains empty.

3. Significant Performance Drop (Critical)

Training with v1.6.4 is approximately 2x slower than v1.5.0 under the same conditions:

  • Same dataset
  • Same hardware
  • Similar training configuration

This is the most concerning issue.

Additional Notes

  • No major dataset or hardware changes were introduced between versions.
  • The issues appeared after upgrading from v1.5.0 to v1.6.4.

Possible Hypotheses

I am not sure whether the slowdown is caused by one of the following:

  • A change in how num_workers is determined internally
  • Changes in the augmentation pipeline, especially when using AUG_INDUSTRIAL
  • A change in how the effective batch size is computed or handled internally
  • Possible changes related to PyTorch Lightning (e.g., trainer behavior, logging, or performance overhead)
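To help separate these hypotheses, one option is to time the data pipeline in isolation in both the v1.5.0 and v1.6.4 environments. The sketch below is a hypothetical, stdlib-only diagnostic (not part of RF-DETR); `measure_throughput` and the `fake_batches` stand-in are made up for illustration, and in practice you would wrap the real training dataloader instead:

```python
import time

def measure_throughput(iterable, max_items=100):
    """Consume up to max_items from an iterable and return items/sec.

    Run this against the training dataloader under both library versions:
    if the rate drops roughly 2x, the regression is likely in data loading
    or augmentation rather than in the model's forward/backward pass.
    """
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
        if count >= max_items:
            break
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0

# Demo with a synthetic generator standing in for a dataloader:
def fake_batches(n):
    for _ in range(n):
        yield [0] * 512  # placeholder "batch"

rate = measure_throughput(fake_batches(1000), max_items=200)
print(f"~{rate:.0f} batches/sec")
```

Comparing this number between the two environments (same dataset, same `num_workers`) would indicate whether the slowdown lives in the input pipeline or elsewhere.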

Questions

  • Has anyone experienced similar issues with v1.6.4?
  • Are there known regressions affecting performance or logging?
  • Has anything changed internally regarding num_workers, augmentations, batch size handling, or PyTorch Lightning integration?
  • Could this be related to PyTorch 2.8 or CUDA 12.9 compatibility?

Thanks in advance for any help!

Environment

  • GPU: NVIDIA RTX 5090
  • OS: Linux
  • Python: 3.10.19
  • PyTorch: 2.8.0+cu129
  • CUDA: 12.9
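When comparing the two environments, it may help to dump the exact installed package versions in each so they can be diffed. This is a stdlib-only sketch using `importlib.metadata`; the package names in the default tuple are assumptions and should be adjusted to whatever is actually installed:

```python
from importlib import metadata

def report_versions(packages=("rfdetr", "torch", "pytorch-lightning")):
    """Return one 'name==version' line per package, or note it is missing."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
    return "\n".join(lines)

print(report_versions())
```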

Minimal Reproducible Example

#!/usr/bin/env python3
from rfdetr import RFDETRSegLarge
from rfdetr.datasets.aug_config import AUG_INDUSTRIAL

# Paths
DATASET_DIR = "/path/to/dataset"
OUTPUT_DIR = "/path/to/output"
PROJECT_NAME = "rfdetr"

# Training configuration (effective batch size: 9 * 8 = 72)
EPOCHS = 35
BATCH_SIZE = 9
GRAD_ACCUM_STEPS = 8
LEARNING_RATE = 1e-4
EARLY_STOPPING = True
EARLY_STOPPING_PATIENCE = 10

def main():
    model = RFDETRSegLarge()
    print(f"Model input resolution: {model.model.resolution}")

    model.train(
        dataset_dir=DATASET_DIR,
        aug_config=AUG_INDUSTRIAL,
        run_test=False,
        checkpoint_interval=2,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        grad_accum_steps=GRAD_ACCUM_STEPS,
        lr=LEARNING_RATE,
        output_dir=OUTPUT_DIR,
        project=PROJECT_NAME,
        early_stopping=EARLY_STOPPING,
        early_stopping_patience=EARLY_STOPPING_PATIENCE,
        progress_bar=True,
    )

if __name__ == "__main__":
    main()
