Huge performance difference between datasets on RF-DETR Nano #958

@MichalTurek

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

I am currently comparing RF-DETR Nano models with YOLOv11 models. In particular, I am trying to validate the claim that RF-DETR Nano is faster and more accurate than YOLOv11-medium. That does appear to hold for my football-players dataset:
YOLO results:
(screenshot)
RF-DETR Nano results:
(screenshot)

However, I see significant performance issues when running RF-DETR Nano on the ball dataset:

(screenshot)

For the ball dataset, inference speed drops by more than a factor of two at lower resolutions.
Link to models: here
Links to datasets:
  • ball: here
  • players: here
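For what it's worth, cross-dataset latency comparisons are most reliable with warmup iterations and a median over many runs. Below is a minimal, self-contained sketch of that methodology; the stub `predict` lambda stands in for the real `model.predict(...)` call, which is not assumed here.

```python
import time
import statistics
from typing import Callable, List


def benchmark_ms(predict: Callable[[], object], warmup: int = 10, runs: int = 50) -> float:
    """Return the median latency of `predict` in milliseconds.

    Warmup runs absorb one-time costs (CUDA context creation, kernel
    autotuning, lazy initialization) that would otherwise skew the
    first measurements.
    """
    for _ in range(warmup):
        predict()
    samples: List[float] = []
    for _ in range(runs):
        t0 = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - t0) * 1000.0)
    # Median is more robust to scheduler hiccups than the mean.
    return statistics.median(samples)


# Stub predict that sleeps ~2 ms; a real harness would call
# model.predict(img, threshold=...) here instead.
latency = benchmark_ms(lambda: time.sleep(0.002), warmup=2, runs=10)
```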

Environment

RF-DETR version: 1.6.3
OS: Windows-10-10.0.26200-SP0
Python version: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
PyTorch version: 2.11.0+cu128
PyTorch CUDA version: 12.8
GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU

Minimal Reproducible Example

The RF-DETR models are run with optimize_for_inference(). The code below measures adapter_ms (the total adapter overhead per call) and, when the model reports them, per-stage timings such as forward_ms:

def rfdetr_predict_to_predet(
    model,
    pil_img: Image.Image,
    threshold: float,
    active_cids: List[int],
) -> U.PredictResult:
    U.cuda_sync_if_needed()
    adapter_t0 = time.perf_counter()

    # Print diagnostics once, on the first predict call.
    if not hasattr(model, "_debug_predict_printed"):
        setattr(model, "_debug_predict_printed", True)
        print("=" * 80)
        print("[FIRST PREDICT]")
        print("model detected device:", inspect_model_device(model))
        print("image size:", pil_img.size)
        print("threshold:", threshold)
        print_gpu_memory("[CUDA BEFORE PREDICT]")

    dets = None
    model_ms = None
    extra_timings: Dict[str, Any] = {}

    try:
        out = model.predict(pil_img, threshold=threshold, return_timings=True)
        if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
            dets, timing = out
            if timing.get("forward_ms") is not None:
                model_ms = float(timing["forward_ms"])
            for key in ["preprocess_ms", "forward_ms", "postprocess_ms", "convert_ms", "total_ms"]:
                if timing.get(key) is not None:
                    extra_timings[key] = float(timing[key])
        else:
            dets = out
    except TypeError:
        # Older predict() signatures may not accept return_timings;
        # fall back to timing the whole call.
        dets, model_ms = U.timed_call_ms(
            lambda: model.predict(pil_img, threshold=threshold)
        )

    if getattr(model, "_debug_predict_printed", False) and not hasattr(model, "_debug_predict_done_printed"):
        setattr(model, "_debug_predict_done_printed", True)
        print_gpu_memory("[CUDA AFTER PREDICT]")
        print("timings:", extra_timings if extra_timings else {"forward_ms": model_ms})
        print("=" * 80)

    empty_pred = U.PredDet(
        np.zeros((0, 4), float),
        np.zeros((0,), float),
        np.zeros((0,), int),
    )

    if dets is None or len(dets) == 0:
        U.cuda_sync_if_needed()
        adapter_t1 = time.perf_counter()
        return U.PredictResult(
            pred=empty_pred,
            adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
            model_ms=model_ms,
            extra_timings=extra_timings or None,
        )

    xyxy = np.asarray(dets.xyxy, dtype=float)
    scores = np.asarray(dets.confidence, dtype=float)
    cls = np.asarray(dets.class_id, dtype=int)

    # Keep only detections whose class id is active and not explicitly skipped.
    keep = []
    out_cids = []
    for i in range(len(cls)):
        cid = int(cls[i])
        if cid in SKIP_COCO_CATEGORY_IDS:
            continue
        if cid not in active_cids:
            continue
        keep.append(i)
        out_cids.append(cid)

    if not keep:
        U.cuda_sync_if_needed()
        adapter_t1 = time.perf_counter()
        return U.PredictResult(
            pred=empty_pred,
            adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
            model_ms=model_ms,
            extra_timings=extra_timings or None,
        )

    idx = np.array(keep, dtype=int)
    pred = U.PredDet(
        boxes_xyxy=xyxy[idx],
        scores=scores[idx],
        class_ids=np.array(out_cids, dtype=int),
    )

    U.cuda_sync_if_needed()
    adapter_t1 = time.perf_counter()
    return U.PredictResult(
        pred=pred,
        adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
        model_ms=model_ms,
        extra_timings=extra_timings or None,
    )
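The helpers U.cuda_sync_if_needed and U.timed_call_ms are referenced but not shown; their names suggest implementations along the following lines. This is a hypothetical reconstruction, not the actual U module. Synchronizing before reading the clock matters because CUDA kernels launch asynchronously, so without it perf_counter would measure only kernel-launch overhead.

```python
import time
from typing import Any, Callable, Tuple

try:
    import torch
except ImportError:  # allow the timing helper to run without PyTorch
    torch = None


def cuda_sync_if_needed() -> None:
    """Block until all queued CUDA work finishes, so that
    perf_counter reflects actual GPU execution time."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def timed_call_ms(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() and return (result, elapsed milliseconds)."""
    cuda_sync_if_needed()
    t0 = time.perf_counter()
    result = fn()
    cuda_sync_if_needed()
    return result, (time.perf_counter() - t0) * 1000.0


result, ms = timed_call_ms(lambda: sum(range(1000)))
```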

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!

    Labels

    bug (Something isn't working), question (Further information is requested)
