Huge performance difference between datasets on RF-DETR Nano #958

@MichalTurek

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

I am currently comparing RF-DETR Nano models with YOLOv11 models. In particular, I am trying to validate the claim that RF-DETR Nano is faster and more accurate than YOLOv11-medium. That does appear to hold for my football-players dataset:
YOLO results:
(screenshot)
RF-DETR Nano results:
(screenshot)

However, I see significant performance issues when running RF-DETR Nano on the ball dataset:

(screenshot)

For the ball dataset, inference speed drops by more than a factor of two at lower resolutions.
Link to models: here
Links to datasets:
  • ball: here
  • players: here
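For what it's worth, cross-dataset latency comparisons are most reliable with warmup iterations and a median over many runs. Below is a minimal, self-contained sketch of that methodology; the stub `predict` lambda stands in for the real `model.predict(...)` call, which is not assumed here.

```python
import time
import statistics
from typing import Callable, List


def benchmark_ms(predict: Callable[[], object], warmup: int = 10, runs: int = 50) -> float:
    """Return the median latency of `predict` in milliseconds.

    Warmup runs absorb one-time costs (CUDA context creation, kernel
    autotuning, lazy initialization) that would otherwise skew the
    first measurements.
    """
    for _ in range(warmup):
        predict()
    samples: List[float] = []
    for _ in range(runs):
        t0 = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - t0) * 1000.0)
    # Median is more robust to scheduler hiccups than the mean.
    return statistics.median(samples)


# Stub predict that sleeps ~2 ms; a real harness would call
# model.predict(img, threshold=...) here instead.
latency = benchmark_ms(lambda: time.sleep(0.002), warmup=2, runs=10)
```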

Environment

RF-DETR version: 1.6.3
OS: Windows-10-10.0.26200-SP0
Python version: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
PyTorch version: 2.11.0+cu128
PyTorch CUDA version: 12.8
GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU

Minimal Reproducible Example

The RF-DETR models are run with optimize_for_inference(). The code below measures adapter_ms (the total adapter overhead per call) and, when the model reports them, per-stage timings such as forward_ms:

def rfdetr_predict_to_predet(
    model,
    pil_img: Image.Image,
    threshold: float,
    active_cids: List[int],
) -> U.PredictResult:
    U.cuda_sync_if_needed()
    adapter_t0 = time.perf_counter()

    # Print diagnostics once, on the first predict call.
    if not hasattr(model, "_debug_predict_printed"):
        setattr(model, "_debug_predict_printed", True)
        print("=" * 80)
        print("[FIRST PREDICT]")
        print("model detected device:", inspect_model_device(model))
        print("image size:", pil_img.size)
        print("threshold:", threshold)
        print_gpu_memory("[CUDA BEFORE PREDICT]")

    dets = None
    model_ms = None
    extra_timings: Dict[str, Any] = {}

    try:
        out = model.predict(pil_img, threshold=threshold, return_timings=True)
        if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
            dets, timing = out
            if timing.get("forward_ms") is not None:
                model_ms = float(timing["forward_ms"])
            for key in ["preprocess_ms", "forward_ms", "postprocess_ms", "convert_ms", "total_ms"]:
                if timing.get(key) is not None:
                    extra_timings[key] = float(timing[key])
        else:
            dets = out
    except TypeError:
        # Older predict() signatures may not accept return_timings;
        # fall back to timing the whole call.
        dets, model_ms = U.timed_call_ms(
            lambda: model.predict(pil_img, threshold=threshold)
        )

    if getattr(model, "_debug_predict_printed", False) and not hasattr(model, "_debug_predict_done_printed"):
        setattr(model, "_debug_predict_done_printed", True)
        print_gpu_memory("[CUDA AFTER PREDICT]")
        print("timings:", extra_timings if extra_timings else {"forward_ms": model_ms})
        print("=" * 80)

    empty_pred = U.PredDet(
        np.zeros((0, 4), float),
        np.zeros((0,), float),
        np.zeros((0,), int),
    )

    if dets is None or len(dets) == 0:
        U.cuda_sync_if_needed()
        adapter_t1 = time.perf_counter()
        return U.PredictResult(
            pred=empty_pred,
            adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
            model_ms=model_ms,
            extra_timings=extra_timings or None,
        )

    xyxy = np.asarray(dets.xyxy, dtype=float)
    scores = np.asarray(dets.confidence, dtype=float)
    cls = np.asarray(dets.class_id, dtype=int)

    # Keep only detections whose class id is active and not explicitly skipped.
    keep = []
    out_cids = []
    for i in range(len(cls)):
        cid = int(cls[i])
        if cid in SKIP_COCO_CATEGORY_IDS:
            continue
        if cid not in active_cids:
            continue
        keep.append(i)
        out_cids.append(cid)

    if not keep:
        U.cuda_sync_if_needed()
        adapter_t1 = time.perf_counter()
        return U.PredictResult(
            pred=empty_pred,
            adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
            model_ms=model_ms,
            extra_timings=extra_timings or None,
        )

    idx = np.array(keep, dtype=int)
    pred = U.PredDet(
        boxes_xyxy=xyxy[idx],
        scores=scores[idx],
        class_ids=np.array(out_cids, dtype=int),
    )

    U.cuda_sync_if_needed()
    adapter_t1 = time.perf_counter()
    return U.PredictResult(
        pred=pred,
        adapter_ms=(adapter_t1 - adapter_t0) * 1000.0,
        model_ms=model_ms,
        extra_timings=extra_timings or None,
    )
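The helpers U.cuda_sync_if_needed and U.timed_call_ms are referenced but not shown; their names suggest implementations along the following lines. This is a hypothetical reconstruction, not the actual U module. Synchronizing before reading the clock matters because CUDA kernels launch asynchronously, so without it perf_counter would measure only kernel-launch overhead.

```python
import time
from typing import Any, Callable, Tuple

try:
    import torch
except ImportError:  # allow the timing helper to run without PyTorch
    torch = None


def cuda_sync_if_needed() -> None:
    """Block until all queued CUDA work finishes, so that
    perf_counter reflects actual GPU execution time."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def timed_call_ms(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() and return (result, elapsed milliseconds)."""
    cuda_sync_if_needed()
    t0 = time.perf_counter()
    result = fn()
    cuda_sync_if_needed()
    return result, (time.perf_counter() - t0) * 1000.0


result, ms = timed_call_ms(lambda: sum(range(1000)))
```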

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!

    Labels

    bug (Something isn't working), question (Further information is requested)
