Fine-tuned model runs on CPU instead of GPU with CoreML on Apple Silicon #427

@valentinweyer

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

Description:
When running inference with my fine-tuned RF-DETR model (player-and-handball-detection-3z9xf/1) on Apple Silicon, the model runs on the CPU only, even though the CoreML execution provider is configured. The base RF-DETR model (e.g., rfdetr-small) runs on the GPU as expected when loaded via RFDETRSmall(), but not when loaded via get_model("rfdetr-small").


Key Observations:

  • Fine-tuned model (get_model("player-and-handball-detection-3z9xf/1")) → CPU only.
  • Base model (RFDETRSmall()) → GPU works.
  • Base model via get_model("rfdetr-small") → CPU only (same as fine-tuned model).
  • GPU utilization remains at 0% during inference for get_model() cases.

Environment

  • Roboflow Inference Version: 0.59.0
  • OS: macOS 26.0.1
  • Python Version: 3.12.11
  • ONNX Runtime Version: 1.21.1
  • Hardware: MacBook Pro M1 Max (64GB RAM)

Minimal Reproducible Example

import os
import time
import cv2
import supervision as sv
from inference import get_model
import onnxruntime as ort

# Configure CoreML
os.environ["INFERENCE_LOG_LEVEL"] = "DEBUG"
os.environ["ONNXRUNTIME_EXECUTION_PROVIDERS"] = "[CoreMLExecutionProvider]"
os.environ["ORT_LOG_SEVERITY_LEVEL"] = "0"      # VERBOSE
os.environ["ORT_LOG_VERBOSITY_LEVEL"] = "1"     # Extra detail
print("[ONNX Runtime] Available providers:", ort.get_available_providers())

# Load API key and model
API_KEY = os.getenv("ROBOFLOW_API_KEY")
MODEL_ID = "player-and-handball-detection-3z9xf/1"
VIDEO_IN = "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
VIDEO_OUT = "Clip_annotated_coreml.mp4"

print("[Inference] Loading model…")
model = get_model(MODEL_ID, api_key=API_KEY)

# Process video
cap = cv2.VideoCapture(VIDEO_IN)
assert cap.isOpened(), f"Cannot open {VIDEO_IN}"
w, h = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps_in = cap.get(cv2.CAP_PROP_FPS) or 25.0
vw = cv2.VideoWriter(VIDEO_OUT, cv2.VideoWriter_fourcc(*"mp4v"), fps_in, (w, h))
box_annot = sv.BoxAnnotator(thickness=2)
font = cv2.FONT_HERSHEY_SIMPLEX
t0_all = time.time()
n = 0

print("[Inference] Started…")
while True:
    ok, frame = cap.read()
    if not ok:
        break

    t0 = time.time()
    n += 1
    res = model.infer(frame, confidence=0.25, iou=0.5)[0]
    dets = sv.Detections.from_inference(res)
    annotated = box_annot.annotate(scene=frame.copy(), detections=dets)

    # Draw labels
    names = dets.data.get("class_name", [str(i) for i in dets.class_id])
    labels = [f"{n_} {c:.2f}" for n_, c in zip(names, dets.confidence)]
    for (x1, y1, x2, y2), text in zip(dets.xyxy, labels):
        cv2.putText(annotated, text, (int(x1), max(0, int(y1) - 4)), font, 0.5, (0, 255, 0), 1, cv2.LINE_AA)

    # Calculate FPS
    dt = time.time() - t0
    fps_inst = 1.0 / dt if dt > 0 else 0.0
    fps_avg = n / (time.time() - t0_all + 1e-9)
    txt = f"FPS: {fps_inst:.1f} (avg {fps_avg:.1f})"
    cv2.putText(annotated, txt, (12, 28), font, 0.8, (0, 0, 0), 4, cv2.LINE_AA)
    cv2.putText(annotated, txt, (12, 28), font, 0.8, (255, 255, 0), 2, cv2.LINE_AA)

    if n % 30 == 0:
        print(f"[{n}] inst={fps_inst:.2f} | avg={fps_avg:.2f} | preds={len(dets)}")

    vw.write(annotated)

cap.release()
vw.release()
print(f"[Done] Saved: {VIDEO_OUT}")

Additional

Terminal Output:

[ONNX Runtime] Available providers: ['CoreMLExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']
[Inference] Loading model…
UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names. Available providers: 'CoreMLExecutionProvider, AzureExecutionProvider, CPUExecutionProvider'
[Inference] Started…
[30] inst=2.73 | avg=2.59 | preds=19
...
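The UserWarning above suggests that the get_model() path requests CUDAExecutionProvider (unavailable on Apple Silicon) and then silently falls back to CPU, instead of honoring the ONNXRUNTIME_EXECUTION_PROVIDERS setting. As a minimal sketch of the fallback behavior I would expect instead (the function name and logic here are hypothetical, not the inference package's actual code):

```python
def pick_providers(available, preferred=("CoreMLExecutionProvider",)):
    """Keep the preferred providers that are actually available,
    then append the CPU provider as the guaranteed fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# On Apple Silicon, where CUDA is unavailable, this would select CoreML:
available = ["CoreMLExecutionProvider", "AzureExecutionProvider", "CPUExecutionProvider"]
print(pick_providers(available))  # ['CoreMLExecutionProvider', 'CPUExecutionProvider']
```

With logic like this, an unavailable preferred provider would simply be skipped rather than triggering a warning and dropping the CoreML preference altogether.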

Expected Behavior:

  • The fine-tuned model should utilize the Apple Silicon GPU for inference, similar to the base RFDETRSmall model.
  • GPU utilization should be visible in Activity Monitor during inference.

Actual Behavior:

  • The fine-tuned model runs entirely on the CPU, resulting in low throughput (~2.59 FPS) and no GPU utilization.
  • The same behavior occurs with the base model when loaded via get_model("rfdetr-small").
  • The base RFDETRSmall model runs on GPU with the same script and environment.

Diagnostic Information:

  • Activity Monitor confirms 0% GPU usage during inference.
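The script sets ONNXRUNTIME_EXECUTION_PROVIDERS to the bracketed string "[CoreMLExecutionProvider]". For reference, a small stdlib-only helper (hypothetical, not the inference package's parser) showing how that string would expand into a provider list:

```python
def parse_providers(value):
    """Parse a bracketed, comma-separated provider string such as
    "[CoreMLExecutionProvider,CPUExecutionProvider]" into a list."""
    stripped = value.strip().strip("[]")
    return [p.strip() for p in stripped.split(",") if p.strip()]

print(parse_providers("[CoreMLExecutionProvider]"))  # ['CoreMLExecutionProvider']
```

If the env var is parsed along these lines, the CoreML preference should reach the session constructor, which makes the CPU-only behavior of the get_model() path look like a provider-selection issue further downstream.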

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
