perf(inference_models): batch instance-seg RLE mask encoding by alexnorell · Pull Request #2517 · roboflow/inference

alexnorell · 2026-06-23T17:38:31Z

What

Batch the instance-segmentation RLE mask encoding so post-processing does one device→host transfer per image instead of one per detection.

Why

Instance-seg post-processing encoded masks one detection at a time via align_instance_segmentation_results_to_rle_masks, whose torch_mask_to_coco_rle performs a .cpu() sync per mask (lengths.cpu().tolist() + a GPU scalar read). On Jetson those per-detection syncs serialize the GPU ~2·N times per frame (N = number of detections) and dominate seg post-processing. They are a major contributor to the ~50% throughput drop seen running rf-detr-seg on a JetPack 6.2 device vs. detection.

How

New torch_masks_to_coco_rle_batch(masks) in models/common/rle_utils.py: a single [N,H,W]→[H,W,N] device→host transfer + one vectorized pycocotools.mask.encode.
Switch the RLE post-process to the already-existing batched align_instance_segmentation_results (the same routine the dense path uses) followed by the batched encode, replacing the per-detection generator loop.
Applied to: RF-DETR (the non-Triton fallback path; the Triton fast path on main is unchanged) and YOLOv5 / YOLOv7 / YOLOv8 / YOLO26 / YOLACT.
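
For reviewers who want the shape of the batched path without opening the diff, here is a minimal sketch of the idea. It is not the code in models/common/rle_utils.py: the names to_coco_batch_layout and encode_masks_batch are hypothetical, the input is assumed to be an already-binarized [N,H,W] stack, and the transpose is done on the host after the copy (the PR may permute on device first). The only library behavior relied on is that pycocotools.mask.encode accepts a Fortran-ordered uint8 array of shape [H,W,N] and returns one RLE dict per mask.

```python
import numpy as np


def to_coco_batch_layout(masks):
    """Turn an [N, H, W] binary mask stack into the [H, W, N] uint8
    Fortran-ordered array that pycocotools.mask.encode expects.

    `masks` may be a NumPy array or any tensor exposing .detach()/.cpu()/.numpy()
    (for example a torch.Tensor). The duck typing is an assumption of this sketch.
    """
    if hasattr(masks, "cpu"):
        # One device->host transfer for the whole batch, not one per mask.
        masks = masks.detach().cpu().numpy()
    masks = np.asarray(masks)
    # Move N to the last axis, then force uint8 and column-major storage.
    return np.asfortranarray(masks.transpose(1, 2, 0), dtype=np.uint8)


def encode_masks_batch(masks):
    """Encode every mask with a single pycocotools call (hypothetical helper)."""
    # Imported lazily so the layout helper stays usable without pycocotools.
    from pycocotools import mask as mask_utils

    batch = to_coco_batch_layout(masks)
    if batch.shape[2] == 0:
        # No detections: skip the encoder entirely.
        return []
    # Returns a list of {"size": [H, W], "counts": bytes}, one per mask.
    return mask_utils.encode(batch)
```

The point of the split is that everything touching the device happens once, in to_coco_batch_layout; the encode step then runs purely on host memory.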

Behavior-preserving

Output is byte-identical — same boxes, same compressed RLE counts. Locked by a new test, tests/unit_tests/models/common/test_rle_batch_encode.py:

torch_masks_to_coco_rle_batch == per-mask torch_mask_to_coco_rle (random / single / all-zero / all-one / empty).
align_instance_segmentation_results + batch encode == the per-detection generator, for both boxes and RLE, across no-crop / scaled / static-crop / odd-dims / single / empty cases.
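
The equivalence being locked can be illustrated with a NumPy-only stand-in. This is a sketch, not the contents of test_rle_batch_encode.py: rle_counts_single and rle_counts_batch are hypothetical helpers that compute uncompressed COCO-style counts (column-major runs, starting with the run of zeros) rather than calling the real torch or pycocotools functions. The case names mirror the list above.

```python
import numpy as np


def rle_counts_single(mask):
    """Uncompressed COCO-style counts for one [H, W] binary mask.

    Pixels are read column by column; counts alternate runs of 0s and 1s
    and always begin with the run of 0s (which may have length 0).
    """
    flat = np.asarray(mask, dtype=np.uint8).ravel(order="F")
    if flat.size == 0:
        return []
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    bounds = np.concatenate(([0], change, [flat.size]))
    counts = np.diff(bounds).tolist()
    return [0] + counts if flat[0] == 1 else counts


def rle_counts_batch(masks):
    """Same counts for a whole [N, H, W] stack, computed in one pass."""
    masks = np.asarray(masks, dtype=np.uint8)
    n, h, w = masks.shape
    size = h * w
    if n == 0 or size == 0:
        return [[] for _ in range(n)]
    # Column-major flattening of every mask: [N, H, W] -> [N, W, H] -> [N, H*W].
    flat = masks.transpose(0, 2, 1).reshape(n, size)
    # A run starts at the first pixel of each mask and wherever the value flips.
    starts = np.ones((n, size), dtype=bool)
    starts[:, 1:] = flat[:, 1:] != flat[:, :-1]
    start_idx = np.flatnonzero(starts.ravel())
    lengths = np.diff(np.append(start_idx, n * size))
    runs_per_mask = starts.sum(axis=1)
    # Split the single lengths array back into one piece per mask.
    pieces = np.split(lengths, np.cumsum(runs_per_mask)[:-1])
    return [
        ([0] if first == 1 else []) + piece.tolist()
        for first, piece in zip(flat[:, 0], pieces)
    ]


rng = np.random.default_rng(0)
cases = {
    "random": rng.integers(0, 2, size=(5, 7, 9)),
    "single": rng.integers(0, 2, size=(1, 4, 6)),
    "all_zero": np.zeros((3, 4, 5), dtype=np.uint8),
    "all_one": np.ones((3, 4, 5), dtype=np.uint8),
    "empty": np.zeros((0, 4, 5), dtype=np.uint8),
}
for name, stack in cases.items():
    # The batched result must match the per-mask result exactly.
    assert rle_counts_batch(stack) == [rle_counts_single(m) for m in stack], name
```

Comparing the full per-mask output against the batched output, rather than spot-checking a few values, is what makes a "byte-identical" claim meaningful.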

Test plan

pytest tests/unit_tests/models/common/test_rle_batch_encode.py tests/unit_tests/models/common/test_rle_utils.py → green (CUDA cases skip on CPU).
No accuracy change expected; this is a scheduling/transfer optimization only.

Draft: opening for review of the approach + scope before marking ready.

Instance-segmentation post-processing encoded masks one detection at a time via align_instance_segmentation_results_to_rle_masks, whose torch_mask_to_coco_rle does a device->host .cpu() sync per detection. On Jetson those per-detection syncs serialize the GPU N times per frame and dominate seg post-processing. Replace it with the batched align_instance_segmentation_results plus a new torch_masks_to_coco_rle_batch helper that encodes all masks with a single device->host transfer and one vectorized pycocotools.mask.encode call. Output is byte-identical (locked by the added equivalence tests). Applied to RF-DETR (non-Triton fallback) and YOLOv5/7/8/26 + YOLACT.

alexnorell and others added 2 commits June 23, 2026 12:37

Merge branch 'main' into perf/seg-rle-batch-encode

a934a66

dkosowski87 added the review-auto label Jun 26, 2026

alexnorell wants to merge 2 commits into main from perf/seg-rle-batch-encode
