RoiAlign CPU is not aligned to pixel centers (per the Mask RCNN paper and Facebook's Detectron2 implementation) #6921




Describe the bug
The RoiAlign operator, per the Mask RCNN paper and Facebook Research's Detectron 2 implementation, aligns sampling points over the centers of the pixels, but ORT's CPU implementation is misaligned by half a pixel. After comparing ORT to various references (table below), I see that the current ORT code duplicates PyTorch's earlier bug in roi_align, which offset the output subsample by 0.5 but forgot to adjust the input sample to compensate (see their comment in the code: "the original roi_align (aligned=False) does not subtract the 0.5 when computing neighboring pixel indices and therefore it uses pixels with a slightly incorrect alignment (relative to our pixel model) when performing bilinear interpolation").

From the paper, note pixel centers used for interpolation:

This isn't as evident for larger input image regions, where the misalignment matters less relative to the overall region size, but it makes quite a difference for smaller regions. Even identity cases, where the region of interest exactly matches the output tensor size, are misaligned. For example, taking the middle 2x2 slice of a 4x4 input to a 2x2 output (integer coordinates, no scale factor) should yield exactly that input slice, but ORT's results are shifted half a pixel off.
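The half-pixel shift can be reproduced with a minimal NumPy sketch of the sampling step. This is not ORT's or torchvision's code: `bilinear` and `roi_align_sketch` are hypothetical helpers, the `aligned` flag only mirrors torchvision's parameter of the same name, one sample is taken per output bin (which I assume is what the adaptive sampling ratio works out to when bins are no larger than a pixel), and out-of-range coordinates are simply clamped. The 6x6 input is the one from the example under "Additional context".

```python
import numpy as np

def bilinear(img, y, x):
    """Interpolate img at (y, x); integer coordinates index element centers."""
    h, w = img.shape
    y = min(max(y, 0.0), h - 1.0)  # simplification: clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bottom = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def roi_align_sketch(img, roi, out_size, aligned):
    """One bilinear sample at the center of each output bin (assumption)."""
    x1, y1, x2, y2 = roi
    offset = 0.5 if aligned else 0.0  # the correction the legacy path omits
    out_h, out_w = out_size
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    out = np.empty(out_size, dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            y = y1 - offset + (i + 0.5) * bin_h
            x = x1 - offset + (j + 0.5) * bin_w
            out[i, j] = bilinear(img, y, x)
    return out

# Element value is 10*row + col, as in the 6x6 example tensor.
img = np.array([[10 * r + c for c in range(6)] for r in range(6)], dtype=np.float64)
roi = (1.0, 1.0, 3.0, 3.0)

print(roi_align_sketch(img, roi, (4, 4), aligned=True))   # first row: 8.25 8.75 9.25 9.75
print(roi_align_sketch(img, roi, (4, 4), aligned=False))  # first row: 13.75 14.25 14.75 15.25
print(roi_align_sketch(img, roi, (2, 2), aligned=True))   # identity case: [[11 12] [21 22]]
```

Under these assumptions the only difference between the two modes is the 0.5 subtracted from the region origin, which moves every sample half an element up and to the left.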

Relevant Links

No deadline.

System information

  • OS Platform and Distribution: NA, but Windows 10 recent selfhost
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: 1.7
  • Python version: NA
  • Visual Studio version (if applicable): VS2019
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA

To Reproduce

Expected behavior

  • For the identity test case:
    • Expected output: [[[[11, 12], [21, 22]]]]
    • Actual output: [[[[5.50, 5.75], [8.00, 8.25]]]]
  • For the detectron test case:
    • Expected output: [[[[ 8.25, 8.75, 9.25, 9.75], [13.25, 13.75, 14.25, 14.75], [18.25, 18.75, 19.25, 19.75], [23.25, 23.75, 24.25, 24.75]]]]
    • Actual output: [[[[6.1875, 6.75, 6.75, 7.3125], [11.8125, 12.375, 12.375, 12.9375], [11.8125, 12.375, 12.375, 12.9375], [17.4375, 18, 18, 18.5625]]]]


Additional context

This affects the faster_rcnn and mask_rcnn models in WinML. Their expected output results appear to have been recorded on CPU with the incorrect alignment in the first place, whereas DML follows half-pixel alignment (matching Detectron 2) and so produces results that differ from the output .PB files.

For an example case (modified from the Detectron test case), and comparison to other framework results:

Input Tensor =
              0.0 1.0 2.0 3.0 4.0 5.0 6.0
               |.5 |.5 |.5 |.5 |.5 |.5 |
        0.0___ |_|_|_|_|_|_|_|_|_|_|_|_|
        1.0___[| 0,| 1,| 2,| 3,| 4,| 5 ]
    /|\ 2.0___[|10,┃11,┃12,┃13,|14,|15 ]
    \|/ 3.0___[|20,┃21,┃22,┃23,|24,|25 ]
        4.0___[|30,|31,|32,|33,|34,|35 ]
        5.0___[|40,|41,|42,|43,|44,|45 ]
        6.0___[|50,|51,|52,|53,|54,|55 ]

Active region of interest = [[1.0, 1.0, 3.0, 3.0]] // a 2x2 window over the input elements
Input tensor window = [[11,12],[21,22]]
Output tensor size = [4,4]
Source, followed by its 4x4 output from the first 2x2 region:

✔ FB Research Detectron 2 (Mask RCNN paper)
[ 8.25,  8.75,  9.25,  9.75],
[13.25, 13.75, 14.25, 14.75],
[18.25, 18.75, 19.25, 19.75],
[23.25, 23.75, 24.25, 24.75]

✔ ONNX Runtime DML EP (ROI_ALIGN 0)
[ 8.25,  8.75,  9.25,  9.75],
[13.25, 13.75, 14.25, 14.75],
[18.25, 18.75, 19.25, 19.75],
[23.25, 23.75, 24.25, 24.75]

✔ ONNX Runtime 1.7 CPU Resize + Slice
[ 8.25,  8.75,  9.25,  9.75],
[13.25, 13.75, 14.25, 14.75],
[18.25, 18.75, 19.25, 19.75],
[23.25, 23.75, 24.25, 24.75]

torchvision.ops.roi_align(aligned=True…)
[ 8.25,  8.75,  9.25,  9.75],
[13.25, 13.75, 14.25, 14.75],
[18.25, 18.75, 19.25, 19.75],
[23.25, 23.75, 24.25, 24.75]

torchvision.ops.roi_align(aligned=False…)
*deprecated, legacy flag still exists
[13.75, 14.25, 14.75, 15.25],
[18.75, 19.25, 19.75, 20.25],
[23.75, 24.25, 24.75, 25.25],
[28.75, 29.25, 29.75, 30.25]

ONNX Runtime 1.7 CPU EP RoiAlign
[13.75, 14.25, 14.75, 15.25],
[18.75, 19.25, 19.75, 20.25],
[23.75, 24.25, 24.75, 25.25],
[28.75, 29.25, 29.75, 30.25]

tf.image.crop_and_resize(…)
*Note boxes are normalized 0 to 1 (so /5 each ROI element)
[11.00, 11.66, 12.33, 13.00],
[17.66, 18.33, 19.00, 19.66],
[24.33, 25.00, 25.66, 26.33],
[31.00, 31.66, 32.33, 33.00]

tf.image.resize_bilinear(align_corners=True…) + tf.slice
[11.00, 11.66, 12.33, 13.00],
[17.66, 18.33, 19.00, 19.66],
[24.33, 25.00, 25.66, 26.33],
[31.00, 31.66, 32.33, 33.00]

tf.image.resize_bilinear(align_corners=False…) + tf.slice
[11.00, 11.50, 12.00, 12.50],
[16.00, 16.50, 17.00, 17.50],
[21.00, 21.50, 22.00, 22.50],
[26.00, 26.50, 27.00, 27.50]

tf.image.resize_bilinear(half_pixel_centers=True…) + tf.slice
[ 8.25,  8.75,  9.25,  9.75],
[13.25, 13.75, 14.25, 14.75],
[18.25, 18.75, 19.25, 19.75],
[23.25, 23.75, 24.25, 24.75]

(todo) torch.nn.functional.interpolate
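The two resize_bilinear rows that differ only in half_pixel_centers can be explained by how each convention maps an output index back to a fractional input index. The sketch below is an illustration rather than TensorFlow's code: `source_coords` and `plane` are hypothetical helpers, and it assumes a 6-to-12 upsample followed by keeping rows and columns 2 through 5 for both rows of the table. Because the example input is 10*row + col, bilinear interpolation at an interior point (y, x) is exactly 10*y + x, so no image lookup is needed.

```python
import numpy as np

def source_coords(out_indices, in_size, out_size, half_pixel_centers):
    """Map output indices of a 1-D resize to fractional input indices."""
    scale = in_size / out_size
    idx = np.asarray(out_indices, dtype=np.float64)
    if half_pixel_centers:
        return (idx + 0.5) * scale - 0.5  # sample at output pixel centers
    return idx * scale                    # legacy: sample at top-left corners

def plane(coords):
    """Value of the 10*row + col example input at every (y, x) pair."""
    return 10.0 * coords[:, None] + coords[None, :]

window = np.arange(2, 6)  # rows/columns 2..5 of the 12x12 resized image (assumed)
half_pixel = plane(source_coords(window, 6, 12, half_pixel_centers=True))
legacy = plane(source_coords(window, 6, 12, half_pixel_centers=False))

print(half_pixel[0])  # 8.25, 8.75, 9.25, 9.75
print(legacy[0])      # 11.0, 11.5, 12.0, 12.5
```

The half-pixel mapping lands on input indices 0.75, 1.25, 1.75, 2.25, the same positions the aligned RoiAlign references sample, while the legacy mapping lands on 1.0, 1.5, 2.0, 2.5.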

Even the ONNX backend conformance test case has these misaligned numbers:

PyTorch sample code:

# pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio===0.7.2 -f
import torch
import torchvision
print("PyTorch version:", torch.__version__)

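input = [[[[ 0,  1,  2,  3,  4,  5], # NCHW
           [10, 11, 12, 13, 14, 15],
           [20, 21, 22, 23, 24, 25],
           [30, 31, 32, 33, 34, 35],
           [40, 41, 42, 43, 44, 45],
           [50, 51, 52, 53, 54, 55]]]]
boxes = [[0, 1,1,3,3]] # [batch index, x1, y1, x2, y2]
output_size = [4,4]
aligned=True # Correct
#aligned=False # Legacy setting

output = torchvision.ops.roi_align(
    torch.tensor(input, dtype=torch.float),
    torch.tensor(boxes, dtype=torch.float),
    output_size,
    aligned=aligned)
print("output:\n", output)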


TensorFlow sample code:

# pip install tensorflow-gpu==1.15.0
import os
import numpy as np
import tensorflow.compat.v1 as tf

input = [[ # NHWC
            [[ 0.], [ 1.], [ 2.], [ 3.], [ 4.], [ 5.]],
            [[10.], [11.], [12.], [13.], [14.], [15.]],
            [[20.], [21.], [22.], [23.], [24.], [25.]],
            [[30.], [31.], [32.], [33.], [34.], [35.]],
            [[40.], [41.], [42.], [43.], [44.], [45.]],
            [[50.], [51.], [52.], [53.], [54.], [55.]]
        ]]
boxes = [[1/5,1/5,3/5,3/5],[3/5,3/5,4/5,4/5]] # Normalized 0.0 to 1.0 (where 1.0 = width - 1 and height - 1)
box_indices = [0, 0] # Batch indices per corresponding region
crop_size = [4, 4] # Output tensor size HW

print("TensorFlow version:", tf.__version__) # 1.15.0 (cpu/cuda)

# Using half_pixel_centers=True is correct (not align_corners=True)
output_size = [6*2, 6*2]
resize_output = tf.image.resize_bilinear(tf.constant(input), output_size, align_corners=False, half_pixel_centers=True)
resize_bilinear_slice_output = tf.slice(resize_output, [0,2,2,0], [1,4,4,1])

# Note crop_and_resize doesn't scale the image boundaries to pixel centers, but always to corners,
# and there is sadly no flag to influence this (unlike resize_bilinear).
method = 'bilinear'
extrapolation_value = 0
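crop_and_resize_output = tf.image.crop_and_resize(
    image=tf.constant(input, dtype=tf.float32), # NHWC
    boxes=tf.constant(boxes, dtype=tf.float32),
    box_ind=tf.constant(box_indices, dtype=tf.int32),
    crop_size=tf.constant(crop_size, dtype=tf.int32),
    method=method,
    extrapolation_value=extrapolation_value)

with tf.Session() as session:
    with np.printoptions(precision=3, suppress=True):
        print("input:\n", input)
        print("resize_bilinear + slice:\n", session.run(resize_bilinear_slice_output))
        print("crop_and_resize:\n", session.run(crop_and_resize_output))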
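To see why crop_and_resize cannot match the pixel-center result, it helps to write out the coordinates it samples. The sketch below assumes the documented behavior that a normalized box edge v maps to input index v * (size - 1) and that the crop samples are spread evenly from one edge to the other, corners included. `crop_and_resize_coords` is a hypothetical helper, not a TensorFlow function, and the input is again treated as the plane 10*row + col.

```python
import numpy as np

def crop_and_resize_coords(lo, hi, in_size, crop_size):
    """Fractional input indices sampled along one axis of a normalized box."""
    return np.linspace(lo * (in_size - 1), hi * (in_size - 1), crop_size)

# First box from the sample above: [1/5, 1/5, 3/5, 3/5] on a 6x6 input.
coords = crop_and_resize_coords(1 / 5, 3 / 5, in_size=6, crop_size=4)
print(coords)  # evenly spaced from 1.0 to 3.0 inclusive

# The input is linear (10*row + col), so interior bilinear samples equal 10*y + x.
crop = 10.0 * coords[:, None] + coords[None, :]
print(np.round(crop, 2))
```

The samples start exactly on element 11 and end exactly on element 33, which corresponds to the crop_and_resize values in the comparison above (shown there truncated to two decimals).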

Facebook research's Detectron 2 test code:

class ROIAlignTest(unittest.TestCase):
    def test_forward_output(self):
        input = np.arange(25).reshape(5, 5).astype("float32")
        """
        0  1  2   3 4
        5  6  7   8 9
        10 11 12 13 14
        15 16 17 18 19
        20 21 22 23 24
        """

        output = self._simple_roialign(input, [1, 1, 3, 3], (4, 4), aligned=False)
        output_correct = self._simple_roialign(input, [1, 1, 3, 3], (4, 4), aligned=True)

        # without correction:
        old_results = [
            [7.5, 8, 8.5, 9],
            [10, 10.5, 11, 11.5],
            [12.5, 13, 13.5, 14],
            [15, 15.5, 16, 16.5],
        ]

        # with 0.5 correction:
        correct_results = [
            [4.5, 5.0, 5.5, 6.0],
            [7.0, 7.5, 8.0, 8.5],
            [9.5, 10.0, 10.5, 11.0],
            [12.0, 12.5, 13.0, 13.5],
        ]
        # This is an upsampled version of [[6, 7], [11, 12]]




