Description
Describe the bug
`cucim.skimage.transform.PiecewiseAffineTransform` seems to be several times slower than its scikit-image equivalent.
Steps/Code to reproduce bug
When running the code below, I observe an 8x slowdown for the `estimate` call and a 2x slowdown for the `warp` call, using the PyTorch 24.01 container with cucim 23.12.
Expected behavior
The GPU code should execute at least as fast as the CPU version.
Environment details (please complete the following information):
Docker on Ubuntu 22.04
PyTorch 24.01 container with scikit-image and cucim 23.12 installed via pip
Additional context
```python
import matplotlib.pyplot as plt
from skimage.transform import PiecewiseAffineTransform, warp
from scipy.interpolate import LinearNDInterpolator
import numpy as np
from timeit import default_timer as timer
from cucim.skimage.transform import PiecewiseAffineTransform as cu_PAT
from cucim.skimage.transform import warp as cu_warp
import cupy as cp
# create some offsets and coordinates
vectors = np.array([[3.0,1.0],[-5.,-1.3],[-3.5,8.3],[0,0],[0,0],[0,0], [0,0]])
coords = np.array([[20,20],[180,50],[20, 180],[0,0],[0,255],[255,0], [255,255]])
# Create grid
step_size = 20
x = np.linspace(0, 255, num=step_size)
y = np.linspace(0, 255, num=step_size)
X, Y = np.meshgrid(x, y)
interpx = LinearNDInterpolator(list(coords), vectors[:,0])
Zxi = interpx(Y, X)
interpy = LinearNDInterpolator(list(coords), vectors[:,1])
Zyi = interpy(Y, X)
# create an array of coords
src = np.column_stack((X.reshape(-1), Y.reshape(-1)))
# add the interpolated offsets
dst_rows = X + Zxi
dst_cols = Y + Zyi
dst = np.column_stack([dst_cols.reshape(-1), dst_rows.reshape(-1)])
# NOTE: `imgrid` was not defined in the snippet as posted; a random
# 256x256 float image is ASSUMED here as a stand-in so the script runs.
imgrid = np.random.rand(256, 256)
# compute transforms
tform = PiecewiseAffineTransform()
start = timer()
tform.estimate(src, dst)
print("cpu estimate took {}s".format(timer()-start))
start = timer()
out = warp(imgrid, tform, output_shape=(255, 255))
print("cpu warp took {}s".format(timer()-start))
# repeat using cupy/cucim.skimage
cu_tform = cu_PAT()
start = timer()
cu_tform.estimate(cp.array(src), cp.array(dst))
print("gpu estimate took {}s".format(timer()-start))
start = timer()
out = cu_warp(cp.array(imgrid), cu_tform, output_shape=(255, 255))
print("gpu warp took {}s".format(timer()-start))
```
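One caveat about the measurements above: each call is timed once, the GPU timings include the host-to-device copies done by `cp.array(...)`, and CuPy may return from a call before the GPU work has finished. The helper below is a minimal sketch of a more careful timing loop, not part of cuCIM or scikit-image. The name `time_call` and its parameters are hypothetical. It runs an untimed warm-up, repeats the measurement, and calls an optional `sync` function before starting and stopping the clock.

```python
from timeit import default_timer as timer


def time_call(fn, *args, sync=None, warmup=1, repeat=5, **kwargs):
    """Return the best wall-clock time (seconds) over `repeat` runs of `fn`.

    `sync` is an optional zero-argument callable (hypothetical parameter)
    invoked right before the clock starts and right before it stops, so
    that pending asynchronous work is counted in the measured interval.
    """
    # Untimed warm-up runs absorb one-off costs such as kernel compilation.
    for _ in range(warmup):
        fn(*args, **kwargs)

    best = float("inf")
    for _ in range(repeat):
        if sync is not None:
            sync()  # make sure earlier work does not leak into this run
        start = timer()
        fn(*args, **kwargs)
        if sync is not None:
            sync()  # wait for the work launched by fn to complete
        best = min(best, timer() - start)
    return best
```

For the GPU path, one possible usage would be to move the data first with `src_gpu = cp.asarray(src)` and `dst_gpu = cp.asarray(dst)`, then call `time_call(cu_tform.estimate, src_gpu, dst_gpu, sync=cp.cuda.Stream.null.synchronize)`. The CPU calls can use the same helper with `sync` left as `None`. CuPy also provides `cupyx.profiler.benchmark` for this kind of measurement. Whether this changes the reported ratios is not known; it only makes the two sets of numbers more comparable.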