Summary
sklearnex.linear_model.Ridge on GPU is very slow with dpnp arrays when the input is Fortran-contiguous. The same array API path with C-contiguous dpnp arrays is fast.
This showed up in a benchmark case using make_regression(n_samples=10_000_000, n_features=2) with array_api_dispatch=True, format=dpnp, and default benchmark data:order=F. The case appears to hang because a single fit/validation call runs far past the benchmark time limit.
Reproducer
import time

import dpnp
from sklearn.datasets import make_regression
from sklearnex import config_context
from sklearnex.linear_model import Ridge


def timed(label, fn, queue):
    t0 = time.perf_counter()
    out = fn()
    queue.wait()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out


X0, y0 = make_regression(
    n_samples=1_000_000,
    n_features=2,
    n_informative=2,
    noise=0.1,
    random_state=0,
)
y0 = y0.reshape(-1, 1)

for order in ["C", "F"]:
    X = dpnp.asarray(X0, order=order)
    y = dpnp.asarray(y0, order=order)
    print(f"order={order}, device={X.device}")
    with config_context(
        array_api_dispatch=True,
        allow_fallback_to_host=False,
        allow_sklearn_after_onedal=False,
    ):
        timed("fit #1", lambda: Ridge().fit(X, y), X.sycl_queue)
        timed("fit #2", lambda: Ridge().fit(X, y), X.sycl_queue)
Observed timings
On Intel Arc B390 GPU, Level Zero backend:
order=C, device=Device(level_zero:gpu:0)
fit #1: 0.762s
fit #2: 0.067s
order=F, device=Device(level_zero:gpu:0)
fit #1: 13.610s
fit #2: 13.688s
Timing onedal.utils.validation.check_all_finite(X) and a direct oneDAL Ridge fit separately shows that both contribute to this slowdown.
Expected
Fortran-contiguous dpnp input should not be orders of magnitude slower than C-contiguous dpnp input for this tall/narrow Ridge case, or it should be copied/normalized once to an efficient layout.
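As a possible interim workaround, the layout can be normalized once before calling fit. The sketch below is an illustration only: it uses NumPy as a stand-in so it runs without a GPU, the helper name to_c_contiguous is hypothetical, and it assumes the array namespace passed as xp provides ascontiguousarray and that arrays expose flags.c_contiguous (dpnp appears to offer both, but this has not been verified against the versions listed in this report).

```python
import numpy as np


def to_c_contiguous(X, xp=np):
    """Return X unchanged if it is already C-contiguous, else a C-contiguous copy.

    Hypothetical helper, not part of sklearnex or dpnp.
    Assumes xp.ascontiguousarray and X.flags.c_contiguous exist.
    """
    if X.flags.c_contiguous:
        return X
    return xp.ascontiguousarray(X)


# Small Fortran-ordered table standing in for the tall/narrow benchmark input.
X_f = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(6, 2))
X_c = to_c_contiguous(X_f)

print(X_f.flags.c_contiguous, X_c.flags.c_contiguous)  # False True
```

With dpnp input, the call would presumably look like to_c_contiguous(X, xp=dpnp) before Ridge().fit(X, y). The copy costs one extra pass over the data and a second buffer on the device, which should be small next to the 13 s fits reported above.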
Notes from source inspection
The Ridge GPU normal-equation path uses row_accessor and processes the table in row blocks.
The finiteness checker GPU path also converts the table to a device 1D ndarray before reduction.
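To make the layout concern concrete, the following pure-Python sketch lists the flat memory offsets touched when a block of rows is read from an (n_rows, 2) table stored in C order versus Fortran order. It only illustrates the two memory layouts; it is not oneDAL code, the function names are hypothetical, and whether this access pattern is the actual cause of the slowdown is not confirmed here.

```python
def element_offsets(n_rows, n_cols, order, row_start, row_stop):
    """Flat element offsets visited when reading rows [row_start, row_stop)."""
    if order == "C":
        # Row-major: element (i, j) is stored at i * n_cols + j.
        return [i * n_cols + j
                for i in range(row_start, row_stop)
                for j in range(n_cols)]
    if order == "F":
        # Column-major: element (i, j) is stored at j * n_rows + i.
        return [j * n_rows + i
                for i in range(row_start, row_stop)
                for j in range(n_cols)]
    raise ValueError(f"unknown order: {order!r}")


def is_single_contiguous_run(offsets):
    """True if the offsets, once sorted, form one gap-free range."""
    s = sorted(offsets)
    return all(b - a == 1 for a, b in zip(s, s[1:]))


n_rows, n_cols = 1_000_000, 2
c_block = element_offsets(n_rows, n_cols, "C", 0, 4)
f_block = element_offsets(n_rows, n_cols, "F", 0, 4)

print(c_block)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(f_block)  # [0, 1000000, 1, 1000001, 2, 1000002, 3, 1000003]
```

In C order a row block is one contiguous range, while in Fortran order every row straddles two locations n_rows elements apart, so a row-oriented reader has to gather or transpose the data.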
Local environment
scikit-learn: 1.8.0
scikit-learn-intelex: 2199.9.9
onedal python: 2021.6
dpnp: 0.20.0
dpctl: 0.22.1
numpy: 2.4.6
device: Intel(R) Arc(TM) B390 GPU
backend: level_zero