F-order dpnp arrays make Ridge GPU array API extremely slow #3235

@cakedev0

Summary

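sklearnex.linear_model.Ridge on GPU is very slow with dpnp arrays when the input is Fortran-contiguous. In the reproducer below, a warm fit takes about 13.7 s with F-order input versus about 0.07 s through the same array API path with C-contiguous input, roughly a 200x gap.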

This showed up in a benchmark case using make_regression(n_samples=10_000_000, n_features=2) with array_api_dispatch=True, format=dpnp, and the benchmark's default data:order=F. The case appears to hang because a single fit/validation call runs far past the benchmark time limit.

Reproducer

import time

import dpnp
from sklearn.datasets import make_regression
from sklearnex import config_context
from sklearnex.linear_model import Ridge


def timed(label, fn, queue):
    t0 = time.perf_counter()
    out = fn()
    queue.wait()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out


X0, y0 = make_regression(
    n_samples=1_000_000,
    n_features=2,
    n_informative=2,
    noise=0.1,
    random_state=0,
)
y0 = y0.reshape(-1, 1)

for order in ["C", "F"]:
    X = dpnp.asarray(X0, order=order)
    y = dpnp.asarray(y0, order=order)

    print(f"order={order}, device={X.device}")
    with config_context(
        array_api_dispatch=True,
        allow_fallback_to_host=False,
        allow_sklearn_after_onedal=False,
    ):
        timed("fit #1", lambda: Ridge().fit(X, y), X.sycl_queue)
        timed("fit #2", lambda: Ridge().fit(X, y), X.sycl_queue)

Observed timings

On Intel Arc B390 GPU, Level Zero backend:

order=C, device=Device(level_zero:gpu:0)
fit #1: 0.762s
fit #2: 0.067s
order=F, device=Device(level_zero:gpu:0)
fit #1: 13.610s
fit #2: 13.688s

Timing onedal.utils.validation.check_all_finite(X) and the direct oneDAL Ridge fit separately shows that both contribute to this slowdown.

Expected

Fortran-contiguous dpnp input should not be orders of magnitude slower than C-contiguous dpnp input for this tall/narrow Ridge case, or it should be copied/normalized once to an efficient layout.
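To make the "copied/normalized once" option concrete, here is a minimal host-side sketch using NumPy as a stand-in for dpnp. The helper name to_row_major is hypothetical and is not part of sklearnex or oneDAL. dpnp exposes an ascontiguousarray function of the same name, but whether the device-side copy is cheap enough, and where such a normalization would belong in the library, are assumptions this sketch does not settle.

```python
import numpy as np


def to_row_major(X):
    """Hypothetical helper: return X as-is if it is already C-contiguous,
    otherwise make exactly one C-ordered copy."""
    if X.flags["C_CONTIGUOUS"]:
        return X
    return np.ascontiguousarray(X)


# Small Fortran-ordered array standing in for the tall/narrow X in the reproducer.
X_f = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(6, 2))
X_c = to_row_major(X_f)

print(X_f.flags["C_CONTIGUOUS"], X_c.flags["C_CONTIGUOUS"])
```

As a user-side workaround, applying the equivalent conversion to X and y before calling Ridge().fit(X, y) should route the data through the fast C-order path measured above, at the cost of one extra copy.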

Notes from source inspection

The Ridge GPU normal-equation path uses row_accessor and processes the table in row blocks.
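One plausible reason row-block access interacts badly with F-order data is the memory layout itself. The sketch below is plain Python and only models element offsets in a 2D buffer. It does not reproduce oneDAL's row_accessor, and the function names are hypothetical. It shows that the two features of one row are adjacent in C order but n_samples elements apart in F order, so gathering rows from an F-ordered table is a strided read.

```python
def strides_in_elements(shape, order):
    """Element strides (row_stride, col_stride) for a dense 2D array."""
    n_rows, n_cols = shape
    if order == "C":
        return (n_cols, 1)
    if order == "F":
        return (1, n_rows)
    raise ValueError(f"unknown order: {order!r}")


def row_offsets(row, shape, order):
    """Flat buffer offsets of every element in one row."""
    row_stride, col_stride = strides_in_elements(shape, order)
    return [row * row_stride + col * col_stride for col in range(shape[1])]


shape = (1_000_000, 2)  # same shape as X in the reproducer
print(row_offsets(3, shape, "C"))  # [6, 7]
print(row_offsets(3, shape, "F"))  # [3, 1000003]
```

Whether this access pattern, a layout conversion per block, or something else dominates the 13 s fit has not been confirmed by profiling.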

The finiteness checker GPU path also converts the table to a device 1D ndarray before reduction.

Local environment

scikit-learn: 1.8.0
scikit-learn-intelex: 2199.9.9
onedal python: 2021.6
dpnp: 0.20.0
dpctl: 0.22.1
numpy: 2.4.6
device: Intel(R) Arc(TM) B390 GPU
backend: level_zero
