F-order dpnp arrays make Ridge GPU array API extremely slow


### Summary

`sklearnex.linear_model.Ridge` on GPU is very slow with `dpnp` arrays when the input is Fortran-contiguous. The same array API path with C-contiguous `dpnp` arrays is fast.

This showed up in a benchmark case using `make_regression(n_samples=10_000_000, n_features=2)` with `array_api_dispatch=True`, `format=dpnp`, and default benchmark `data:order=F`. The case appears to hang because a single fit/validation call runs far past the benchmark time limit.

### Reproducer

```python
import time

import dpnp
from sklearn.datasets import make_regression
from sklearnex import config_context
from sklearnex.linear_model import Ridge


def timed(label, fn, queue):
    t0 = time.perf_counter()
    out = fn()
    queue.wait()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out


X0, y0 = make_regression(
    n_samples=1_000_000,
    n_features=2,
    n_informative=2,
    noise=0.1,
    random_state=0,
)
y0 = y0.reshape(-1, 1)

for order in ["C", "F"]:
    X = dpnp.asarray(X0, order=order)
    y = dpnp.asarray(y0, order=order)

    print(f"order={order}, device={X.device}")
    with config_context(
        array_api_dispatch=True,
        allow_fallback_to_host=False,
        allow_sklearn_after_onedal=False,
    ):
        timed("fit #1", lambda: Ridge().fit(X, y), X.sycl_queue)
        timed("fit #2", lambda: Ridge().fit(X, y), X.sycl_queue)
```

### Observed timings

On Intel Arc B390 GPU, Level Zero backend:

```text
order=C, device=Device(level_zero:gpu:0)
fit #1: 0.762s
fit #2: 0.067s
order=F, device=Device(level_zero:gpu:0)
fit #1: 13.610s
fit #2: 13.688s
```

Timing shows that `onedal.utils.validation.check_all_finite(X)` and the direct oneDAL Ridge fit
are both responsible for this slowdown.

### Expected

Fortran-contiguous `dpnp` input should not be orders of magnitude slower than C-contiguous `dpnp` input for this tall/narrow Ridge case, or it should be copied/normalized once to an efficient layout.
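
As a user-side stopgap (not a fix for the underlying issue), the layout can be normalized once before calling `fit`. The sketch below is a minimal illustration: `ensure_c_contiguous` is a hypothetical helper, not part of `sklearnex`, and NumPy is used as a stand-in namespace so the snippet runs without a GPU. With `dpnp`, the same helper would be called as `ensure_c_contiguous(dpnp, X)`, assuming the extra device copy is acceptable.

```python
import numpy as np  # stand-in namespace so the sketch runs without a GPU


def ensure_c_contiguous(xp, a):
    """Hypothetical helper: return `a` as-is if it is already C-contiguous,
    otherwise make a single C-ordered copy using the namespace `xp`."""
    if a.flags.c_contiguous:
        return a
    return xp.ascontiguousarray(a)


# Small Fortran-ordered array standing in for the tall/narrow X.
X_f = np.asfortranarray(np.arange(8.0).reshape(4, 2))
X_c = ensure_c_contiguous(np, X_f)

print("input C-contiguous: ", X_f.flags.c_contiguous)
print("output C-contiguous:", X_c.flags.c_contiguous)
```

The copy is paid once up front, so repeated fits on the same data would reuse the C-ordered array instead of hitting the slow path each time.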

### Notes from source inspection

The Ridge GPU normal-equation path uses `row_accessor` and processes the table in row blocks.

The finiteness checker GPU path also converts the table to a device 1D ndarray before reduction.
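
To show why layout could matter to a reader that walks the table in row blocks, the sketch below computes the byte strides of a dense `(1_000_000, 2)` float64 array in both orders. It only illustrates memory layout under the assumption of a dense, unpadded array; `strides_bytes` is a hypothetical helper, and this is not a confirmed root cause.

```python
def strides_bytes(shape, itemsize, order):
    """Byte strides (row step, column step) of a dense 2D array.

    Hypothetical helper for illustration; assumes no padding.
    """
    n_rows, n_cols = shape
    if order == "C":
        # Rows are stored one after another; a row's features are adjacent.
        return (n_cols * itemsize, itemsize)
    if order == "F":
        # Columns are stored one after another; a row's features are far apart.
        return (itemsize, n_rows * itemsize)
    raise ValueError(f"unknown order: {order!r}")


shape = (1_000_000, 2)  # same shape as the reproducer
itemsize = 8            # float64

for order in ("C", "F"):
    row_step, col_step = strides_bytes(shape, itemsize, order)
    print(
        f"order={order}: next row +{row_step} B, "
        f"next feature within a row +{col_step} B"
    )
```

In F order the two features of a single row sit 8,000,000 bytes apart, so assembling contiguous rows from such a table requires a strided gather or a transposing copy.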

### Local environment

```text
scikit-learn: 1.8.0
scikit-learn-intelex: 2199.9.9
onedal python: 2021.6
dpnp: 0.20.0
dpctl: 0.22.1
numpy: 2.4.6
device: Intel(R) Arc(TM) B390 GPU
backend: level_zero
```

