WIP: Enable array api support in neighbor #2700

Open
yuejiaointel wants to merge 150 commits into uxlfoundation:main from yuejiaointel:refactor_neighbor_array_api

Conversation

@yuejiaointel
Contributor

@yuejiaointel yuejiaointel commented Sep 30, 2025

Description

Follow-up to PR #2284 (will rebase after that one is merged); this PR refactors the neighbors estimators to the array API standard.

Summary of Neighbor Code Refactoring Changes

onedal layer changes:

  • Removed validation methods from onedal estimators: _validate_data, _get_weights, _validate_targets, _validate_n_classes - all validation moved to sklearnex layer
  • Removed unused imports: Integral, _get_config, and validation functions (_check_array, _check_classification_targets, _check_X_y, _column_or_1d, _num_samples)
  • Classification target processing (class encoding, outputs_2d_ setting) moved from onedal _fit() to sklearnex layer - onedal now expects pre-processed _y and classes_ attributes
  • Backend-specific formatting kept in onedal: GPU backend requires y in (-1, 1) shape for C++ compatibility
  • _kneighbors() method updated with full array API support: uses xp namespace for all operations (argsort, reshape, concatenate, arange, where, all)
  • Array API compatibility: changed sample_mask[:, 0][dup_gr_nbrs] = False to explicit operations using xp.where() and xp.concatenate() (array API doesn't support chained indexing assignment)
  • predict() and predict_proba() methods removed from KNeighborsClassifier - computation logic moved to sklearnex
  • _predict_skl() and predict() methods removed from KNeighborsRegressor - dispatch logic moved to sklearnex
  • Table conversion order changed: convert to table FIRST, then get params from table (ensures dtype normalization from array API dtype to numpy dtype)
  • _predict_gpu() kept in onedal for GPU backend support (called by sklearnex)

sklearnex layer changes:

  • Added @enable_array_api decorator to KNeighborsClassifier and KNeighborsRegressor (requires sklearn >= 1.5 for regressor due to y_numeric parameter)
  • Created comprehensive validation and computation methods in common.py:
    • _get_weights() - adapted from sklearn with array API support (handles dpctl/dpnp arrays)
    • _compute_weighted_prediction() - regression prediction using array API take() and stack() (array API take() only supports 1-D indices)
    • _compute_class_probabilities() - classification probabilities with array API support (avoids fancy indexing via sample-by-sample accumulation)
    • _predict_skl_regression() and _predict_skl_classification() - unified prediction helpers that call kneighbors() and compute results
    • _process_classification_targets() - handles class encoding, outputs_2d_ setting, validates n_classes >= 2
    • _process_regression_targets() - handles shape processing for regressors
    • _kneighbors_post_processing() - handles kd_tree sorting, query_is_train (X=None) case, return_distance decision
  • _onedal_fit() in all estimators now: validates data, sets effective metric, processes targets, then calls onedal backend
  • _onedal_predict() in classifier uses _predict_skl_classification() helper (handles X=None LOOCV case properly)
  • _onedal_predict_proba() computes probabilities directly in sklearnex using _compute_class_probabilities()
  • _onedal_predict() in regressor dispatches between GPU (_predict_gpu()) and SKL (_predict_skl()) paths based on device and weights
  • _onedal_kneighbors() simplified in all estimators: validates X, calls onedal, returns result (post-processing removed - now in onedal)
  • _onedal_score() in classifier uses _transfer_to_host() to convert array API arrays to numpy for sklearn's accuracy_score()
  • Added _validate_n_neighbors(), _set_effective_metric(), _validate_n_classes(), _kneighbors_validation() helper methods
  • _fit_validation() expanded with array API dtype support: dtype=[xp.float64, xp.float32]
  • _onedal_supported() updated for array API: uses get_namespace() and xp.asarray() for type checks
  • kneighbors_graph() uses _transfer_to_host() to ensure numpy arrays for scipy csr_matrix construction
  • Metric alias mapping added: cityblock → manhattan, l1 → manhattan, l2 → euclidean for oneDAL compatibility
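The sample-by-sample accumulation used by _compute_class_probabilities() can be sketched as follows. The function body here is illustrative only (the real signature and helpers in common.py may differ); it shows how weighted class votes can be accumulated without fancy-index __setitem__, which the array API standard does not guarantee:

```python
import numpy as np


def compute_class_probabilities(xp, neigh_classes, weights, n_classes):
    """Accumulate weighted class votes without fancy-index __setitem__.

    neigh_classes: (n_queries, n_neighbors) integer class index per neighbor
    weights:       (n_queries, n_neighbors) vote weights (uniform or 1/dist)
    """
    n_queries, n_neighbors = neigh_classes.shape
    rows = []
    for i in range(n_queries):
        # proba_k[all_rows, idx] += weights is not portable across array API
        # namespaces, so accumulate one sample at a time instead.
        row = xp.zeros(n_classes, dtype=weights.dtype)
        for j in range(n_neighbors):
            cls = int(neigh_classes[i, j])
            row = xp.where(xp.arange(n_classes) == cls, row + weights[i, j], row)
        rows.append(row)
    proba = xp.stack(rows)
    # Normalize each row so probabilities sum to 1.
    return proba / xp.sum(proba, axis=1, keepdims=True)
```

With numpy standing in for the array API namespace, three neighbors voting [0, 1, 1] with uniform weights yield probabilities [1/3, 2/3].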

Testing changes:

  • onedal tests (onedal/neighbors/tests/test_knn_classification.py) deprecated with notice pointing to sklearnex tests
  • Tests moved to sklearnex/neighbors/tests/test_neighbors.py: test_knn_classifier_iris(), test_knn_classifier_pickle()
  • Added test_knn_classifier_single_class() with @pytest.mark.allow_sklearn_fallback (oneDAL doesn't support one-class case)
  • Removed queue parametrization from moved tests (sklearnex handles device selection internally)

Documentation changes:

  • Added KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors, LocalOutlierFactor to array_api.rst showing array API support

Architecture pattern:

  • onedal layer = backend calls + full kneighbors post-processing + backend-specific formatting
  • sklearnex layer = validation + target processing + dispatch + prediction computation
  • Follows established pattern from PCA/SVM: validation in sklearnex, backend + post-processing in onedal

Array API compliance:

  • Cannot use chained indexing assignment: array[:, 0][mask] = value → explicit xp.where() + xp.concatenate()
  • Cannot use fancy indexing for __setitem__: proba_k[all_rows, idx] += weights → sample-by-sample accumulation
  • take() only supports 1-D indices: iterate over samples to gather neighbor values
  • Use _transfer_to_host() before sklearn utility functions (accuracy_score, csr_matrix) that require numpy
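The first constraint can be illustrated with a minimal numpy-based sketch (numpy stands in for any array API namespace; variable names follow the PR description and are otherwise illustrative):

```python
import numpy as np

xp = np  # stands in for any array API namespace

sample_mask = xp.ones((4, 3), dtype=bool)
dup_gr_nbrs = xp.asarray([False, True, False, True])

# sample_mask[:, 0][dup_gr_nbrs] = False mutates through a view, which the
# array API standard does not guarantee; rebuild the column explicitly instead.
first_col = xp.where(dup_gr_nbrs, xp.asarray(False), sample_mask[:, 0])
sample_mask = xp.concatenate(
    [xp.reshape(first_col, (-1, 1)), sample_mask[:, 1:]], axis=1
)
```

After this, the first column is False exactly where dup_gr_nbrs is True, and the remaining columns are untouched.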

Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@icfaust icfaust mentioned this pull request Sep 30, 2025
@yuejiaointel yuejiaointel force-pushed the refactor_neighbor_array_api branch 2 times, most recently from a569e0c to 62c8ddd Compare October 9, 2025 05:33
@david-cortes-intel
Contributor

As part of the PR, please add the relevant classes that will get array api support to this list now that they are documented:

@codecov

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 77.64706% with 76 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
onedal/spmd/neighbors/neighbors.py 3.57% 27 Missing ⚠️
sklearnex/neighbors/common.py 83.20% 11 Missing and 10 partials ⚠️
sklearnex/neighbors/knn_classification.py 83.58% 3 Missing and 8 partials ⚠️
sklearnex/neighbors/knn_regression.py 76.74% 6 Missing and 4 partials ⚠️
sklearnex/neighbors/knn_unsupervised.py 80.00% 3 Missing and 1 partial ⚠️
onedal/neighbors/neighbors.py 90.00% 1 Missing and 2 partials ⚠️
Flag Coverage Δ
azure 79.12% <76.76%> (-0.81%) ⬇️
github 81.42% <84.29%> (-0.49%) ⬇️


Files with missing lines Coverage Δ
sklearnex/_device_offload.py 85.00% <100.00%> (+0.38%) ⬆️
sklearnex/neighbors/_lof.py 100.00% <100.00%> (ø)
onedal/neighbors/neighbors.py 85.64% <90.00%> (+5.59%) ⬆️
sklearnex/neighbors/knn_unsupervised.py 91.78% <80.00%> (-6.47%) ⬇️
sklearnex/neighbors/knn_regression.py 89.69% <76.74%> (-10.31%) ⬇️
sklearnex/neighbors/knn_classification.py 91.47% <83.58%> (-7.18%) ⬇️
sklearnex/neighbors/common.py 88.42% <83.20%> (-4.07%) ⬇️
onedal/spmd/neighbors/neighbors.py 38.46% <3.57%> (-16.26%) ⬇️

... and 7 files with indirect coverage changes


@icfaust
Contributor

icfaust commented Oct 21, 2025

/intelci: run

Contributor

@ethanglaser ethanglaser left a comment


Several errors in private CI need to be addressed. If any help is needed reproducing/understanding the error messages let me know and I can help out.

@yuejiaointel
Contributor Author

/intelci: run

2 similar comments

@yuejiaointel yuejiaointel marked this pull request as draft October 21, 2025 21:03
@yuejiaointel
Contributor Author

/intelci: run

4 similar comments

@david-cortes-intel
Contributor

david-cortes-intel commented Oct 22, 2025

The CI error appears to come from the changes in this PR:

File "<path>/build/onedal_linux_icx/__release_lnx/daal4py-3.9/onedal/neighbors/neighbors.py", line 137, in _fit
    raise ValueError(
ValueError: Classification target processing must be done in sklearnex layer before calling onedal fit. _y attribute is not set. This indicates the refactoring is incomplete.

Note that it occurs when using the SPMD class:

File "<path>/examples/sklearnex/knn_bf_classification_spmd.py", line 64, in <module>
    model_spmd.fit(dpt_X_train, dpt_y_train)

@yuejiaointel
Contributor Author

/intelci: run

1 similar comment

@yuejiaointel
Contributor Author

yuejiaointel commented Feb 12, 2026

@ethanglaser @yuejiaointel I see now KNN estimators are being tested with the same methods and inputs as both 'standard estimator' and 'special estimator' here: https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/sklearnex/tests/test_patching.py https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/sklearnex/tests/test_run_to_run_stability.py

It looks like the only difference is that the special ones call it with algorithm="brute", while the standard ones do not specify the algorithm:

KNeighborsClassifier(algorithm="brute"),

Is that because GPU support requires algorithm="brute", or is there some other reason why it needs to be under the "special" list?

And if this is the case, should KMeans also get added to the special list with the algorithm that is supported on GPU? Actually I now see that it has this note for GPU:

‘elkan’ falls back to ‘lloyd’

I think we need both standard and special because the standard (algorithm="auto") would pick kd_tree with the test data (euclidean metric):

if self._fit_method in ["auto", "ball_tree"]:
    condition = (
        self.n_neighbors is not None
        and self.n_neighbors >= self.n_samples_fit_ // 2
    )
    if self.n_features_in_ > 15 or condition:
        result_method = "brute"
    else:
        if self.effective_metric_ in ["euclidean"]:
            result_method = "kd_tree"
        else:
            result_method = "brute"
else:
    result_method = self._fit_method

With algorithm="auto", the algorithm="brute" path would not be exercised, so we need to test it via a special instance.

@yuejiaointel
Contributor Author

/intelci: run

@yuejiaointel
Contributor Author

/intelci: run

numpy.unique_inverse returns a plain tuple on some numpy/Python
version combinations (e.g., Python 3.13), not a namedtuple with
.values/.inverse_indices attributes. Use tuple unpacking which
works for both tuple and namedtuple return types.
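The tuple-unpacking pattern described in this commit message can be sketched as follows (np.unique_inverse exists only in NumPy >= 2.0, hence the guarded fallback to the classic API; the data here is illustrative):

```python
import numpy as np

y = np.asarray([3, 1, 3, 2])

if hasattr(np, "unique_inverse"):  # NumPy >= 2.0, array-API-style function
    # Tuple unpacking works whether the result is a plain tuple or a
    # namedtuple with .values/.inverse_indices attributes.
    values, inverse = np.unique_inverse(y)
else:  # older NumPy: the classic equivalent
    values, inverse = np.unique(y, return_inverse=True)
```

In both branches, values holds the sorted unique labels and inverse maps each element of y back to its position in values.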
The previous check rejected all non-SYCL data, but dispatch() transfers
dpctl/dpnp data to host (numpy) before the second _get_backend call.
This caused dpctl/dpnp GPU tests to incorrectly fall back to sklearn.

Fix: skip numpy arrays in the check (they are always valid), only
reject non-numpy arrays that lack __sycl_usm_array_interface__
(e.g. torch XPU tensors).
@yuejiaointel
Contributor Author

/intelci: run

@yuejiaointel
Contributor Author

/intelci: run ml-benchmarks
set get-build=f1086a06-aecb-f15c-a7c9-a4bf010d0e2d

2 similar comments

@david-cortes-intel
Contributor

david-cortes-intel commented Feb 13, 2026

@yuejiaointel @ethanglaser Torch tensors with xpu device are SYCL arrays. They are supposed to execute on GPU. We used to have validation on torch back when there was a public GPU runner.

To verify whether something runs on oneDAL or not, or whether it runs on CPU, you might want to enable verbose mode for both libraries:

import os
os.environ["SKLEARNEX_VERBOSE"] = "INFO"
os.environ["ONEDAL_VERBOSE"] = "4"

If unsure about intended behaviors for edge cases, you might also want to check what our own documentation says:
https://uxlfoundation.github.io/scikit-learn-intelex/2025.10/array_api.html

It looks like SVMs are indeed broken for torch inputs, but LinearRegression on CPU and GPU for example works correctly with torch.

@david-cortes-intel
Contributor

@yuejiaointel This is still outputting float64:

import os
os.environ["SCIPY_ARRAY_API"] = "1"

import numpy as np

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
y = rng.integers(3, size=X.shape[0])

from sklearnex import config_context
from sklearnex.neighbors import KNeighborsClassifier

import polars as pl
Xdf = pl.DataFrame(X)
ys = pl.Series(y)
with config_context(target_offload="gpu"):
    nn = KNeighborsClassifier().fit(Xdf, ys)
    neigh = nn.kneighbors_graph(Xdf)

Fixed! I can see it outputs float32 now:

cd /export/users/yuejiao && python -c "
import os
os.environ['SCIPY_ARRAY_API'] = '1'
import numpy as np
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
y = rng.integers(3, size=X.shape[0])
from sklearnex import config_context
from sklearnex.neighbors import KNeighborsClassifier
import polars as pl
Xdf = pl.DataFrame(X)
ys = pl.Series(y)
with config_context(target_offload='gpu'):
    nn = KNeighborsClassifier().fit(Xdf, ys)
    neigh = nn.kneighbors_graph(Xdf)
print(neigh.dtype)
"
float32

@yuejiaointel This is still outputting float64 for me. Are you using the latest versions of scikit-learn and scipy?

@david-cortes-intel
Contributor

@yuejiaointel Getting another error with torch:

import os
os.environ["SCIPY_ARRAY_API"] = "1"

import numpy as np
import torch

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
Xt = torch.tensor(X, device="xpu")

from sklearnex import config_context
from sklearnex.neighbors import NearestNeighbors

with config_context(array_api_dispatch=True):
    nn = NearestNeighbors().fit(Xt)
    neigh = nn.kneighbors_graph(Xt)
TypeError: Object of type 'torch.dtype' is not an instance of 'dtype'

Fixed with the same fix as above:

python -c "
import os
os.environ['SCIPY_ARRAY_API'] = '1'
import numpy as np
import torch
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
Xt = torch.tensor(X, device='xpu')
from sklearnex import config_context
from sklearnex.neighbors import NearestNeighbors
with config_context(array_api_dispatch=True):
    nn = NearestNeighbors().fit(Xt)
    neigh = nn.kneighbors_graph(Xt)
print(f'Success! type: {type(neigh)}, shape: {neigh.shape}, dtype: {neigh.dtype}')
"
Success! type: <class 'scipy.sparse._csr.csr_matrix'>, shape: (1000, 1000), dtype: float32

This is now falling back to sklearn. It should execute on GPU.

if non_none_data:
    _, is_array_api = get_namespace(*non_none_data)
    if is_array_api and not any(
        hasattr(x, "__sycl_usm_array_interface__") for x in non_none_data
Contributor


This is not part of the array API specification. It is a workaround back from a time when dpnp wasn't compliant with array API. Please use the helper tools that you'll find through the sklearnex module, such as here:

sycl_type, xp, is_array_api_compliant = _get_sycl_namespace(*arrays)

Contributor Author

@yuejiaointel yuejiaointel Feb 16, 2026


I don't see this being used in common.py anymore; maybe it is old code.

# Remove check for result __sycl_usm_array_interface__ on deprecation of use_raw_inputs
# For tuple/list results (e.g. kneighbors), check elements instead of the container
result_on_device = (
    all(hasattr(r, "__sycl_usm_array_interface__") for r in result)
Contributor


Same comment here - please avoid checking or using this attribute.

Contributor Author


thx, fixed!

y,
dtype=[xp.float64, xp.float32],
accept_sparse="csr",
multi_output=True,
Contributor


If multi-output is now supported, then please add it to the support tables in the documentation:

- Multi-output and sparse data are not supported.

Contributor Author


Good catch! This was added to fix a patching-test warning (treated as an error):
WARNING: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
No warnings are raised with multi_output=True.

However, multi_output=True only controls the validation step: it tells validate_data to accept a 2D y without warning. It does not mean oneDAL actually runs multi-output, so no documentation update is needed here.

@david-cortes-intel
Contributor

@yuejiaointel From the discussion here:
#2700 (comment)

Please port the rest of the LOF code that calls scikit-learn to array API for full compatibility:

self._lrd = self._local_reachability_density(

I've already created a separate PR to document support for it:
#2943

@yuejiaointel
Contributor Author

@yuejiaointel This is still outputting float64 for me. Are you using the latest versions of scikit-learn and scipy?

After some investigation, I see that other algorithms also don't preserve the data type:

python -c "
import os
os.environ['SCIPY_ARRAY_API'] = '1'
import numpy as np
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
y = rng.standard_normal(size=X.shape[0]).astype(np.float32)
from sklearnex import config_context
from sklearnex.linear_model import LinearRegression
import polars as pl
Xdf = pl.DataFrame(X)
ys = pl.Series(y)
with config_context(target_offload='gpu'):
    lr = LinearRegression().fit(Xdf, ys)
    print('coef_ dtype:', lr.coef_.dtype)
    pred = lr.predict(Xdf)
    print('predict dtype:', pred.dtype)
"
coef_ dtype: float64
predict dtype: float64

A polars DataFrame doesn't carry a single dtype, so sklearn's check_array uses the first entry of the dtype list passed in, which is float64. A minimal example:

python -c "
import numpy as np, polars as pl
from sklearn.utils.validation import check_array

X = np.ones((3, 2), dtype=np.float32)

print(check_array(X, dtype=[np.float64, np.float32]).dtype)  # numpy -> float32
print(check_array(pl.DataFrame(X), dtype=[np.float64, np.float32]).dtype)  # polars -> float64
"
float32
float64

You can see that after check_array we get float64 even though X is float32: for numpy input the dtype is preserved, but for a DataFrame check_array simply picks the first dtype in the list. One potential fix would be to convert the polars DataFrame to numpy before passing it to check_array, but that seems beyond the scope of this PR.

@yuejiaointel
Contributor Author

yuejiaointel commented Feb 16, 2026

@yuejiaointel This is now falling back to sklearn. It should execute on GPU.

Fixed! Now with SKLEARNEX_VERBOSE=INFO I can see:
INFO:sklearnex: sklearn.neighbors.NearestNeighbors.fit: running accelerated version on GPU
INFO:sklearnex: sklearn.neighbors.NearestNeighbors.kneighbors: running accelerated version on GPU

@yuejiaointel
Contributor Author

@yuejiaointel From the discussion here: #2700 (comment)

Please port the rest of the LOF code that calls scikit-learn to array API for full compatibility:

self._lrd = self._local_reachability_density(

I've already created a separate PR to document support for it: #2943

added!

@yuejiaointel
Contributor Author

/intelci: run

@yuejiaointel
Contributor Author

/intelci: run

usm_iface := getattr(data, "__sycl_usm_array_interface__", None)
) and not hasattr(result, "__sycl_usm_array_interface__"):
queue = usm_iface["syclobj"]
sycl_type, _, _ = _get_sycl_namespace(data)
Contributor


Curious here: why did it require changes for KNN if it was working for other algorithms?

@david-cortes-intel
Contributor

You can see that after check_array we get float64 even though X is float32: for numpy input the dtype is preserved, but for a DataFrame check_array simply picks the first dtype in the list. One potential fix would be to convert the polars DataFrame to numpy before passing it to check_array, but that seems beyond the scope of this PR.

Indeed, looks like something is not working correctly across several estimators. Scikit-learn would preserve the dtype in those cases:

import numpy as np
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(1000, 20), dtype=np.float32)
y = rng.standard_normal(size=X.shape[0]).astype(np.float32)
from sklearn import config_context
from sklearn.linear_model import LinearRegression
import polars as pl
Xdf = pl.DataFrame(X)
ys = pl.Series(y)
lr = LinearRegression().fit(Xdf, ys)
print('coef_ dtype:', lr.coef_.dtype)
pred = lr.predict(Xdf)
print('predict dtype:', pred.dtype)
coef_ dtype: float32
predict dtype: float32
