Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use narwhals for DataFrame logic #100

Merged
merged 57 commits into from
Mar 8, 2025
Merged

Use narwhals for DataFrame logic #100

merged 57 commits into from
Mar 8, 2025

Conversation

kklein
Copy link
Collaborator

@kklein kklein commented Feb 5, 2025

Motivation

As described in issue #88, we would like to offer support for end-users feeding in X, w and y as polars data structures. See the documentation of the RLearners' fit method for an example use of these parameters. Prior to this PR,

  • X could either be np.ndarray, pd.DataFrame or scipy.sparse.csr_matrix
  • w could either be np.ndarray or pd.Series
  • y could either be np.ndarray or pd.Series

These expectations are also mirrored in these type definitions: Matrix, Vector

In principle, we could go down the road of adding a conditional branch whenever we operate on an input matrix or vector, assessing whether the data structure at hand is a polars data structure and if so, implement polars-specific logic. This approach was drafted in #90. The downsides of this approach include that:

  • We would likely need to include polars as an additional run dependency of the library.
  • There would be several implementations of the same logic. This, in turn, would bloat the code base and increase the risk of inconsistent behaviour.

In order to avoid these shortcomings, we would like to use narwhals and abstract away from pandas and polars. narwhals comes with very few dependencies and allows for a single dataframe library control flow. [NOTE: We do add polars as a dependency now, nevertheless.]

Changes

  • Replace internal pandas operations by narwhals operations.
  • Add narwhals as a run dependency.
  • Add polars as a run dependency.
  • Add polars as a backend to various tests operating on input matrices or vectors.

Comments

  • Note that this PR doesn't get rid of all pandas logic. Certain operations, e.g. Explainer.feature_importances, always return pandas data structures, irrespective of input. We might want to think about parametrizing the kind of the return value data structure in the future. For now, we care mostly about enabling the usage of polars data structures for X, w and y.
  • I would suggest to limit the scope of this PR to code changes. Hence, I would suggest to address the newly added support of polars in a follow-up docs PR.

Testing

  • All tests which explicitly tested against several backends before this PR, now also test against polars as a backend.
  • Many tests are run against one type of data structure (only numpy or only pandas). So far, we have not replicated all of these tests with polars input for runtime concerns.
  • All tests relying on pandas input now rely on narwhals logic and run through successfully.

Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 73.39450% with 58 lines in your changes missing coverage. Please review.

Project coverage is 91.88%. Comparing base (daeadc2) to head (e95897f).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
metalearners/_utils.py 53.12% 45 Missing ⚠️
metalearners/_narwhals_utils.py 81.48% 5 Missing ⚠️
metalearners/metalearner.py 76.47% 4 Missing ⚠️
metalearners/slearner.py 93.54% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   94.39%   91.88%   -2.51%     
==========================================
  Files          15       16       +1     
  Lines        1802     1960     +158     
==========================================
+ Hits         1701     1801     +100     
- Misses        101      159      +58     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -60,7 +61,7 @@ def test_simplify_output_raises(input):
simplify_output(input)


@pytest.mark.parametrize("backend", ["pd", "pd", "csr"])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line contained a typo.

@kklein kklein linked an issue Feb 7, 2025 that may be closed by this pull request
@kklein kklein mentioned this pull request Feb 7, 2025
1 task
@kklein kklein changed the title Use narwhals for dataframe logic Use narwhals for DataFrame logic Feb 11, 2025
if backend == "pd":
assert isinstance(X_with_w, pd.DataFrame)
elif backend == "pl":
assert isinstance(X_with_w, pd.DataFrame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be pl?

Copy link
Collaborator Author

@kklein kklein Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't have been but after reworking the way we handle treatment assignments in the S-Learner(#105) it now is expected to be pl. :)

kklein and others added 3 commits February 26, 2025 12:07
* Rework handling of categories in S-Learner.

* Rename and add doc strings.

* Fix test.

* Move infer_native_namespace to _narwhals_utils module.

* Add _narwhals_utils module.

* Rename and adapt interface.

* Remove dead code.

* Test _narwhals_utils.

* Test _np_to_dummies.

* Test specific _append_treatment_ functions separately for easier navigation.

* Fix indexing issue.

* Get rid of branching.

* Cast bool index to int.

* Use constant instead of value.

* Move _stringify_column_names to _narwhals_utils.

* Add docstring.

* Use schema at creation instead of rename after creation.

* Add type hint for native_namespace parameter.
@kklein kklein requested a review from moritzwilksch February 26, 2025 14:59
Comment on lines 40 to 52
def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
"""Subselect certain rows from a matrix."""
if isinstance(rows, pd.Series):
if is_into_series(rows):
if not hasattr(rows, "to_numpy"):
raise ValueError("rows couldn't be converted to numpy.")
rows = rows.to_numpy()
if isinstance(matrix, pd.DataFrame):
return matrix.iloc[rows]
return matrix[rows, :]
if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
return matrix[rows, :]
if is_into_dataframe(matrix):
matrix_nw = nw.from_native(matrix, eager_only=True)
if rows.dtype == "bool":
return matrix_nw.filter(rows.tolist()).to_native()
return matrix_nw[rows.tolist(), :].to_native()
Copy link
Contributor

@FBruzzesi FBruzzesi Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concern on performances for the multiple conversions happening in here (and similarly in index_vector).

Let me suggest a slightly different approach that triggers conversion to numpy and/or list only if needed:

def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
    """Subselect certain rows from a matrix."""
    if not is_into_series(rows):
        msg = "some message"
        raise ValueError(msg)
    
    rows_nw = nw.from_native(rows, series_only=True)

    if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
        return matrix[rows_nw.to_numpy(), :]
    
    if is_into_dataframe(matrix):
        matrix_nw = nw.from_native(matrix, eager_only=True)
        if rows_nw.dtype is nw.Boolean():
            return matrix_nw.filter(rows_nw).to_native()
        return matrix_nw[rows_nw.tolist(), :].to_native()

Notice that in case of is_into_dataframe(matrix) we avoid converting to numpy in both branches and in Boolean type case we avoid converting to a python list as well.

WDYT?

Copy link
Collaborator Author

@kklein kklein Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your detailed suggestion!

We would like support rows being any Vector. That implies that we are fine with it being a numpy.ndarray. Taking that into consideration, one could slightly adapt your suggestion, e.g. as follows:

def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
    if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
        if isinstance(rows, np.ndarray):
            return matrix[rows, :]
        if is_into_series(rows):
            rows_nw = nw.from_native(rows, series_only=True, eager_only=True)
            return matrix[rows_np.to_numpy(), :]

    if is_into_dataframe(matrix):
        matrix_nw = nw.from_native(matrix, eager_only=True)

        if isinstance(rows, np.ndarray):
            if rows.dtype == "bool":
                return matrix_nw.filter(rows.tolist()).to_native()
            return matrix_nw[rows.tolist(), :].to_native()
        if is_into_series(rows):
            rows_nw = nw.from_native(rows, series_only=True, eager_only=True)
            if rows_nw.dtype == nw.Boolean:
                return matrix_nw.filter(rows_nw).to_native()
            return matrix_nw[rows_nw.to_list(), :].to_native()

        raise ValueError(
            f"rows to index matrix with are of unexpected type: {type(rows)}"
        )

    raise ValueError(f"matrix to be indexed is of unexpected type: {type(matrix)}")

This is great in that one saves conversions. Yet, this also means that the implementation no longer works if:

  • matrix is a pandas.DataFrame and rows is a polars.Series
  • matrix is a polars.DataFrame and rows is a pandas.Series

Off the top of my head I can't imagine a reasonable scenario where we could end up with a pandas and a polars object. On the other hand I'm not sure about it and enforcing this restriction would need to be propagated up from this place to end-user facing interfaces.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a TODO. :)

@kklein kklein merged commit 6ad69da into main Mar 8, 2025
13 of 16 checks passed
@kklein kklein deleted the narwhals branch March 8, 2025 23:00
@kklein
Copy link
Collaborator Author

kklein commented Mar 9, 2025

Thanks a lot @FBruzzesi, @MarcoGorelli and @moritzwilksch for your feedback! I really appreciate you being so invested.

Apologies about this PR taking so long to finalize; I've been extraordinarily caught up in things.

I will let you know as soon as we publish a new release of the library. I'll also write a small company blogpost about our migration from pandas to narwhals logic. :)

@FBruzzesi
Copy link
Contributor

Apologies about this PR taking so long to finalize; I've been extraordinarily caught up in things.

No need to apologize 🚀

I will let you know as soon as we publish a new release of the library. I'll also write a small company blogpost about our migration from pandas to narwhals logic. :)

Great job @kklein! Looking forward to the blog post and the release.

In the meanwhile, I added metalearners in Narwhals used by/ecosystem sections.

We tend to be extra careful and add downstream libraries tests in our CI. I noticed that metalearners CI is a bit slow compared to how quick we like to iterate on PRs/commits. I wonder if that's enough to run tests/test__narwhals_utils.py test in our CI, and/or you suggest a subset of the entire test suite - I did something similar for darts.

I am not a pixi user, and I am not sure how that would look like or even possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for polars
5 participants