Use `narwhals` for DataFrame logic #100

kklein · 2025-02-05T20:10:41Z

Motivation

As described in issue #88, we would like to offer support for end-users feeding in X, w and y as polars data structures. See the documentation of the RLearners' fit method for an example use of these parameters. Prior to this PR,

X could either be np.ndarray, pd.DataFrame or scipy.sparse.csr_matrix
w could either be np.ndarray or pd.Series
y could either be np.ndarray or pd.Series

These expectations are also mirrored in these type definitions: Matrix, Vector

In principle, we could go down the road of adding a conditional branch whenever we operate on an input matrix or vector, assessing whether the data structure at hand is a polars data structure and if so, implement polars-specific logic. This approach was drafted in #90. The downsides of this approach include that:

We would likely need to include polars as an additional run dependency of the library.
There would be several implementations of the same logic. This, in turn, would bloat the code base and increase the risk of inconsistent behaviour.

In order to avoid these shortcomings, we would like to use narwhals and abstract away from pandas and polars. narwhals comes with very few dependencies and allows for a single dataframe library control flow. [NOTE: We do add polars as a dependency now, nevertheless.]

Changes

Replace internal pandas operations by narwhals operations.
Add narwhals as a run dependency.
Add polars as a run dependency.
Add polars as a backend to various tests operating on input matrices or vectors.

Comments

Note that this PR doesn't get rid of all pandas logic. Certain operations, e.g. Explainer.feature_importances, always return pandas data structures, irrespective of input. We might want to think about parametrizing the kind of the return value data structure in the future. For now, we care mostly about enabling the usage of polars data structures for X, w and y.
I would suggest to limit the scope of this PR to code changes. Hence, I would suggest to address the newly added support of polars in a follow-up docs PR.

Testing

All tests which explicitly tested against several backends before this PR, now also test against polars as a backend.
Many tests are run against one type of data structure (only numpy or only pandas). So far, we have not replicated all of these tests with polars input for runtime concerns.
All tests relying on pandas input now rely on narwhals logic and run through successfully.

codecov · 2025-02-06T09:41:31Z

Codecov Report

Attention: Patch coverage is 73.39450% with 58 lines in your changes missing coverage. Please review.

Project coverage is 91.88%. Comparing base (daeadc2) to head (e95897f).
Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
metalearners/_utils.py	53.12%	45 Missing ⚠️
metalearners/_narwhals_utils.py	81.48%	5 Missing ⚠️
metalearners/metalearner.py	76.47%	4 Missing ⚠️
metalearners/slearner.py	93.54%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   94.39%   91.88%   -2.51%     
==========================================
  Files          15       16       +1     
  Lines        1802     1960     +158     
==========================================
+ Hits         1701     1801     +100     
- Misses        101      159      +58

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kklein · 2025-02-06T15:24:12Z

tests/test_utils.py

@@ -60,7 +61,7 @@ def test_simplify_output_raises(input):
        simplify_output(input)


-@pytest.mark.parametrize("backend", ["pd", "pd", "csr"])


This line contained a typo.

metalearners/tlearner.py

moritzwilksch · 2025-02-24T20:11:24Z

tests/test_slearner.py

+        if backend == "pd":
+            assert isinstance(X_with_w, pd.DataFrame)
+        elif backend == "pl":
+            assert isinstance(X_with_w, pd.DataFrame)


should this be pl?

It shouldn't have been but after reworking the way we handle treatment assignments in the S-Learner(#105) it now is expected to be pl. :)

* Rework handling of categories in S-Learner. * Rename and add doc strings. * Fix test. * Move infer_native_namespace to _narwhals_utils module. * Add _narwhals_utils module. * Rename and adapt interface. * Remove dead code. * Test _narwhals_utils. * Test _np_to_dummies. * Test specific _append_treatment_ functions separately for easier navigation. * Fix indexing issue. * Get rid of branching. * Cast bool index to int. * Use constant instead of value. * Move _stringify_column_names to _narwhals_utils. * Add docstring. * Use schema at creation instead of rename after creation. * Add type hint for native_namespace parameter.

metalearners/_narwhals_utils.py

Co-authored-by: Francesco Bruzzesi <[email protected]>

metalearners/_narwhals_utils.py

FBruzzesi · 2025-02-27T08:22:51Z

metalearners/_utils.py

 def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
    """Subselect certain rows from a matrix."""
-    if isinstance(rows, pd.Series):
+    if is_into_series(rows):
+        if not hasattr(rows, "to_numpy"):
+            raise ValueError("rows couldn't be converted to numpy.")
        rows = rows.to_numpy()
-    if isinstance(matrix, pd.DataFrame):
-        return matrix.iloc[rows]
-    return matrix[rows, :]
+    if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
+        return matrix[rows, :]
+    if is_into_dataframe(matrix):
+        matrix_nw = nw.from_native(matrix, eager_only=True)
+        if rows.dtype == "bool":
+            return matrix_nw.filter(rows.tolist()).to_native()
+        return matrix_nw[rows.tolist(), :].to_native()


I have some concern on performances for the multiple conversions happening in here (and similarly in index_vector).

Let me suggest a slightly different approach that triggers conversion to numpy and/or list only if needed:

def index_matrix(matrix: Matrix, rows: Vector) -> Matrix: """Subselect certain rows from a matrix.""" if not is_into_series(rows): msg = "some message" raise ValueError(msg) rows_nw = nw.from_native(rows, series_only=True) if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix): return matrix[rows_nw.to_numpy(), :] if is_into_dataframe(matrix): matrix_nw = nw.from_native(matrix, eager_only=True) if rows_nw.dtype is nw.Boolean(): return matrix_nw.filter(rows_nw).to_native() return matrix_nw[rows_nw.tolist(), :].to_native()

Notice that in case of is_into_dataframe(matrix) we avoid converting to numpy in both branches and in Boolean type case we avoid converting to a python list as well.

WDYT?

Thanks a lot for your detailed suggestion!

We would like support rows being any Vector. That implies that we are fine with it being a numpy.ndarray. Taking that into consideration, one could slightly adapt your suggestion, e.g. as follows:

def index_matrix(matrix: Matrix, rows: Vector) -> Matrix: if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix): if isinstance(rows, np.ndarray): return matrix[rows, :] if is_into_series(rows): rows_nw = nw.from_native(rows, series_only=True, eager_only=True) return matrix[rows_np.to_numpy(), :] if is_into_dataframe(matrix): matrix_nw = nw.from_native(matrix, eager_only=True) if isinstance(rows, np.ndarray): if rows.dtype == "bool": return matrix_nw.filter(rows.tolist()).to_native() return matrix_nw[rows.tolist(), :].to_native() if is_into_series(rows): rows_nw = nw.from_native(rows, series_only=True, eager_only=True) if rows_nw.dtype == nw.Boolean: return matrix_nw.filter(rows_nw).to_native() return matrix_nw[rows_nw.to_list(), :].to_native() raise ValueError( f"rows to index matrix with are of unexpected type: {type(rows)}" ) raise ValueError(f"matrix to be indexed is of unexpected type: {type(matrix)}")

This is great in that one saves conversions. Yet, this also means that the implementation no longer works if:

matrix is a pandas.DataFrame and rows is a polars.Series

matrix is a polars.DataFrame and rows is a pandas.Series

Off the top of my head I can't imagine a reasonable scenario where we could end up with a pandas and a polars object. On the other hand I'm not sure about it and enforcing this restriction would need to be propagated up from this place to end-user facing interfaces.

I added a TODO. :)

Co-authored-by: Francesco Bruzzesi <[email protected]>

kklein · 2025-03-09T14:42:05Z

Thanks a lot @FBruzzesi, @MarcoGorelli and @moritzwilksch for your feedback! I really appreciate you being so invested.

Apologies about this PR taking so long to finalize; I've been extraordinarily caught up in things.

I will let you know as soon as we publish a new release of the library. I'll also write a small company blogpost about our migration from pandas to narwhals logic. :)

FBruzzesi · 2025-03-14T13:17:57Z

Apologies about this PR taking so long to finalize; I've been extraordinarily caught up in things.

No need to apologize 🚀

I will let you know as soon as we publish a new release of the library. I'll also write a small company blogpost about our migration from pandas to narwhals logic. :)

Great job @kklein! Looking forward to the blog post and the release.

In the meanwhile, I added metalearners in Narwhals used by/ecosystem sections.

We tend to be extra careful and add downstream libraries tests in our CI. I noticed that metalearners CI is a bit slow compared to how quick we like to iterate on PRs/commits. I wonder if that's enough to run tests/test__narwhals_utils.py test in our CI, and/or you suggest a subset of the entire test suite - I did something similar for darts.

I am not a pixi user, and I am not sure how that would look like or even possible.

kklein added 12 commits February 5, 2025 14:55

Fix type annotation.

32ebeac

Use narwhals for indexing utils.

5b3a419

Return early.

2a78f0a

Use narwhals for subselection of columns.

b7be471

Use narwhals for conversion of treatment assignments.

91328e1

Use narwhals in slearner.

01d1f4f

Use narwhals in treatment conversion.

c0365cd

Add polars as test dependency.

04b949d

Add polars to append_treatment test.

480006c

Don't convert column name types.

621d662

Add polars backend to test_feature_set.

d86cc2a

Handle ints feature set lookup.

c5ef4d0

kklein added 8 commits February 6, 2025 11:15

Test indexing against polars.

e4c4d07

Test fixed_binary_propensity against polars.

5733dbc

Add polars backend to test_crossfitestimator_oos_smoke.

c4f15cf

Add polars backend to test_validate_outcome_* tests.

83dd478

Fix docstring.

30483c2

Add polars backend to test_evaluate_feature_set_smoke.

ade0b9c

Properly filter vectors if boolean mask is given.

942a3de

Subselect with index helper.

ccda67b

kklein commented Feb 6, 2025

View reviewed changes

kklein added 4 commits February 6, 2025 16:49

Add test case for boolean masking.

7d39bb5

Add test case for boolean masking.

8626100

Support boolean masking for matrices.

d14e063

Adapt learners to handle non-np, non-pd input datastructures.

5567c17

kklein linked an issue Feb 7, 2025 that may be closed by this pull request

Add support for polars #88

Closed

kklein mentioned this pull request Feb 7, 2025

Add support for polars #90

Closed

1 task

kklein changed the title ~~Use narwhals for dataframe logic~~ Use narwhals for DataFrame logic Feb 11, 2025

Merge branch 'main' of github.com:Quantco/metalearners into narwhals

e0ccd75

kklein added 2 commits February 24, 2025 17:41

Merge branch 'main' of github.com:Quantco/metalearners into narwhals

e7fb130

Fix typo.

2d87255

moritzwilksch reviewed Feb 24, 2025

View reviewed changes

metalearners/tlearner.py Outdated Show resolved Hide resolved

moritzwilksch reviewed Feb 24, 2025

View reviewed changes

kklein and others added 3 commits February 26, 2025 12:07

Remove dead breakpoint.

47a7eab

Enable mypy.

ba0ac26

kklein requested a review from moritzwilksch February 26, 2025 14:59

Include csr_matrix in check for matrix.

3ad4f5b

kklein commented Feb 26, 2025

View reviewed changes

metalearners/_narwhals_utils.py Show resolved Hide resolved

FBruzzesi reviewed Feb 26, 2025

View reviewed changes

metalearners/_narwhals_utils.py Outdated Show resolved Hide resolved

FBruzzesi reviewed Feb 26, 2025

View reviewed changes

metalearners/_narwhals_utils.py Outdated Show resolved Hide resolved

kklein and others added 3 commits February 26, 2025 18:33

Update metalearners/_narwhals_utils.py

5e4b5b7

Co-authored-by: Francesco Bruzzesi <[email protected]>

Update metalearners/_narwhals_utils.py

740a5a3

Co-authored-by: Francesco Bruzzesi <[email protected]>

Apply pchs.

f37f7e3

moritzwilksch reviewed Feb 26, 2025

View reviewed changes

metalearners/_narwhals_utils.py Show resolved Hide resolved

kklein added 3 commits February 27, 2025 00:29

Check for null values in nw one-hot-encoding.

f535c40

Merge branch 'main' of github.com:Quantco/metalearners into narwhals

cbcd50e

Use md instead of rst in docstrings.

0c0f415

FBruzzesi reviewed Feb 27, 2025

View reviewed changes

metalearners/_narwhals_utils.py Outdated Show resolved Hide resolved

FBruzzesi reviewed Feb 27, 2025

View reviewed changes

kklein and others added 5 commits February 27, 2025 12:39

Update metalearners/_narwhals_utils.py

ae7f613

Co-authored-by: Francesco Bruzzesi <[email protected]>

Hotfix issue regarding interaction between lightgbm and polars.

f7dc89d

Add todo.

8721b5c

Update lock.

0e595fc

Fix test.

e95897f

kklein merged commit 6ad69da into main Mar 8, 2025
13 of 16 checks passed

kklein deleted the narwhals branch March 8, 2025 23:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use `narwhals` for DataFrame logic #100

Use `narwhals` for DataFrame logic #100

Uh oh!

kklein commented Feb 5, 2025 •

edited

Loading

Uh oh!

codecov bot commented Feb 6, 2025 •

edited

Loading

Uh oh!

kklein Feb 6, 2025

Uh oh!

Uh oh!

moritzwilksch Feb 24, 2025

Uh oh!

kklein Feb 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FBruzzesi Feb 27, 2025 •

edited

Loading

Uh oh!

kklein Feb 27, 2025 •

edited

Loading

Uh oh!

kklein Mar 8, 2025

Uh oh!

Uh oh!

kklein commented Mar 9, 2025

Uh oh!

FBruzzesi commented Mar 14, 2025

Uh oh!

Uh oh!

		@@ -60,7 +61,7 @@ def test_simplify_output_raises(input):
		simplify_output(input)


		@pytest.mark.parametrize("backend", ["pd", "pd", "csr"])

Use narwhals for DataFrame logic #100

Use narwhals for DataFrame logic #100

Uh oh!

Conversation

kklein commented Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Comments

Testing

Uh oh!

codecov bot commented Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kklein Feb 6, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

moritzwilksch Feb 24, 2025

Choose a reason for hiding this comment

Uh oh!

kklein Feb 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FBruzzesi Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kklein Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kklein Mar 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kklein commented Mar 9, 2025

Uh oh!

FBruzzesi commented Mar 14, 2025

Uh oh!

Uh oh!

Use `narwhals` for DataFrame logic #100

Use `narwhals` for DataFrame logic #100

kklein commented Feb 5, 2025 •

edited

Loading

codecov bot commented Feb 6, 2025 •

edited

Loading

kklein Feb 26, 2025 •

edited

Loading

FBruzzesi Feb 27, 2025 •

edited

Loading

kklein Feb 27, 2025 •

edited

Loading