-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use narwhals
for DataFrame logic
#100
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #100 +/- ##
==========================================
- Coverage 94.39% 91.88% -2.51%
==========================================
Files 15 16 +1
Lines 1802 1960 +158
==========================================
+ Hits 1701 1801 +100
- Misses 101 159 +58 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@@ -60,7 +61,7 @@ def test_simplify_output_raises(input): | |||
simplify_output(input) | |||
|
|||
|
|||
@pytest.mark.parametrize("backend", ["pd", "pd", "csr"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line contained a typo.
narwhals
for DataFrame logic
tests/test_slearner.py
Outdated
if backend == "pd": | ||
assert isinstance(X_with_w, pd.DataFrame) | ||
elif backend == "pl": | ||
assert isinstance(X_with_w, pd.DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be pl
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't have been but after reworking the way we handle treatment assignments in the S-Learner(#105) it now is expected to be pl
. :)
* Rework handling of categories in S-Learner. * Rename and add doc strings. * Fix test. * Move infer_native_namespace to _narwhals_utils module. * Add _narwhals_utils module. * Rename and adapt interface. * Remove dead code. * Test _narwhals_utils. * Test _np_to_dummies. * Test specific _append_treatment_ functions separately for easier navigation. * Fix indexing issue. * Get rid of branching. * Cast bool index to int. * Use constant instead of value. * Move _stringify_column_names to _narwhals_utils. * Add docstring. * Use schema at creation instead of rename after creation. * Add type hint for native_namespace parameter.
Co-authored-by: Francesco Bruzzesi <[email protected]>
Co-authored-by: Francesco Bruzzesi <[email protected]>
metalearners/_utils.py
Outdated
def index_matrix(matrix: Matrix, rows: Vector) -> Matrix: | ||
"""Subselect certain rows from a matrix.""" | ||
if isinstance(rows, pd.Series): | ||
if is_into_series(rows): | ||
if not hasattr(rows, "to_numpy"): | ||
raise ValueError("rows couldn't be converted to numpy.") | ||
rows = rows.to_numpy() | ||
if isinstance(matrix, pd.DataFrame): | ||
return matrix.iloc[rows] | ||
return matrix[rows, :] | ||
if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix): | ||
return matrix[rows, :] | ||
if is_into_dataframe(matrix): | ||
matrix_nw = nw.from_native(matrix, eager_only=True) | ||
if rows.dtype == "bool": | ||
return matrix_nw.filter(rows.tolist()).to_native() | ||
return matrix_nw[rows.tolist(), :].to_native() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concern on performances for the multiple conversions happening in here (and similarly in index_vector
).
Let me suggest a slightly different approach that triggers conversion to numpy and/or list only if needed:
def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
"""Subselect certain rows from a matrix."""
if not is_into_series(rows):
msg = "some message"
raise ValueError(msg)
rows_nw = nw.from_native(rows, series_only=True)
if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
return matrix[rows_nw.to_numpy(), :]
if is_into_dataframe(matrix):
matrix_nw = nw.from_native(matrix, eager_only=True)
if rows_nw.dtype is nw.Boolean():
return matrix_nw.filter(rows_nw).to_native()
return matrix_nw[rows_nw.tolist(), :].to_native()
Notice that in case of is_into_dataframe(matrix)
we avoid converting to numpy in both branches and in Boolean type case we avoid converting to a python list as well.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your detailed suggestion!
We would like support rows
being any Vector
. That implies that we are fine with it being a numpy.ndarray
. Taking that into consideration, one could slightly adapt your suggestion, e.g. as follows:
def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
if isinstance(matrix, np.ndarray) or isinstance(matrix, csr_matrix):
if isinstance(rows, np.ndarray):
return matrix[rows, :]
if is_into_series(rows):
rows_nw = nw.from_native(rows, series_only=True, eager_only=True)
return matrix[rows_np.to_numpy(), :]
if is_into_dataframe(matrix):
matrix_nw = nw.from_native(matrix, eager_only=True)
if isinstance(rows, np.ndarray):
if rows.dtype == "bool":
return matrix_nw.filter(rows.tolist()).to_native()
return matrix_nw[rows.tolist(), :].to_native()
if is_into_series(rows):
rows_nw = nw.from_native(rows, series_only=True, eager_only=True)
if rows_nw.dtype == nw.Boolean:
return matrix_nw.filter(rows_nw).to_native()
return matrix_nw[rows_nw.to_list(), :].to_native()
raise ValueError(
f"rows to index matrix with are of unexpected type: {type(rows)}"
)
raise ValueError(f"matrix to be indexed is of unexpected type: {type(matrix)}")
This is great in that one saves conversions. Yet, this also means that the implementation no longer works if:
matrix
is apandas.DataFrame
androws
is apolars.Series
matrix
is apolars.DataFrame
androws
is apandas.Series
Off the top of my head I can't imagine a reasonable scenario where we could end up with a pandas
and a polars
object. On the other hand I'm not sure about it and enforcing this restriction would need to be propagated up from this place to end-user facing interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a TODO. :)
Co-authored-by: Francesco Bruzzesi <[email protected]>
Thanks a lot @FBruzzesi, @MarcoGorelli and @moritzwilksch for your feedback! I really appreciate you being so invested. Apologies about this PR taking so long to finalize; I've been extraordinarily caught up in things. I will let you know as soon as we publish a new release of the library. I'll also write a small company blogpost about our migration from pandas to narwhals logic. :) |
No need to apologize 🚀
Great job @kklein! Looking forward to the blog post and the release. In the meanwhile, I added metalearners in Narwhals used by/ecosystem sections. We tend to be extra careful and add downstream libraries tests in our CI. I noticed that metalearners CI is a bit slow compared to how quick we like to iterate on PRs/commits. I wonder if that's enough to run I am not a pixi user, and I am not sure how that would look like or even possible. |
Motivation
As described in issue #88, we would like to offer support for end-users feeding in
X
,w
andy
aspolars
data structures. See the documentation of the RLearners' fit method for an example use of these parameters. Prior to this PR,X
could either benp.ndarray
,pd.DataFrame
orscipy.sparse.csr_matrix
w
could either benp.ndarray
orpd.Series
y
could either benp.ndarray
orpd.Series
These expectations are also mirrored in these type definitions:
Matrix
,Vector
In principle, we could go down the road of adding a conditional branch whenever we operate on an input matrix or vector, assessing whether the data structure at hand is a
polars
data structure and if so, implementpolars
-specific logic. This approach was drafted in #90. The downsides of this approach include that:polars
as an additional run dependency of the library.In order to avoid these shortcomings, we would like to use
narwhals
and abstract away frompandas
andpolars
.narwhals
comes with very few dependencies and allows for a single dataframe library control flow. [NOTE: We do addpolars
as a dependency now, nevertheless.]Changes
pandas
operations bynarwhals
operations.narwhals
as a run dependency.polars
as a run dependency.polars
as a backend to various tests operating on input matrices or vectors.Comments
pandas
logic. Certain operations, e.g.Explainer.feature_importances
, always returnpandas
data structures, irrespective of input. We might want to think about parametrizing the kind of the return value data structure in the future. For now, we care mostly about enabling the usage ofpolars
data structures forX
,w
andy
.polars
in a follow-up docs PR.Testing
polars
as a backend.numpy
or onlypandas
). So far, we have not replicated all of these tests withpolars
input for runtime concerns.pandas
input now rely onnarwhals
logic and run through successfully.