-
Notifications
You must be signed in to change notification settings - Fork 153
fix: Preserve pandas column name attribute #2363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
narwhals/_dask/dataframe.py
Outdated
@@ -110,7 +110,7 @@ def collect( | |||
from narwhals._pandas_like.dataframe import PandasLikeDataFrame | |||
|
|||
return PandasLikeDataFrame( | |||
result, | |||
result.rename_axis(columns=self.native.columns.name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be one step ahead of dask#11874
narwhals/translate.py
Outdated
backend_version=parse_version(pd), | ||
implementation=Implementation.PANDAS, | ||
backend_version=parse_version(pd), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mental sanity for symmetry with other pandas-like and order of the spec π
|
||
df = nw.from_native(df_native) | ||
|
||
result = df.with_columns(b=nw.col("a") + 1, c=nw.col("a") * 2).select("c", "b") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to concatenate two methods here, mostly for the sake of it.
Might be worth reparametrizing it
narwhals/_dask/dataframe.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dask select
and with_columns
have no issues since we use .assign
method for both!
thanks! i'll do a little test to check there's no accidental / surprise overhead |
@MarcoGorelli I tried a simple script to check import timeit
from pprint import pprint
import numpy as np
import pandas as pd
def benchmark_rename_axis_overhead(n_rows: int, n_cols: int, repetitions: int = 1000) -> float:
df = pd.DataFrame(np.random.randn(n_rows, n_cols))
return timeit.timeit(
lambda: df.rename_axis(columns="new_name", copy=False),
number=repetitions
) / repetitions
TOTAL_SIZE = 10_000_000
results = {
(int(TOTAL_SIZE/10**i), 10**i): benchmark_rename_axis_overhead(n_rows=int(TOTAL_SIZE/10**i), n_cols=10**i)
for i in range(1, 6)
}
pprint(results)
Let me know what your thoughts are and if we need to check something more extensively |
Single argument + fake typing π
|
thanks!
i'm still slightly concerned about there being extra copies in old versions, i'll take another look |
No rush! I keep merging main to avoid having to deal with a large conflicting diff if that happens to be the case |
narwhals/_pandas_like/dataframe.py
Outdated
return self.__class__( | ||
df, | ||
rename_axis( | ||
df, | ||
implementation=self._implementation, | ||
backend_version=self._backend_version, | ||
columns=self._native_columns_name, | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π€ i'm wondering if we could/should just set df.columns.name = self._native_columns_name
else we need to trust pandas' copy-on-write / copy=False
working properly, which i'm not sure i do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I trust your judgement here - but should I would be concerned as a pandas user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @FBruzzesi !
I don't trust pandas' rename_axis
, nor do I trust that copy-on-write will take effect properly, we would need more comprehensive benchmarks to be sure
but, in select
and with_columns
, we create new objects, so it should be safe to just use .columns.name = ...
there
What type of PR is this? (check all applicable)
Related issues
select
andwith_columns
don't always preserve column index name for pandas-likeΒ #1483Checklist