Narwhals implementation of from_dataframe and performance benchmark #2661
Thanks for giving this a go! I've left a couple of comments.
I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look.
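To make the suspicion above concrete, here is a minimal sketch, assuming a pandas Series as input. The column contents and size are hypothetical, since the PR code under discussion is not shown in this thread. Going through `.to_list()` creates one Python object per element before NumPy rebuilds the array, whereas `.to_numpy()` asks for the array directly.

```python
import numpy as np
import pandas as pd

# Hypothetical column; the real data used in the PR is not shown in this thread.
values = pd.Series(np.arange(100_000, dtype="float64"))

# Route 1: go through a Python list (one Python float object per element),
# then let NumPy rebuild an array from that list.
via_list = np.array(values.to_list())

# Route 2: ask pandas for the NumPy array directly.
via_numpy = values.to_numpy()

# Both routes yield the same numbers; only the conversion cost differs.
assert np.array_equal(via_list, via_numpy)
```

Whether this accounts for the slow-down in the PR would still need to be measured on the actual code path.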
Hi @MarcoGorelli, thanks for already looking at this and for your insights!
Hi @MarcoGorelli, I investigated the issue, and it appears that the
Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.09%   -0.09%
==========================================
  Files         141      141
  Lines       15582    15601      +19
==========================================
+ Hits        14674    14679       +5
- Misses        908      922      +14

☔ View full report in Codecov by Sentry.
…om/authierj/darts into feature/add_timeseries_from_polars
To compare the performance of the methods Averaged over 10 runs, the processing times are as follows:
Therefore, As a consequence of this significant result, I will change the implementation of |
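The "averaged over 10 runs" measurement mentioned above could be taken with a small harness like the following. This is only a sketch using the standard library: the helper name `average_runtime` and the stand-in workload are hypothetical, and the contents of the benchmark file added in the PR are not shown here.

```python
import time
from typing import Callable


def average_runtime(fn: Callable[[], object], n_runs: int = 10) -> float:
    """Return the mean wall-clock time of `fn` in seconds over `n_runs` calls."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / n_runs


# Hypothetical stand-in workload; in the PR this would be a call that builds
# a TimeSeries from a dataframe with each of the two methods being compared.
def build_squares() -> list:
    return [i * i for i in range(10_000)]


mean_seconds = average_runtime(build_squares, n_runs=10)
print(f"mean over 10 runs: {mean_seconds:.6f} s")
```

Timing both methods with the same harness and the same input keeps the comparison fair.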
Co-authored-by: Dennis Bader
…om/authierj/darts into feature/add_timeseries_from_polars
This is very cool and I'm sure will make many users' lives easier!
Hey @authierj, I left a few suggestions and considerations in the from_* functions! Hope they help :)
darts/timeseries.py (outdated)
    raise_log(
        ValueError(
            "No time column or index found in the DataFrame. `time_col=None` "
            "is only supported for pandas DataFrame which is indexed with one of the "
            "supported index types: a DatetimeIndex, a RangeIndex, or an integer "
            "Index that can be converted into a RangeIndex.",
        ),
    )
Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?
We do! The condition np.issubdtype(time_index.dtype, np.integer) is True if the index is np.arange(len(df)) :)
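The condition quoted above can be checked in isolation. The following is a small sketch with a hypothetical five-row index, not the code from the PR:

```python
import numpy as np
import pandas as pd

n_rows = 5  # hypothetical dataframe length

# An index built from np.arange(len(df)) has an integer dtype...
time_index = pd.Index(np.arange(n_rows))
is_integer_index = np.issubdtype(time_index.dtype, np.integer)

# ...whereas a datetime index does not pass the same check.
datetime_index = pd.date_range("2024-01-01", periods=n_rows, freq="D")
is_integer_datetime = np.issubdtype(datetime_index.dtype, np.integer)

print(is_integer_index, is_integer_datetime)
```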
I believe what @FBruzzesi meant was: if no time_col is given and the DF doesn't have an index (I assume that's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df.
For pandas this case will never occur, but it will for the other backends.
We could do the following and, to begin with, raise a warning instead of an error:
    if time_index is None:
        time_index = pd.RangeIndex(len(df))
        logger.info(
            "No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
            "`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
            "to your dataframe and defining `time_col`."
        )
    elif not ...
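The proposed fallback can be tried out on its own with the sketch below. The helper name `resolve_time_index` and its signature are hypothetical and only serve the illustration; this is not the code that was merged.

```python
import logging
from typing import Optional

import pandas as pd

logger = logging.getLogger(__name__)


def resolve_time_index(time_index: Optional[pd.Index], n_rows: int) -> pd.Index:
    """Hypothetical helper: default to a RangeIndex when no time index exists."""
    if time_index is None:
        # Inform the user rather than raising, as suggested in the comment above.
        logger.info(
            "No time column specified (`time_col=None`) and no index found in "
            "the DataFrame. Defaulting to `pandas.RangeIndex(len(df))`."
        )
        return pd.RangeIndex(n_rows)
    return time_index


# A dataframe without an index (e.g. polars) would take the fallback path.
fallback = resolve_time_index(None, n_rows=3)

# An existing index is passed through unchanged.
explicit = resolve_time_index(
    pd.date_range("2024-01-01", periods=3, freq="D"), n_rows=3
)
```

Logging at info level mirrors the snippet in the comment; a stricter variant could log a warning instead.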
Yes exactly, non-pandas cases would end up raising if time_col is not provided. It's a design choice you will have to make, but I wanted to point out that this was the case :)
Agreed @hrzn :) Support for converting to any dataframe will be added in another PR.
Co-authored-by: Dennis Bader
Co-authored-by: Francesco Bruzzesi
Very nice @authierj 🚀 This looks great!
Also thanks @FBruzzesi for the additional comments.
Just some minor suggestions, then we're ready.
Co-authored-by: Dennis Bader
Checklist before merging this PR:
Fixes #2635.
Summary
A first draft of from_dataframe has been adapted to work with any dataframe. This is done using narwhals, and the function is called from_narwhals_dataframe. In order to test the performance of the method, a file narwhals_test_time.py has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.
Other Information