Narwhals implementation of from_dataframe and performance benchmark #2661

Merged

Conversation

authierj
Contributor

@authierj authierj commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

This PR adds a first draft of a dataframe-agnostic version of from_dataframe. It is built on narwhals and is called from_narwhals_dataframe. To measure its performance, the file narwhals_test_time.py has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.

Other Information

@MarcoGorelli MarcoGorelli left a comment

thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

@authierj
Contributor Author

Hi @MarcoGorelli ,

Thanks for already looking at this and for your insights!

@authierj
Contributor Author

authierj commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. I am still investigating.
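
For readers unfamiliar with the indexing expression, here is a small sketch of what [:, :, np.newaxis] does to a 2-D array. The array is made up for illustration, and reading the three axes as (time, component, sample) is an assumption about the target layout, not something stated in this thread.

```python
import numpy as np

# Made-up data: 3 time steps, 2 columns.
values_2d = np.arange(6, dtype=float).reshape(3, 2)

# Append a trailing axis of length 1.
values_3d = values_2d[:, :, np.newaxis]

print(values_2d.shape, values_3d.shape)  # (3, 2) (3, 2, 1)

# Indexing with np.newaxis returns a view, so no data is copied at this step.
print(np.shares_memory(values_2d, values_3d))  # True
```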

Contributor

@FBruzzesi FBruzzesi left a comment

Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.09%. Comparing base (e086582) to head (3fa924f).
Report is 1 commit behind head on master.

Files with missing lines    Patch %    Lines
darts/timeseries.py         91.83%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.09%   -0.09%     
==========================================
  Files         141      141              
  Lines       15582    15601      +19     
==========================================
+ Hits        14674    14679       +5     
- Misses        908      922      +14     


@authierj
Contributor Author

To compare the performance of from_dataframe() (which only accepts pandas DataFrames as input) and from_narwhals_dataframe() (which accepts any kind of DataFrame), I implemented the script narwhals_test_time.py. The script calls both functions on a large number of different pandas DataFrames, with shuffled or unshuffled data and varying sizes, indices, and datetime formats.
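
A timing comparison of this kind can be sketched with the standard library alone. This is not the contents of narwhals_test_time.py: average_runtime, baseline, and candidate are hypothetical stand-ins, and the inputs are plain lists rather than DataFrames.

```python
import time
from statistics import mean


def average_runtime(func, inputs, n_runs=10):
    """Call func on every input, repeat n_runs times, and return the mean duration in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for item in inputs:
            func(item)
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Hypothetical stand-ins for the two constructors being compared.
def baseline(data):
    return sorted(data)


def candidate(data):
    return sorted(data, reverse=True)


# Inputs of varying size, analogous to varying DataFrame sizes.
inputs = [list(range(n)) for n in (10, 100, 1000)]

baseline_avg = average_runtime(baseline, inputs)
candidate_avg = average_runtime(candidate, inputs)
print(f"baseline: {baseline_avg:.6f} s, candidate: {candidate_avg:.6f} s")
```

Averaging over several runs smooths out one-off effects such as caching or background load.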

Averaged over 10 runs, the processing times are as follows:

method                       average processing time [s]
from_dataframe()             10.9718
from_narwhals_dataframe()    9.8564

Therefore, from_narwhals_dataframe() is 1.1154 seconds faster than from_dataframe(), representing a 10.17% decrease in processing time on average.
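
The two figures follow directly from the averages reported above. The variable names below are chosen for illustration only.

```python
# Average processing times in seconds, as reported above.
from_dataframe_s = 10.9718
from_narwhals_s = 9.8564

# Absolute time saved, and the saving relative to the original method.
saved_s = from_dataframe_s - from_narwhals_s
relative_decrease = saved_s / from_dataframe_s

print(f"{saved_s:.4f} s saved, {relative_decrease:.2%} decrease")
# prints: 1.1154 s saved, 10.17% decrease
```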

Given this result, I will change the implementation of from_dataframe() and also modify from_series() to use the narwhals approach.

@authierj authierj marked this pull request as ready for review February 14, 2025 12:22
@hrzn
Contributor

hrzn commented Feb 22, 2025

This is very cool and I'm sure will make many users' lives easier!
I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Contributor

@FBruzzesi FBruzzesi left a comment

Hey @authierj , I left a few suggestions and considerations in the from_* functions! Hope they help :)

Comment on lines 722 to 729
raise_log(
ValueError(
"No time column or index found in the DataFrame. `time_col=None` "
"is only supported for pandas DataFrame which is indexed with one of the "
"supported index types: a DatetimeIndex, a RangeIndex, or an integer "
"Index that can be converted into a RangeIndex.",
),
)
Contributor

Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?

Contributor Author

We do! The condition np.issubdtype(time_index.dtype, np.integer) is True if the index is np.arange(len(df)) :)
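
The dtype check mentioned here can be tried in isolation. The arrays are made up for illustration; only np.issubdtype and np.integer come from the comment above.

```python
import numpy as np

# An integer range such as np.arange(len(df)) passes the check.
int_index = np.arange(5)
is_int = np.issubdtype(int_index.dtype, np.integer)
print(is_int)  # True

# A float-valued index does not.
float_index = np.array([0.0, 1.0, 2.0])
is_float_int = np.issubdtype(float_index.dtype, np.integer)
print(is_float_int)  # False
```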

Collaborator

@dennisbader dennisbader Feb 28, 2025

I believe what @FBruzzesi meant was: if no time_col is given and the DF doesn't have an index (I assume that's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df.

For pandas this case will never occur, but it will for the others.

We could do the following and, to begin with, emit a log message instead of raising an error:

if time_index is None:
    time_index = pd.RangeIndex(len(df))
    logger.info(
        "No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
        "`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
        "to your dataframe and defining `time_col`."
    )
elif not ...
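
The fallback suggested above can be written as a small standalone sketch. It assumes pandas is installed. resolve_time_index is a hypothetical helper, not the code that was merged, and the index validation hidden behind the elided elif branch is deliberately left out.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def resolve_time_index(time_index, n_rows):
    """Hypothetical helper: fall back to a RangeIndex when no time index is available."""
    if time_index is None:
        # No time column and no index: default to 0..n_rows-1 and tell the user.
        logger.info(
            "No time column specified (`time_col=None`) and no index found in the "
            "DataFrame. Defaulting to `pandas.RangeIndex(len(df))`."
        )
        return pd.RangeIndex(n_rows)
    # Otherwise keep the index that was found (its validation is omitted here).
    return time_index


fallback_index = resolve_time_index(None, 4)
print(list(fallback_index))  # [0, 1, 2, 3]
```

Logging at info level rather than raising keeps dataframes without an index usable when time_col is omitted, at the cost of a silent default if logging is not configured.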

Contributor

Yes, exactly: non-pandas cases would end up raising if time_col is not provided. It's a design choice you will have to make, but I wanted to point out that this is the case :)

@dennisbader
Collaborator

> This is very cool and I'm sure will make many users' lives easier! I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Agreed @hrzn :) Support for exporting to any dataframe will be added in another PR.

Collaborator

@dennisbader dennisbader left a comment

Very nice @authierj 🚀 This looks great!

Also thanks @FBruzzesi for the additional comments.

Just some minor suggestions, then we're ready.

@dennisbader dennisbader merged commit 24cec52 into unit8co:master Feb 28, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in darts Feb 28, 2025
@dennisbader dennisbader moved this from Done to Released in darts Mar 10, 2025
@cnhwl cnhwl mentioned this pull request Apr 9, 2025
3 tasks
Labels
feature request, improvement
Projects
Status: Released
Development

Successfully merging this pull request may close these issues.

Add TimeSeries.from_polars
5 participants