
Narwhals implementation of from_dataframe and performance benchmark #2661

Merged

Conversation

authierj
Contributor

@authierj authierj commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

A first draft of from_dataframe has been adapted to work with any dataframe backend. This is done using narwhals, and the new function is called from_narwhals_dataframe. To benchmark the method, a script, narwhals_test_time.py, has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.

Other Information


@MarcoGorelli MarcoGorelli left a comment


thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

@authierj
Contributor Author

Hi @MarcoGorelli ,

Thanks for already looking at this and for your insights!

@authierj
Contributor Author

authierj commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. The investigation continues!
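For context, the expression in question converts the frame to a 2D array and appends a trailing axis, producing the 3D (time, component, sample) layout Darts uses; a minimal sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_0": [1.0, 2.0, 3.0], "col_1": [4.0, 5.0, 6.0]})

# (n_time, n_components) -> (n_time, n_components, 1). The np.newaxis
# indexing itself is a cheap view; per the discussion above, the observed
# cost was in materializing the array from the dataframe.
values = df.to_numpy()[:, :, np.newaxis]
```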

Contributor

@FBruzzesi FBruzzesi left a comment


Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂


codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.09%. Comparing base (e086582) to head (3fa924f).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| darts/timeseries.py | 91.83% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.09%   -0.09%     
==========================================
  Files         141      141              
  Lines       15582    15601      +19     
==========================================
+ Hits        14674    14679       +5     
- Misses        908      922      +14     


@authierj
Contributor Author

To compare the performance of the methods from_dataframe() (which only accepts pandas DataFrames as input) and from_narwhals_dataframe() (which accepts any supported DataFrame), I implemented the script narwhals_test_time.py. This script calls the two functions on a large number of different pandas DataFrames, with shuffled or unshuffled data and varying sizes, indices, and datetime formats.

Averaged over 10 runs, the processing times are as follows:

| method | average processing time [s] |
|---|---|
| from_dataframe() | 10.9718 |
| from_narwhals_dataframe() | 9.8564 |

Therefore, from_narwhals_dataframe() is 1.1154 seconds faster than from_dataframe(), representing a 10.17% decrease in processing time on average.
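The reported speedup follows directly from the two averages:

```python
t_pandas = 10.9718    # from_dataframe(), seconds
t_narwhals = 9.8564   # from_narwhals_dataframe(), seconds

saving = t_pandas - t_narwhals   # absolute saving in seconds
pct = saving / t_pandas * 100    # relative decrease in processing time

print(f"{saving:.4f} s faster, {pct:.2f}% decrease")
# -> 1.1154 s faster, 10.17% decrease
```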

As a consequence of this significant result, I will change the implementation of from_dataframe() and also modify from_series() to use the narwhals approach.

@authierj authierj marked this pull request as ready for review February 14, 2025 12:22
@authierj
Contributor Author

authierj commented Feb 21, 2025

The script used for testing the time performance of the methods is the following:

import argparse
import json
import time
import warnings
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

from darts.timeseries import TimeSeries

# Suppress all warnings
warnings.filterwarnings("ignore")


def test_from_dataframe(f_name: str):
    return getattr(TimeSeries, f_name)


def create_random_dataframes(
    num_rows: int = 10,
    num_columns: int = 3,
    index: bool = True,
    col_names_given: bool = True,
    start_date: str = "1900-01-01",
    freq: str = "D",
) -> list:
    """
    Create three pandas DataFrames with random data and a date, datetime64, or
    integer time axis stored as the index or as a column.

    Parameters:
    - num_rows (int): The number of rows in the DataFrames.
    - num_columns (int): The number of columns in the DataFrames.
    - index (bool): If True, the time values form the DataFrame index. If False,
      they are stored in a column named 'date'.
    - col_names_given (bool): If True, the column names are passed explicitly to
      the tested factory method; otherwise None is passed.
    - start_date (str): The start date for the date range.
    - freq (str): The frequency of the date range.

    Returns:
    - list: Three [DataFrame, col_names, time_col] triples (df_date, df_numpy, df_integer).
    """
    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate a date range or integer list based on the date_format parameter
    date_values = pd.date_range(start=start_date, periods=num_rows, freq=freq)
    integer_values = list(range(1, num_rows + 1))
    numpy_values = np.array(
        pd.date_range(start=start_date, periods=num_rows, freq=freq),
        dtype="datetime64[D]",
    )

    # Create random data for the DataFrames
    data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_columns)}

    # Create the DataFrames
    df_date = pd.DataFrame(data)
    df_numpy = pd.DataFrame(data)
    df_integer = pd.DataFrame(data)

    if col_names_given:
        col_names = df_date.columns.values
    else:
        col_names = None

    # Set the date as index or as a column based on the index parameter
    if index:
        df_date.index = date_values
        df_numpy.index = numpy_values
        df_integer.index = integer_values
    else:
        df_date["date"] = date_values
        df_numpy["date"] = numpy_values
        df_integer["date"] = integer_values

    if index:
        time_col = None
    else:
        time_col = "date"

    return [
        [df_date, col_names, time_col],
        [df_numpy, col_names, time_col],
        [df_integer, col_names, time_col],
    ]


def test_dataframes() -> list:
    test_config = product(
        [10, 100, 1000, 10000, 100000],
        [100],
        [True, False],
        [True, False],
    )

    dataframes_list = [
        create_random_dataframes(
            num_rows=num_rows,
            num_columns=num_columns,
            index=index,
            col_names_given=col_names_given,
        )
        for num_rows, num_columns, index, col_names_given in test_config
    ]

    return dataframes_list


def calculate_processing_time(
    f_name: str,
    num_iter: int,
    save_path="data/",
):
    df_list = test_dataframes()
    df_func = test_from_dataframe(f_name)

    # Initialize dictionaries to store processing times
    times = {}

    # Initialize the progress bar
    total_iterations = (
        len(df_list) * 2 * 3
    )  # 2 iterations per dataframe configuration, 3 df per config
    progress_bar = tqdm(total=total_iterations, desc="Processing DataFrames")

    for df_config in df_list:
        for df, col_names, time_col in df_config:
            num_rows = len(df)
            dict_entry = str(num_rows)

            for i in range(2):
                # on the second run we shuffle the data
                if i == 1:
                    df = df.sample(frac=1)
                    dict_entry += "_shuffled"

                begin = time.time()
                for _ in range(num_iter):
                    _ = df_func(df, value_cols=col_names, time_col=time_col, freq=None)
                end = time.time()
                timer = (end - begin) / num_iter

                if dict_entry not in times:
                    times[dict_entry] = timer
                else:
                    times[dict_entry] += timer

                # Update the progress bar
                progress_bar.update(1)

    file_name = f"{f_name}_avg_time_{num_iter}_iter.json"

    # Store the average times in a JSON file (the save_path directory must exist)
    with open(save_path + file_name, "w") as f:
        json.dump(times, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Time a TimeSeries factory method over many DataFrame configurations"
    )
    parser.add_argument(
        "--f_name", type=str, default="from_dataframe", help="method to time"
    )
    parser.add_argument(
        "--n_iter", type=int, default=100, help="number of function calls per configuration"
    )

    args = parser.parse_args()

    f_name = args.f_name
    n_iter = args.n_iter

    calculate_processing_time(f_name, n_iter)

Collaborator

@dennisbader dennisbader left a comment


Thanks for this great PR @authierj 🚀

This will be a great addition to Darts :)

I added a couple of suggestions, mainly on how to further simplify some things here and there, and improve the documentation

@hrzn
Contributor

hrzn commented Feb 22, 2025

This is very cool and I'm sure will make many users' lives easier!
I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Contributor

@FBruzzesi FBruzzesi left a comment


Hey @authierj , I left a few suggestions and considerations in the from_* functions! Hope they help :)

Comment on lines 722 to 729
raise_log(
ValueError(
"No time column or index found in the DataFrame. `time_col=None` "
"is only supported for pandas DataFrame which is indexed with one of the "
"supported index types: a DatetimeIndex, a RangeIndex, or an integer "
"Index that can be converted into a RangeIndex.",
),
)
Contributor


Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?

Contributor Author


We do! The condition np.issubdtype(time_index.dtype, np.integer) is True if the index is np.arange(len(df)) :)
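The check referenced here can be seen in isolation (a sketch; time_index is an illustrative name matching the discussion, not the PR's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_0": [0.1, 0.2, 0.3]})
time_index = pd.Index(np.arange(len(df)))  # default-style 0..n-1 integer index

# True for any integer dtype, so a plain np.arange(len(df)) index passes
# and can be converted into a pandas RangeIndex downstream.
is_integer_index = np.issubdtype(time_index.dtype, np.integer)
```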

Collaborator

@dennisbader dennisbader Feb 28, 2025


I believe what @FBruzzesi meant was that if no time_col is given, and the DF doesn't have an index (I assume that's what's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df.

For pandas this case will never exist, but for the others.

We could do below, and for the beginning raise a warning instead of an error:

if time_index is None:
    time_index = pd.RangeIndex(len(df))
    logger.info(
        "No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
        "`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
        "to your dataframe and defining `time_col`."
    )
elif not ...
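A runnable form of this fallback suggestion (resolve_time_index and the module-level logger are illustrative stand-ins, not the PR's actual code):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def resolve_time_index(df_len: int, time_index=None) -> pd.Index:
    """Fall back to a RangeIndex when no time column or index is available."""
    if time_index is None:
        time_index = pd.RangeIndex(df_len)
        logger.info(
            "No time column specified (`time_col=None`) and no index found in the "
            "DataFrame. Defaulting to `pandas.RangeIndex(len(df))`. If this is not "
            "desired consider adding a time column to your dataframe and defining "
            "`time_col`."
        )
    return time_index


idx = resolve_time_index(5)
```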

Contributor


Yes, exactly: non-pandas cases would end up raising if time_col is not provided. It's a design choice you will have to make, but I wanted to point out that this is the case :)

@dennisbader
Collaborator

This is very cool and I'm sure will make many users' lives easier! I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Agreed @hrzn :) To any dataframe support will be added in another PR.

Collaborator

@dennisbader dennisbader left a comment


Very nice @authierj 🚀 This looks great!

Also thanks @FBruzzesi for the additional comments.

Just some minor suggestions, then we're ready.


@dennisbader dennisbader merged commit 24cec52 into unit8co:master Feb 28, 2025
9 checks passed
Labels
feature request, improvement
Projects
Status: Released
Development

Successfully merging this pull request may close these issues.

Add TimeSeries.from_polars
5 participants