Narwhals implementation of from_dataframe and performance benchmark #2661

Merged
Changes from 25 commits
Commits
42 commits
28a9298
narwhals implementation for and test benchmark
Jan 31, 2025
6382082
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Jan 31, 2025
0041203
changes from MarcoGorelli incorporated
Feb 4, 2025
576e88e
improvement thanks to reviewers
Feb 6, 2025
e013a42
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 6, 2025
dbe2cd9
added comments about slow and fast parts of the code
authierj Feb 7, 2025
b2ffc67
using pandas index to avoid .to_list()
authierj Feb 10, 2025
c5fa503
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 10, 2025
79312c9
bug fix added
authierj Feb 10, 2025
fc8bda4
Merge branch 'feature/add_timeseries_from_polars' of https://github.c…
authierj Feb 10, 2025
b08a74f
updated test script
authierj Feb 11, 2025
2425fbe
narwhals timeseries added
authierj Feb 12, 2025
36300f2
from_series changed, names changed
authierj Feb 14, 2025
ba01df1
changelog updated
authierj Feb 14, 2025
ffd1202
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 14, 2025
2e39269
small improvement
authierj Feb 17, 2025
1a9a266
clean test scripts added
authierj Feb 17, 2025
a030ea5
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 17, 2025
2c24a39
BUGFIX added for non_pandas df
authierj Feb 19, 2025
89f23fb
tests added for polars df
authierj Feb 19, 2025
de0a32d
polars and narwhals added to dependencies. Ideally, polars should be …
authierj Feb 19, 2025
66b770d
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 20, 2025
16bac00
refactoring pd_series and pd_dataframe
authierj Feb 20, 2025
0950910
removed test scripts from git repo
authierj Feb 21, 2025
042f9fb
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 21, 2025
5afc721
Update CHANGELOG.md
authierj Feb 21, 2025
7877dd6
Update darts/timeseries.py
authierj Feb 21, 2025
102a26c
easy corrections applied
authierj Feb 21, 2025
9d66c06
Merge branch 'feature/add_timeseries_from_polars' of https://github.c…
authierj Feb 21, 2025
f629089
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 21, 2025
56a20c1
narwhals_test_time removed
authierj Feb 27, 2025
f764e19
Update requirements/core.txt
authierj Feb 27, 2025
319a48f
Update darts/timeseries.py
authierj Feb 27, 2025
e8925f1
most corrections added
authierj Feb 27, 2025
05a7215
merged
authierj Feb 27, 2025
11d17c1
polars tests removed
authierj Feb 27, 2025
a720bb4
Merge branch 'master' into feature/add_timeseries_from_polars
authierj Feb 27, 2025
f9f5aa8
tests corrected
authierj Feb 27, 2025
e0b4984
Merge branch 'master' into feature/add_timeseries_from_polars
dennisbader Feb 28, 2025
c13cc1d
Update darts/timeseries.py
authierj Feb 28, 2025
370d761
Update darts/timeseries.py
authierj Feb 28, 2025
3fa924f
no time_col, define one
authierj Feb 28, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ but cannot always guarantee backwards compatibility. Changes that may **break co

 **Improved**

+- Implemented the `from_dataframe()` and `from_series()` methods with [Narwhals](https://github.com/narwhals-dev/narwhals), a compatibility layer between dataframe libraries. From now on, Darts can transform pandas, polars, arrows and many other dataframes into `TimeSeries`. [#2661](https://github.com/unit8co/darts/pull/2661) by [Jules Authier](https://github.com/authierj)
 - Added ONNX support for torch-based models with method `TorchForecastingModel.to_onnx()`. Check out [this example](https://unit8co.github.io/darts/userguide/gpu_and_tpu_usage.html#exporting-model-to-onnx-format-for-inference) from the user guide on how to export and load a model for inference. [#2620](https://github.com/unit8co/darts/pull/2620) by [Antoine Madrona](https://github.com/madtoinou)
 - Made method `ForecastingModel.untrained_model()` public. Use this method to get a new (untrained) model instance created with the same parameters. [#2684](https://github.com/unit8co/darts/pull/2684) by [Timon Erhart](https://github.com/turbotimon)
 - Made it possible to run the quickstart notebook `00-quickstart.ipynb` locally. [#2691](https://github.com/unit8co/darts/pull/2691) by [Jules Authier](https://github.com/authierj)
123 changes: 91 additions & 32 deletions darts/tests/test_timeseries.py
@@ -5,6 +5,7 @@

 import numpy as np
 import pandas as pd
+import polars as pl
 import pytest
 import xarray as xr
 from scipy.stats import kurtosis, skew
@@ -2506,7 +2507,16 @@ def test_tail_numeric_time_index(self):


 class TestTimeSeriesFromDataFrame:
-    def test_from_dataframe_sunny_day(self):
+    def pd_to_backend(self, df, backend, index=False):
+        if backend == "pandas":
+            return df
+        elif backend == "polars":
+            if index:
+                return pl.from_pandas(df.reset_index())
+            return pl.from_pandas(df)
+
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_from_dataframe_sunny_day(self, backend):
         data_dict = {"Time": pd.date_range(start="20180501", end="20200301", freq="MS")}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
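Polars DataFrames have no index, so the `pd_to_backend` helper calls `df.reset_index()` before `pl.from_pandas` when `index=True`, and the tests then pass the resulting column name as `time_col`. A minimal pandas-only sketch of which column name `reset_index()` produces (the variable names are illustrative and not taken from the PR):

```python
import pandas as pd

values = [1.0, 2.0, 3.0]
times = pd.date_range(start="2018-05-01", periods=3, freq="MS")

# Named index: reset_index() creates a column carrying the index name.
named = pd.DataFrame({"Values1": values}, index=times.rename("Time"))
named_cols = list(named.reset_index().columns)

# Unnamed index: pandas falls back to the column name "index".
unnamed = pd.DataFrame({"Values1": values}, index=times)
unnamed_cols = list(unnamed.reset_index().columns)

print(named_cols)
print(unnamed_cols)
```

This would explain why the polars variants of the tests use `time_col="Time"` when the pandas index is named `Time` and `time_col="index"` when it is unnamed.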
@@ -2520,58 +2530,78 @@ def test_from_dataframe_sunny_day(self):
         data_pd2["Time"] = data_pd2["Time"].apply(lambda date: str(date))
         data_pd3 = data_pd1.set_index("Time")

-        data_darts1 = TimeSeries.from_dataframe(df=data_pd1, time_col="Time")
-        data_darts2 = TimeSeries.from_dataframe(df=data_pd2, time_col="Time")
-        data_darts3 = TimeSeries.from_dataframe(df=data_pd3)
+        data_darts1 = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(data_pd1, backend), time_col="Time"
+        )
+        data_darts2 = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(data_pd2, backend), time_col="Time"
+        )
+        data_darts3 = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(data_pd3, backend, index=True),
+            time_col=None if backend == "pandas" else "Time",
+        )

         assert data_darts1 == data_darts2
         assert data_darts1 == data_darts3

-    def test_time_col_convert_string_integers(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_string_integers(self, backend):
         expected = np.array(list(range(3, 10)))
         data_dict = {"Time": expected.astype(str)}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
         )
         df = pd.DataFrame(data_dict)
-        ts = TimeSeries.from_dataframe(df=df, time_col="Time")
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend), time_col="Time"
+        )

         assert set(ts.time_index.values.tolist()) == set(expected)
         assert ts.time_index.dtype == int
         assert ts.time_index.name == "Time"

-    def test_time_col_convert_integers(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_integers(self, backend):
         expected = np.array(list(range(10)))
         data_dict = {"Time": expected}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
         )
+
         df = pd.DataFrame(data_dict)
-        ts = TimeSeries.from_dataframe(df=df, time_col="Time")
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend), time_col="Time"
+        )

         assert set(ts.time_index.values.tolist()) == set(expected)
         assert ts.time_index.dtype == int
         assert ts.time_index.name == "Time"

-    def test_fail_with_bad_integer_time_col(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_fail_with_bad_integer_time_col(self, backend):
         bad_time_col_vals = np.array([4, 0, 1, 2])
         data_dict = {"Time": bad_time_col_vals}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
         )
         df = pd.DataFrame(data_dict)
         with pytest.raises(ValueError):
-            TimeSeries.from_dataframe(df=df, time_col="Time")
+            TimeSeries.from_dataframe(
+                df=self.pd_to_backend(df, backend), time_col="Time"
+            )

-    def test_time_col_convert_rangeindex(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_rangeindex(self, backend):
         for expected_l, step in zip([[4, 0, 2, 3, 1], [8, 0, 4, 6, 2]], [1, 2]):
             expected = np.array(expected_l)
             data_dict = {"Time": expected}
             data_dict["Values1"] = np.random.uniform(
                 low=-10, high=10, size=len(data_dict["Time"])
             )
             df = pd.DataFrame(data_dict)
-            ts = TimeSeries.from_dataframe(df=df, time_col="Time")
+            ts = TimeSeries.from_dataframe(
+                df=self.pd_to_backend(df, backend), time_col="Time"
+            )

             # check type (should convert to RangeIndex):
             assert type(ts.time_index) is pd.RangeIndex
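The tests above expect an unsorted but evenly spaced integer time column to end up as a sorted `pd.RangeIndex`, and an unevenly spaced one such as `[4, 0, 1, 2]` to raise `ValueError`. The following is a rough pandas/NumPy sketch of that behaviour; `to_range_index` is a hypothetical helper written for illustration and is not the code Darts uses.

```python
import numpy as np
import pandas as pd


def to_range_index(values, name=None):
    """Hypothetical helper: build a RangeIndex from unsorted, evenly spaced integers."""
    sorted_vals = np.sort(np.asarray(values))
    # A valid input has exactly one distinct, strictly positive step.
    steps = np.unique(np.diff(sorted_vals))
    if len(steps) != 1 or steps[0] <= 0:
        raise ValueError(
            "integer time values must be unique and evenly spaced (at least two values)"
        )
    step = int(steps[0])
    return pd.RangeIndex(
        start=int(sorted_vals[0]),
        stop=int(sorted_vals[-1]) + step,
        step=step,
        name=name,
    )


# The two inputs used in the test, with steps 1 and 2.
idx_step1 = to_range_index([4, 0, 2, 3, 1], name="Time")
idx_step2 = to_range_index([8, 0, 4, 6, 2], name="Time")
print(idx_step1, idx_step2)
```

Under these assumptions, the gap in `[4, 0, 1, 2]` yields two distinct steps, so the helper raises `ValueError`, which mirrors what `test_fail_with_bad_integer_time_col` expects.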
@@ -2586,31 +2616,38 @@ def test_time_col_convert_rangeindex(self):
             ]
             assert np.all(ar1 == ar2)

-    def test_time_col_convert_datetime(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_datetime(self, backend):
         expected = pd.date_range(start="20180501", end="20200301", freq="MS")
         data_dict = {"Time": expected}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
         )
         df = pd.DataFrame(data_dict)
-        ts = TimeSeries.from_dataframe(df=df, time_col="Time")
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend), time_col="Time"
+        )

         assert ts.time_index.dtype == "datetime64[ns]"
         assert ts.time_index.name == "Time"

-    def test_time_col_convert_datetime_strings(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_datetime_strings(self, backend):
         expected = pd.date_range(start="20180501", end="20200301", freq="MS")
         data_dict = {"Time": expected.values.astype(str)}
         data_dict["Values1"] = np.random.uniform(
             low=-10, high=10, size=len(data_dict["Time"])
         )
         df = pd.DataFrame(data_dict)
-        ts = TimeSeries.from_dataframe(df=df, time_col="Time")
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend), time_col="Time"
+        )

         assert ts.time_index.dtype == "datetime64[ns]"
         assert ts.time_index.name == "Time"

-    def test_time_col_with_tz(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_with_tz_df(self, backend):
         # numpy and xarray don't support "timezone aware" pd.DatetimeIndex
         # the BUGFIX removes timezone information without conversion

@@ -2621,13 +2658,10 @@ def test_time_col_with_tz(self):
         # pd.DataFrame loses the tz information unless it is contained in its index
         # (other columns are silently converted to UTC, with tz attribute set to None)
         df = pd.DataFrame(data=values, index=time_range_MS)
-        ts = TimeSeries.from_dataframe(df=df)
-        assert list(ts.time_index) == list(time_range_MS.tz_localize(None))
-        assert list(ts.time_index.tz_localize("CET")) == list(time_range_MS)
-        assert ts.time_index.tz is None
-
-        serie = pd.Series(data=values, index=time_range_MS)
-        ts = TimeSeries.from_series(pd_series=serie)
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend, index=True),
+            time_col=None if backend == "pandas" else "index",
+        )
         assert list(ts.time_index) == list(time_range_MS.tz_localize(None))
         assert list(ts.time_index.tz_localize("CET")) == list(time_range_MS)
         assert ts.time_index.tz is None
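The assertions above check that the timezone is dropped while the wall-clock times are kept, as opposed to converting to UTC first. A small pandas-only sketch of the difference, using illustrative January timestamps (CET is UTC+1 then); the variable names are not from the PR:

```python
import pandas as pd

# Two timezone-aware timestamps in CET.
aware = pd.DatetimeIndex(["2020-01-01 00:00", "2020-01-01 01:00"]).tz_localize("CET")

# Drop the timezone but keep the wall-clock time (what the tests assert).
naive_wall_clock = aware.tz_localize(None)

# For contrast: convert to UTC first, then drop the timezone.
naive_utc = aware.tz_convert(None)

print(list(naive_wall_clock))
print(list(naive_utc))
```

Only the first variant round-trips through `tz_localize("CET")` back to the original index, which is the property the second assertion in each block relies on.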
@@ -2643,23 +2677,42 @@ def test_time_col_with_tz(self):
         values = np.random.uniform(low=-10, high=10, size=len(time_range_H))

         df = pd.DataFrame(data=values, index=time_range_H)
-        ts = TimeSeries.from_dataframe(df=df)
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend, index=True),
+            time_col=None if backend == "pandas" else "index",
+        )
         assert list(ts.time_index) == list(time_range_H.tz_localize(None))
         assert list(ts.time_index.tz_localize("CET")) == list(time_range_H)
         assert ts.time_index.tz is None

-        series = pd.Series(data=values, index=time_range_H)
-        ts = TimeSeries.from_series(pd_series=series)
+        ts = TimeSeries.from_times_and_values(times=time_range_H, values=values)
         assert list(ts.time_index) == list(time_range_H.tz_localize(None))
         assert list(ts.time_index.tz_localize("CET")) == list(time_range_H)
         assert ts.time_index.tz is None

-        ts = TimeSeries.from_times_and_values(times=time_range_H, values=values)
+    def test_time_col_with_tz_series(self):
+        time_range_MS = pd.date_range(
+            start="20180501", end="20200301", freq="MS", tz="CET"
+        )
+        values = np.random.uniform(low=-10, high=10, size=len(time_range_MS))
+        serie = pd.Series(data=values, index=time_range_MS)
+        ts = TimeSeries.from_series(pd_series=serie)
+        assert list(ts.time_index) == list(time_range_MS.tz_localize(None))
+        assert list(ts.time_index.tz_localize("CET")) == list(time_range_MS)
+        assert ts.time_index.tz is None
+
+        time_range_H = pd.date_range(
+            start="20200518", end="20200521", freq=freqs["h"], tz="CET"
+        )
+        values = np.random.uniform(low=-10, high=10, size=len(time_range_H))
+        series = pd.Series(data=values, index=time_range_H)
+        ts = TimeSeries.from_series(pd_series=series)
         assert list(ts.time_index) == list(time_range_H.tz_localize(None))
         assert list(ts.time_index.tz_localize("CET")) == list(time_range_H)
         assert ts.time_index.tz is None

-    def test_time_col_convert_garbage(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_time_col_convert_garbage(self, backend):
         expected = [
             "2312312asdfdw",
             "asdfsdf432sdf",
@@ -2674,9 +2727,12 @@ def test_time_col_convert_garbage(self):
         df = pd.DataFrame(data_dict)

         with pytest.raises(AttributeError):
-            TimeSeries.from_dataframe(df=df, time_col="Time")
+            TimeSeries.from_dataframe(
+                df=self.pd_to_backend(df, backend), time_col="Time"
+            )

-    def test_df_named_columns_index(self):
+    @pytest.mark.parametrize("backend", ["pandas", "polars"])
+    def test_df_named_columns_index(self, backend):
         time_index = generate_index(
             start=pd.Timestamp("2000-01-01"), length=4, freq="D", name="index"
         )
@@ -2686,7 +2742,10 @@ def test_df_named_columns_index(self):
             columns=["y"],
         )
         df.columns.name = "id"
-        ts = TimeSeries.from_dataframe(df)
+        ts = TimeSeries.from_dataframe(
+            df=self.pd_to_backend(df, backend, index=True),
+            time_col=None if backend == "pandas" else "index",
+        )

         exp_ts = TimeSeries.from_times_and_values(
             times=time_index,