Narwhals implementation of from_dataframe and performance benchmark #2661
Conversation
Thanks for giving this a go! I've left a couple of comments. I suspect the `.to_list()` calls may be responsible for the slow-down. I'll take a look.
Hi @MarcoGorelli, thanks for already looking at this and for your insights!

Hi @MarcoGorelli, I investigated the issue, and it appears that the
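For context, a minimal micro-benchmark (a sketch, not the Darts code; the series size is arbitrary) that illustrates why `Series.to_list()` can dominate runtime compared to `.to_numpy()`:

```python
import time

import numpy as np
import pandas as pd

# Illustrative only: to_list() creates one Python float object per element,
# while to_numpy() on a numeric Series typically returns the backing array.
s = pd.Series(np.random.randn(1_000_000))

t0 = time.perf_counter()
as_list = s.to_list()
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
as_array = s.to_numpy()
t_numpy = time.perf_counter() - t0

print(f"to_list: {t_list:.4f}s  to_numpy: {t_numpy:.4f}s")
```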
Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂
Codecov Report — Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.09%    -0.09%
==========================================
  Files         141      141
  Lines       15582    15601       +19
==========================================
+ Hits        14674    14679        +5
- Misses        908      922       +14
```
…om/authierj/darts into feature/add_timeseries_from_polars
To compare the performance of the methods, the processing times were averaged over 10 runs; the results are as follows:

As a consequence of this significant result, I will change the implementation of
The script used for testing the time performance of the methods is the following:

```python
import argparse
import json
import time
import warnings
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

from darts.timeseries import TimeSeries

# Suppress all warnings
warnings.filterwarnings("ignore")


def test_from_dataframe(f_name: str):
    return getattr(TimeSeries, f_name)


def create_random_dataframes(
    num_rows: int = 10,
    num_columns: int = 3,
    index: bool = True,
    col_names_given: bool = True,
    start_date: str = "1900-01-01",
    freq: str = "D",
) -> list:
    """
    Create three pandas DataFrames with random data and dates as the index or as a column.

    Parameters:
    - num_rows (int): The number of rows in the DataFrames.
    - num_columns (int): The number of columns in the DataFrames.
    - index (bool): If True, the date is the index of the DataFrame. If False, the date is a column named 'date'.
    - col_names_given (bool): If True, the column names are passed explicitly to the tested method.
    - start_date (str): The start date for the date range.
    - freq (str): The frequency of the date range.

    Returns:
    - list: Three [df, col_names, time_col] entries (df_date, df_numpy, df_integer).
    """
    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate a date range, an integer list, and a numpy datetime array
    date_values = pd.date_range(start=start_date, periods=num_rows, freq=freq)
    integer_values = list(range(1, num_rows + 1))
    numpy_values = np.array(
        pd.date_range(start=start_date, periods=num_rows, freq=freq),
        dtype="datetime64[D]",
    )

    # Create random data for the DataFrames
    data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_columns)}

    # Create the DataFrames
    df_date = pd.DataFrame(data)
    df_numpy = pd.DataFrame(data)
    df_integer = pd.DataFrame(data)

    if col_names_given:
        col_names = df_date.columns.values
    else:
        col_names = None

    # Set the date as index or as a column based on the index parameter
    if index:
        df_date.index = date_values
        df_numpy.index = numpy_values
        df_integer.index = integer_values
    else:
        df_date["date"] = date_values
        df_numpy["date"] = numpy_values
        df_integer["date"] = integer_values

    time_col = None if index else "date"

    return [
        [df_date, col_names, time_col],
        [df_numpy, col_names, time_col],
        [df_integer, col_names, time_col],
    ]


def test_dataframes() -> list:
    test_config = product(
        [10, 100, 1000, 10000, 100000],  # num_rows
        [100],                           # num_columns
        [True, False],                   # index
        [True, False],                   # col_names_given
    )
    dataframes_list = [
        create_random_dataframes(
            num_rows=num_rows,
            num_columns=num_columns,
            index=index,
            col_names_given=col_names_given,
        )
        for num_rows, num_columns, index, col_names_given in test_config
    ]
    return dataframes_list


def calculate_processing_time(
    f_name: str,
    num_iter: int,
    save_path="data/",
):
    df_list = test_dataframes()
    df_func = test_from_dataframe(f_name)

    # Dictionary to store processing times
    times = {}

    # Progress bar: 2 runs (ordered/shuffled) per dataframe, 3 dataframes per config
    total_iterations = len(df_list) * 2 * 3
    progress_bar = tqdm(total=total_iterations, desc="Processing DataFrames")

    for df_config in df_list:
        for df, col_names, time_col in df_config:
            num_rows = len(df)
            dict_entry = str(num_rows)
            for i in range(2):
                # on the second run we shuffle the data
                if i == 1:
                    df = df.sample(frac=1)
                    dict_entry += "_shuffled"
                begin = time.time()
                for _ in range(num_iter):
                    _ = df_func(df, value_cols=col_names, time_col=time_col, freq=None)
                end = time.time()
                timer = (end - begin) / num_iter

                if dict_entry not in times:
                    times[dict_entry] = timer
                else:
                    times[dict_entry] += timer

                # Update the progress bar
                progress_bar.update(1)

    # Store the average times in a JSON file
    file_name = f_name + "_avg_time_" + str(num_iter) + "_iter.json"
    with open(save_path + file_name, "w") as f:
        json.dump(times, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="The function to test and the number of iterations can be specified."
    )
    parser.add_argument(
        "--f_name", type=str, default="from_dataframe", help="method to time"
    )
    parser.add_argument(
        "--n_iter", type=int, default=100, help="number of function calls"
    )
    args = parser.parse_args()
    calculate_processing_time(args.f_name, args.n_iter)
```
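Not part of the PR itself, but as a sketch of how the JSON files written by the script above could be compared afterwards (the `speedup` helper and file paths are hypothetical):

```python
import json


def speedup(baseline_path: str, candidate_path: str) -> dict:
    """Return baseline_time / candidate_time for each shared configuration key."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    # keys are row counts such as "1000" or "1000_shuffled"
    return {k: baseline[k] / candidate[k] for k in baseline if k in candidate}
```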
Thanks for this great PR @authierj 🚀 This will be a great addition to Darts :)
I added a couple of suggestions, mainly on how to further simplify some things here and there, and to improve the documentation.
Co-authored-by: Dennis Bader <[email protected]>
Co-authored-by: Dennis Bader <[email protected]>
…om/authierj/darts into feature/add_timeseries_from_polars
This is very cool and I'm sure will make many users' lives easier!
Hey @authierj, I left a few suggestions and considerations in the `from_*` functions! Hope they help :)
darts/timeseries.py (Outdated)

```python
raise_log(
    ValueError(
        "No time column or index found in the DataFrame. `time_col=None` "
        "is only supported for pandas DataFrame which is indexed with one of the "
        "supported index types: a DatetimeIndex, a RangeIndex, or an integer "
        "Index that can be converted into a RangeIndex.",
    ),
)
```
Should you consider the value to be `np.arange(len(df))`, or is that too big of an assumption?
We do! The condition `np.issubdtype(time_index.dtype, np.integer)` is `True` if the index is `np.arange(len(df))` :)
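As a quick illustration of that check (toy dataframe, not the Darts code):

```python
import numpy as np
import pandas as pd

# An index created as np.arange(len(df)) has an integer dtype,
# so np.issubdtype(index.dtype, np.integer) accepts it.
df = pd.DataFrame({"col_0": [0.1, 0.2, 0.3]})
df.index = np.arange(len(df))
print(np.issubdtype(df.index.dtype, np.integer))  # → True
```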
I believe what @FBruzzesi meant was: if no `time_col` is given and the DF doesn't have an index (I assume that's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df. For pandas this case will never exist, but it can for the other backends.
We could do the below, and for the beginning raise a warning instead of an error:
```python
if time_index is None:
    time_index = pd.RangeIndex(len(df))
    logger.info(
        "No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
        "`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
        "to your dataframe and defining `time_col`."
    )
elif not ...
```
Yes exactly, non-pandas cases would end up raising if `time_col` is not provided. It's a design choice you will have to make, but I wanted to point out that that was the case :)
Agreed @hrzn :) Support for any dataframe will be added in another PR.
Co-authored-by: Dennis Bader <[email protected]>
Co-authored-by: Francesco Bruzzesi <[email protected]>
Very nice @authierj 🚀 This looks great!
Also thanks @FBruzzesi for the additional comments.
Just some minor suggestions, then we're ready.
darts/timeseries.py
Outdated
raise_log( | ||
ValueError( | ||
"No time column or index found in the DataFrame. `time_col=None` " | ||
"is only supported for pandas DataFrame which is indexed with one of the " | ||
"supported index types: a DatetimeIndex, a RangeIndex, or an integer " | ||
"Index that can be converted into a RangeIndex.", | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe what @FBruzzesi meant was that if no time_col
is given, and the DF doesn't have an index (I assume that's what's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df.
For pandas this case will never exist, but for the others.
We could do below, and for the beginning raise a warning instead of an error:
if time_index is None:
time_index = pd.RangeIndex(len(df))
logger.info(
"No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
"`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
"to your dataframe and defining `time_col`."
)
elif not ...
Co-authored-by: Dennis Bader <[email protected]>
Co-authored-by: Dennis Bader <[email protected]>
Checklist before merging this PR:
Fixes #2635.
Summary

A first draft of `from_dataframe` has been adapted to work with any dataframe. This is done using Narwhals, and the function is called `from_narwhals_dataframe`. In order to test the performance of the method, a file `narwhals_test_time.py` has been added to the pull request. With the latest commits, `from_narwhals_dataframe` is now as fast as `from_dataframe`.

Other Information