
Narwhals implementation of from_dataframe and performance benchmark #2661

Merged

Conversation

authierj
Contributor

@authierj authierj commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

A first draft of from_dataframe has been adapted to work with any dataframe backend. This is done using narwhals, and the new function is called from_narwhals_dataframe. To benchmark the method, a script, narwhals_test_time.py, has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.

Other Information


@MarcoGorelli MarcoGorelli left a comment


thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

@authierj
Contributor Author

Hi @MarcoGorelli ,

Thanks for already looking at this and for your insights!

@authierj
Contributor Author

authierj commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. The investigation continues!
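For context, the expression in question converts the frame to a 2D array and appends a trailing axis, producing the 3D (time, component, sample) layout Darts uses; a minimal sketch with an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_0": [1.0, 2.0, 3.0], "col_1": [4.0, 5.0, 6.0]})

# (n_time, n_components) -> (n_time, n_components, 1). The np.newaxis
# indexing itself is a cheap view; per the discussion above, the observed
# cost was in materializing the array from the dataframe.
values = df.to_numpy()[:, :, np.newaxis]
```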

Contributor

@FBruzzesi FBruzzesi left a comment


Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂


codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.09%. Comparing base (e086582) to head (3fa924f).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| darts/timeseries.py | 91.83% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.09%   -0.09%     
==========================================
  Files         141      141              
  Lines       15582    15601      +19     
==========================================
+ Hits        14674    14679       +5     
- Misses        908      922      +14     


@authierj
Contributor Author

To compare the performance of the methods from_dataframe() (which only accepts pandas DataFrames as input) and from_narwhals_dataframe() (which accepts any supported DataFrame), I implemented the script narwhals_test_time.py. This script calls the two functions on a large number of different pandas DataFrames, with shuffled or unshuffled data and varying sizes, indices, and datetime formats.

Averaged over 10 runs, the processing times are as follows:

| method | average processing time [s] |
|---|---|
| from_dataframe() | 10.9718 |
| from_narwhals_dataframe() | 9.8564 |

Therefore, from_narwhals_dataframe() is 1.1154 seconds faster than from_dataframe(), representing a 10.17% decrease in processing time on average.
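The reported speedup follows directly from the two averages:

```python
t_pandas = 10.9718    # from_dataframe(), seconds
t_narwhals = 9.8564   # from_narwhals_dataframe(), seconds

saving = t_pandas - t_narwhals   # absolute saving in seconds
pct = saving / t_pandas * 100    # relative decrease in processing time

print(f"{saving:.4f} s faster, {pct:.2f}% decrease")
# -> 1.1154 s faster, 10.17% decrease
```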

As a consequence of this significant result, I will change the implementation of from_dataframe() and also modify from_series() to use the narwhals approach.

@authierj authierj marked this pull request as ready for review February 14, 2025 12:22
@authierj
Contributor Author

authierj commented Feb 21, 2025

The script used for testing the time performance of the methods is the following:

import argparse
import json
import time
import warnings
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

from darts.timeseries import TimeSeries

# Suppress all warnings
warnings.filterwarnings("ignore")


def test_from_dataframe(f_name: str):
    return getattr(TimeSeries, f_name)


def create_random_dataframes(
    num_rows: int = 10,
    num_columns: int = 3,
    index: bool = True,
    col_names_given: bool = True,
    start_date: str = "1900-01-01",
    freq: str = "D",
) -> list:
    """
    Create three pandas DataFrames with random data and a date, datetime64, or
    integer time axis stored as the index or as a column.

    Parameters:
    - num_rows (int): The number of rows in the DataFrames.
    - num_columns (int): The number of columns in the DataFrames.
    - index (bool): If True, the time values form the DataFrame index. If False,
      they are stored in a column named 'date'.
    - col_names_given (bool): If True, the column names are passed explicitly to
      the tested factory method; otherwise None is passed.
    - start_date (str): The start date for the date range.
    - freq (str): The frequency of the date range.

    Returns:
    - list: Three [DataFrame, col_names, time_col] triples (df_date, df_numpy, df_integer).
    """
    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate a date range or integer list based on the date_format parameter
    date_values = pd.date_range(start=start_date, periods=num_rows, freq=freq)
    integer_values = list(range(1, num_rows + 1))
    numpy_values = np.array(
        pd.date_range(start=start_date, periods=num_rows, freq=freq),
        dtype="datetime64[D]",
    )

    # Create random data for the DataFrames
    data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_columns)}

    # Create the DataFrames
    df_date = pd.DataFrame(data)
    df_numpy = pd.DataFrame(data)
    df_integer = pd.DataFrame(data)

    if col_names_given:
        col_names = df_date.columns.values
    else:
        col_names = None

    # Set the date as index or as a column based on the index parameter
    if index:
        df_date.index = date_values
        df_numpy.index = numpy_values
        df_integer.index = integer_values
    else:
        df_date["date"] = date_values
        df_numpy["date"] = numpy_values
        df_integer["date"] = integer_values

    if index:
        time_col = None
    else:
        time_col = "date"

    return [
        [df_date, col_names, time_col],
        [df_numpy, col_names, time_col],
        [df_integer, col_names, time_col],
    ]


def test_dataframes() -> list:
    test_config = product(
        [10, 100, 1000, 10000, 100000],
        [100],
        [True, False],
        [True, False],
    )

    dataframes_list = [
        create_random_dataframes(
            num_rows=num_rows,
            num_columns=num_columns,
            index=index,
            col_names_given=col_names_given,
        )
        for num_rows, num_columns, index, col_names_given in test_config
    ]

    return dataframes_list


def calculate_processing_time(
    f_name: str,
    num_iter: int,
    save_path="data/",
):
    df_list = test_dataframes()
    df_func = test_from_dataframe(f_name)

    # Initialize dictionaries to store processing times
    times = {}

    # Initialize the progress bar
    total_iterations = (
        len(df_list) * 2 * 3
    )  # 2 iterations per dataframe configuration, 3 df per config
    progress_bar = tqdm(total=total_iterations, desc="Processing DataFrames")

    for df_config in df_list:
        for df, col_names, time_col in df_config:
            num_rows = len(df)
            dict_entry = str(num_rows)

            for i in range(2):
                # on the second run we shuffle the data
                if i == 1:
                    df = df.sample(frac=1)
                    dict_entry += "_shuffled"

                begin = time.time()
                for _ in range(num_iter):
                    _ = df_func(df, value_cols=col_names, time_col=time_col, freq=None)
                end = time.time()
                timer = (end - begin) / num_iter

                if dict_entry not in times:
                    times[dict_entry] = timer
                else:
                    times[dict_entry] += timer

                # Update the progress bar
                progress_bar.update(1)

    file_name = f"{f_name}_avg_time_{num_iter}_iter.json"

    # Store the average times in a JSON file (the save_path directory must exist)
    with open(save_path + file_name, "w") as f:
        json.dump(times, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Time a TimeSeries factory method over many DataFrame configurations"
    )
    parser.add_argument(
        "--f_name", type=str, default="from_dataframe", help="method to time"
    )
    parser.add_argument(
        "--n_iter", type=int, default=100, help="number of function calls per configuration"
    )

    args = parser.parse_args()

    f_name = args.f_name
    n_iter = args.n_iter

    calculate_processing_time(f_name, n_iter)

Collaborator

@dennisbader dennisbader left a comment


Thanks for this great PR @authierj 🚀

This will be a great addition to Darts :)

I added a couple of suggestions, mainly on how to further simplify some things here and there, and improve the documentation

@hrzn
Contributor

hrzn commented Feb 22, 2025

This is very cool and I'm sure will make many users' lives easier!
I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Contributor

@FBruzzesi FBruzzesi left a comment


Hey @authierj , I left a few suggestions and considerations in the from_* functions! Hope they help :)

Comment on lines 722 to 729
raise_log(
ValueError(
"No time column or index found in the DataFrame. `time_col=None` "
"is only supported for pandas DataFrame which is indexed with one of the "
"supported index types: a DatetimeIndex, a RangeIndex, or an integer "
"Index that can be converted into a RangeIndex.",
),
)
Contributor


Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?

Contributor Author


We do! The condition np.issubdtype(time_index.dtype, np.integer) is True if the index is np.arange(len(df)) :)
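The check referenced here can be seen in isolation (a sketch; time_index is an illustrative name matching the discussion, not the PR's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_0": [0.1, 0.2, 0.3]})
time_index = pd.Index(np.arange(len(df)))  # default-style 0..n-1 integer index

# True for any integer dtype, so a plain np.arange(len(df)) index passes
# and can be converted into a pandas RangeIndex downstream.
is_integer_index = np.issubdtype(time_index.dtype, np.integer)
```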

Collaborator

@dennisbader dennisbader Feb 28, 2025


I believe what @FBruzzesi meant was that if no time_col is given, and the DF doesn't have an index (I assume that's what's the case with polars), should we assign a range index? Otherwise, the user would have to add this index manually to the polars df.

For pandas this case will never exist, but for the others.

We could do below, and for the beginning raise a warning instead of an error:

if time_index is None:
    time_index = pd.RangeIndex(len(df))
    logger.info(
        "No time column specified (`time_col=None`) and no index found in the DataFrame. Defaulting to "
        "`pandas.RangeIndex(len(df))`. If this is not desired consider adding a time column "
        "to your dataframe and defining `time_col`."
    )
elif not ...
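A runnable form of this fallback suggestion (resolve_time_index and the module-level logger are illustrative stand-ins, not the PR's actual code):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def resolve_time_index(df_len: int, time_index=None) -> pd.Index:
    """Fall back to a RangeIndex when no time column or index is available."""
    if time_index is None:
        time_index = pd.RangeIndex(df_len)
        logger.info(
            "No time column specified (`time_col=None`) and no index found in the "
            "DataFrame. Defaulting to `pandas.RangeIndex(len(df))`. If this is not "
            "desired consider adding a time column to your dataframe and defining "
            "`time_col`."
        )
    return time_index


idx = resolve_time_index(5)
```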

Contributor


Yes, exactly: non-pandas cases would end up raising if time_col is not provided. It's a design choice you will have to make, but I wanted to point out that this is the case :)

@dennisbader
Collaborator

This is very cool and I'm sure will make many users' lives easier! I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Agreed @hrzn :) To any dataframe support will be added in another PR.

Collaborator

@dennisbader dennisbader left a comment


Very nice @authierj 🚀 This looks great!

Also thanks @FBruzzesi for the additional comments.

Just some minor suggestions, then we're ready.


@dennisbader dennisbader merged commit 24cec52 into unit8co:master Feb 28, 2025
9 checks passed
Labels
feature request, improvement
Projects
Status: Released
Development

Successfully merging this pull request may close these issues.

Add TimeSeries.from_polars
5 participants