[ENH] Support unequal length time series with TSFresh #3179

@jsquaredosquared

Description

Describe the feature or idea you want to propose

The TSFresh transformer, and the estimators built on it, do not support unequal-length time series, even though (per the first question in the tsfresh FAQ) tsfresh itself is capable of handling them.
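
For reference, plain tsfresh already handles unequal lengths when given long-format input; a minimal sketch (the column names here are just illustrative):

import pandas as pd
from tsfresh import extract_features

# Two cases of different lengths stacked in long format; tsfresh groups
# rows by the id column, so nothing needs to be padded.
df_long = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],        # case 0 has 3 time points, case 1 has 2
    "time": [0, 1, 2, 0, 1],
    "kind": ["dim_0"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

features = extract_features(
    df_long,
    column_id="id",
    column_sort="time",
    column_kind="kind",
    column_value="value",
)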

Describe your proposed solution

Modify the function below so that it accepts lists of 2D numpy arrays in addition to 3D numpy arrays:

import numpy as np
import pandas as pd


def _from_3d_numpy_to_long(arr):
    # Convert the 3D numpy array to a long-format DataFrame
    n_cases, n_channels, n_timepoints = arr.shape
    # One row per (case, channel) pair, one column per time point
    df = pd.DataFrame(arr.reshape(n_cases * n_channels, n_timepoints))
    df["case_index"] = np.repeat(np.arange(n_cases), n_channels)
    df["dimension"] = np.tile(np.arange(n_channels), n_cases)
    df = df.melt(
        id_vars=["case_index", "dimension"], var_name="time_index", value_name="value"
    )
    # Reorder and rename columns to the layout tsfresh expects
    df = df[["case_index", "time_index", "dimension", "value"]]
    df = df.rename(columns={"case_index": "index", "dimension": "column"})
    df["column"] = "dim_" + df["column"].astype(str)
    return df
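
For concreteness, the layout this produces is one row per observation:

X = np.arange(12, dtype=float).reshape(2, 2, 3)  # 2 cases, 2 channels, 3 time points
long_df = _from_3d_numpy_to_long(X)
print(long_df.columns.tolist())  # ['index', 'time_index', 'column', 'value']
print(len(long_df))              # 12 rows, one per observation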

My attempt was incredibly slow on 3D numpy arrays:

from joblib import Parallel, delayed


def _from_3d_list_to_long(list_):
    def _convert_case_to_long_df(case, index):
        df = (
            pd.DataFrame(case)
            .transpose()
            .melt(var_name="column", ignore_index=False)
            .reset_index()
            .rename(columns={"index": "time_index"})
        )

        df["index"] = np.repeat(index, len(df))

        return df

    long_dfs = Parallel()(
        delayed(_convert_case_to_long_df)(case, index)
        for index, case in enumerate(list_)
    )

    combined_dfs = pd.concat(
        long_dfs,
        ignore_index=True,
    )
    combined_dfs = combined_dfs[["index", "time_index", "column", "value"]]
    combined_dfs["column"] = "dim_" + combined_dfs["column"].astype(str)

    return combined_dfs

I asked Gemini to speed it up, and this was the result:

def _from_3d_list_to_long_optimized(list_):
    def _convert_case_to_long_df(case, index):
        # NOTE: This is the optimized version using .T and .stack()
        df = pd.DataFrame(case).T
        df.columns = df.columns.map(lambda i: "dim_" + str(i))

        df_long = df.stack().reset_index()
        df_long.columns = ["time_index", "column", "value"]
        df_long["index"] = index
        return df_long

    # Keeping Parallel() as it is necessary for variable length lists
    long_dfs = Parallel()(  # Use n_jobs=-1 to use all cores
        delayed(_convert_case_to_long_df)(case, index)
        for index, case in enumerate(list_)
    )

    # pd.concat is still required to combine the results
    combined_dfs = pd.concat(
        long_dfs,
        ignore_index=True,
    )

    # Final column ordering (column renaming is now done inside the loop)
    combined_dfs = combined_dfs[["index", "time_index", "column", "value"]]

    return combined_dfs

Perhaps there is an even better way of implementing it. Or, if the list-based version cannot match the speed of the original function, a check could be added to pick the conversion function based on the input type (a rough sketch follows).
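
As an untested sketch (the names _from_2d_list_to_long and _to_long are hypothetical): the per-case melt and concat could be skipped entirely by building the long-format columns with NumPy in a single pass, with a type check dispatching between the equal-length and unequal-length paths:

import numpy as np
import pandas as pd


def _from_2d_list_to_long(list_):
    # Build the long format directly: for each (n_channels, n_timepoints_i)
    # case, reshape(-1) yields channel 0's series, then channel 1's, etc.
    # (row-major order), so the id columns can be generated to match.
    n_channels = list_[0].shape[0]
    lengths = np.array([case.shape[1] for case in list_])

    values = np.concatenate([case.reshape(-1) for case in list_])
    case_index = np.repeat(np.arange(len(list_)), lengths * n_channels)
    dimension = np.concatenate([np.repeat(np.arange(n_channels), n) for n in lengths])
    time_index = np.concatenate([np.tile(np.arange(n), n_channels) for n in lengths])

    df = pd.DataFrame(
        {
            "index": case_index,
            "time_index": time_index,
            "column": dimension,
            "value": values,
        }
    )
    df["column"] = "dim_" + df["column"].astype(str)
    return df


def _to_long(X):
    # Dispatch on input type: keep the fast 3D path for equal-length data,
    # fall back to the list path for unequal-length data.
    if isinstance(X, np.ndarray) and X.ndim == 3:
        return _from_3d_numpy_to_long(X)
    return _from_2d_list_to_long(X)

The row order differs from the melt-based version (case-major rather than time-major), but as far as I can tell tsfresh sorts by column_sort internally, so the order should not matter.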

Describe alternatives you've considered, if relevant

No response

Additional context

No response

Metadata

Labels

enhancement (New feature, improvement request or other non-bug code enhancement), transformations (Transformations package)
