diff --git a/examples/distances/sklearn_distances.ipynb b/examples/distances/sklearn_distances.ipynb index e22579828c..4f3129e50d 100644 --- a/examples/distances/sklearn_distances.ipynb +++ b/examples/distances/sklearn_distances.ipynb @@ -29,7 +29,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:23.882870Z", @@ -70,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:24.416318Z", @@ -83,7 +83,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Training set shape = (50, 1, 150) -> this works with sklearn\n", + "Training set shape = (50, 150) -> this works with sklearn\n", "kNN with manhattan distance on 2D time series data ['1' '2' '2' '1' '1']\n", "\n", "Training set shape = (50, 1, 150) -> sklearn will crash as is a 3D array\n", @@ -127,7 +127,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:26.806946Z", @@ -184,7 +184,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -193,13 +193,13 @@ "text": [ "Graph of neighbors for the first pattern in testing set with EDR distance on 2Dtime series data:\n", "[[0. 0. 0. 0. 0. 0.\n", + " 0. 0. 0. 0. 0. 0.01333333\n", " 0. 0. 0. 0. 0. 0.\n", + " 0. 0. 0. 0. 0.01333333 0.\n", + " 0. 0. 0. 0.01333333 0. 0.\n", " 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0.\n", - " 0. 0. 0. 0. 0.00666667 0.00666667\n", - " 0.00666667 0. 0. 0. 0. 0.\n", - " 0. 0. 0. 0. 0. 0.\n", " 0. 0. 
]]\n", "Note that [i,j] has the weight of edge that connects i to j.\n", "\n", @@ -259,7 +259,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:31.418483Z", @@ -316,7 +316,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -357,14 +357,14 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "aeon kNN with MSM distance on 3D MTSC time series data ['standing' 'walking' 'running' 'standing' 'standing']\n", + "aeon kNN with MSM distance on 3D MTSC time series data ['standing' 'badminton' 'running' 'standing' 'standing']\n", "sklearn kNN with MSM distance on 2D MTSC time series data ['standing' 'badminton' 'running' 'standing' 'standing']\n" ] } @@ -404,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 12, "metadata": {}, "outputs": [], "source": [ @@ -432,7 +432,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -492,7 +492,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:32.354262Z", @@ -552,7 +552,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -608,7 +608,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -630,7 +630,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:32.360245Z", @@ -643,8 +643,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "SVC with TWE first five predictions = ['2' '1' '1' '2' '1']\n", - "Time with callable function: 2.102813597768545\n" + "SVC with TWE 
first five predictions = ['2' '1' '1' '2' '1']\n" ] } ], @@ -668,15 +667,14 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "SVC with precomputed TWE first five predictions = ['2' '1' '1' '2' '1']\n", - "Time with precomputed distances: 12.786979459226131\n" + "SVC with precomputed TWE first five predictions = ['2' '1' '1' '2' '1']\n" ] } ], @@ -702,7 +700,7 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ @@ -721,7 +719,7 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -768,7 +766,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:40.488336Z", @@ -798,7 +796,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:41.393371Z", @@ -866,7 +864,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:41.472132Z", @@ -881,7 +879,7 @@ "text": [ "Shape (FunctionTransformer) = (67, 67)\n", "Shape (msm_pairwise_distance) = (67, 67)\n", - "These values are the same 7.595223506000001, 7.595223506000001 and 7.595223506000001.\n" + "These values are the same 7.595223506000001, 7.595223506000001 and 7.595223506000001.\n" ] } ], @@ -919,7 +917,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2024-06-17T23:21:41.503125Z", @@ -949,6 +947,235 @@ " pipe.predict(X)[:5],\n", ")" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Formatting Guidance\n", + "`aeon` works with **3D time-series data**, where: \n", + "- `n_cases` represents the number of samples (time series 
instances). \n", + "- `n_channels` refers to the number of features or variables measured. \n", + "- `n_timepoints` is the number of time steps in each series. \n", + "\n", + "However, **scikit-learn (`sklearn`) algorithms expect 2D input** in the shape of `(n_samples, n_features)`. Since most `sklearn` models cannot directly process 3D time-series data, we must **flatten** each time series from `(n_cases, n_channels, n_timepoints)` into a **2D format** of shape `(n_cases, n_channels * n_timepoints)`. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Original shape: (50, 1, 150)\n", + "Converted shape (for sklearn): (50, 150)\n" + ] + } + ], + "source": [ + "# aeon works with 3D time-series data: (n_cases, n_channels, n_timepoints)\n", + "print(f\"Original shape: {X_train_3D.shape}\")\n", + "\n", + "# Convert 3D to 2D for sklearn compatibility (flatten time series)\n", + "X_train_2D_flat = X_train_3D.reshape(X_train_3D.shape[0], -1)\n", + "X_test_2D_flat = X_test_3D.reshape(X_test_3D.shape[0], -1)\n", + "\n", + "print(f\"Converted shape (for sklearn): {X_train_2D_flat.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "y_train_3D = y_train_3D.astype(float)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using aeon distances in sklearn regression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We apply k-Nearest Neighbors Regression using Dynamic Time Warping (DTW) as the distance metric. The model learns from the preprocessed dataset and predicts continuous values for test instances." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "kNN Regressor with DTW distance on time series data: [1. 1.8 2. 1. 
1.6]\n" + ] + } + ], + "source": [ + "# Apply KNeighborsRegressor with DTW distance\n", + "knn_regressor = KNeighborsRegressor(metric=dtw_distance)\n", + "knn_regressor.fit(X_train_2D_flat, y_train_3D)\n", + "\n", + "# Make predictions\n", + "predictions_reg = knn_regressor.predict(X_test_2D_flat[:5])\n", + "\n", + "print(f\"kNN Regressor with DTW distance on time series data: {predictions_reg}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using aeon with sklearn for Classification" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use k-Nearest Neighbors classification with precomputed Dynamic Time Warping (DTW) distances. The classifier is fit on the square train-to-train distance matrix and predicts class labels from the test-to-train distance matrix." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Classification Accuracy: 0.7085714285714285\n" + ] + } + ], + "source": [ + "from aeon.datasets import load_arrow_head  # ArrowHead example dataset\n", + "from aeon.distances import pairwise_distance\n", + "\n", + "# Load example dataset (3D time-series format)\n", + "X_train_3D, y_train = load_arrow_head(split=\"train\")\n", + "X_test_3D, y_test = load_arrow_head(split=\"test\")\n", + "\n", + "# Convert 3D data into 2D (flattened copies; not used by the precomputed classifier below)\n", + "X_train_2D = X_train_3D.reshape(X_train_3D.shape[0], -1)\n", + "X_test_2D = X_test_3D.reshape(X_test_3D.shape[0], -1)\n", + "\n", + "# Compute DTW distance matrices (method=\"dtw\")\n", + "X_train_dist = pairwise_distance(\n", + " X_train_3D, X_train_3D, method=\"dtw\"\n", + ")  # shape (n_train, n_train)\n", + "X_test_dist = pairwise_distance(\n", + " X_test_3D, X_train_3D, method=\"dtw\"\n", + ")  # shape (n_test, n_train): rows are test cases, columns are training cases\n", + "\n", + "# Create classifier using computed distances\n", + "clf = KNeighborsClassifier(n_neighbors=3, metric=\"precomputed\")\n", + 
"clf.fit(X_train_dist, y_train)\n", + "\n", + "# Predict and evaluate\n", + "y_pred = clf.predict(X_test_dist)\n", + "print(\"Classification Accuracy:\", accuracy_score(y_test, y_pred))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using aeon with sklearn for Clustering" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This section applies K-Means clustering to time-series data using the Dynamic Time Warping (DTW) distance as a similarity measure. Instead of raw data, we use precomputed distance matrices, ensuring compatibility with clustering algorithms." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cluster labels: [0 0 0 1 0 0 1 0 0 1 0 0 1 0 2 1 2 2 1 0 0 0 0 1 1 0 1 0 0 0 0 0 2 1 0 0]\n" + ] + } + ], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "from aeon.distances import pairwise_distance\n", + "\n", + "# Compute DTW distance matrix\n", + "X_dist = pairwise_distance(X_train_3D, X_train_3D, method=\"dtw\")\n", + "\n", + "# Apply K-Means clustering with precomputed distances\n", + "kmeans = KMeans(n_clusters=3, random_state=42)\n", + "labels = kmeans.fit_predict(X_dist)\n", + "\n", + "# Display results\n", + "print(\"Cluster labels:\", labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cross-validation with aeon and sklearn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This section demonstrates **cross-validation** for a classification model using **Dynamic Time Warping (DTW) distances**. Cross-validation helps evaluate model performance by testing on different subsets of the dataset. \n", + "\n", + "Steps: \n", + "1. Use **5-fold cross-validation** to train and test the **kNN classifier** with precomputed **DTW distances**. \n", + "2. Compute **accuracy scores** for each fold. \n", + "3. 
Calculate the **mean accuracy** to assess overall model performance. \n", + "\n", + "Averaging over folds gives a more stable estimate of generalization performance than a single train/test split; it does not by itself prevent overfitting." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cross-validation scores: [0.625 0.85714286 0.71428571 0.57142857 0.57142857]\n", + "Mean accuracy: 0.6678571428571429\n" + ] + } + ], + "source": [ + "# Example: Cross-validation using aeon distances in sklearn\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "# 5-fold cross-validation of the precomputed-distance kNN classifier\n", + "scores = cross_val_score(clf, X_train_dist, y_train, cv=5, scoring=\"accuracy\")\n", + "print(\"Cross-validation scores:\", scores)\n", + "print(\"Mean accuracy:\", scores.mean())" + ] + } + ], + "metadata": { @@ -967,7 +1194,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.12.2" } }, "nbformat": 4,