Versions
river version: 0.15.0
Python version: 3.10.4
Operating system: macOS Ventura 13.2
Describe the bug
I tried to run one of river's clusterers (CluStream, specifically) and update a metric on each iteration. However, `metric.update` raised a KeyError. A fully reproducible example is below.
Steps/code to reproduce
import pandas as pd
from river.cluster import CluStream
from river import stream
from river.metrics import Silhouette

# Import the data
s1 = pd.read_table('http://cs.uef.fi/sipu/datasets/s1.txt',
                   sep = r"\s+",
                   names = ['x1', 'x2']).sample(5000, random_state = 42).reset_index(drop = True)

# Take a random sample for a smaller batch of the data
n_samples = 500
df_first_batch = s1.sample(n_samples).reset_index(drop = True)

clusterer = CluStream(time_window=1,
                      max_micro_clusters=30,
                      n_macro_clusters=15,
                      seed=0,
                      halflife=0.4
                      )
metric = Silhouette()

# Learn from each sample, predict its cluster, then update the metric
for x, _ in stream.iter_pandas(df_first_batch):
    clusterer = clusterer.learn_one(x)
    y_pred = clusterer.predict_one(x)
    metric = metric.update(x = x,
                           y_pred = y_pred,
                           centers = clusterer.centers)
Here's the output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[2], line 26
     24     clusterer = clusterer.learn_one(x)
     25     y_pred = clusterer.predict_one(x)
---> 26     metric = metric.update(x = x,
     27                            y_pred = y_pred,
     28                            centers = clusterer.centers)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/river/metrics/silhouette.py:71, in Silhouette.update(self, x, y_pred, centers, sample_weight)
     69 def update(self, x, y_pred, centers, sample_weight=1.0):
---> 71     distance_closest_centroid = math.sqrt(utils.math.minkowski_distance(centers[y_pred], x, 2))
     72     self._sum_distance_closest_centroid += distance_closest_centroid
     74     distance_second_closest_centroid = self._find_distance_second_closest_center(centers, x)
KeyError: 0
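The traceback fails at `centers[y_pred]` with key `0`, which suggests that `clusterer.centers` had no entry for the predicted label at that point. One possible reason is that the macro-clusters had not been built yet; that is an assumption, not something checked against river's source. Below is a minimal, river-free sketch of that failure mode and of a guard against it. `distance_to_assigned_center` is a hypothetical helper written for illustration and is not part of river.

```python
import math


def distance_to_assigned_center(centers, x, y_pred):
    """Euclidean distance from x to centers[y_pred].

    Returns None instead of raising KeyError when the predicted label
    has no center yet (for example, when `centers` is still empty).
    """
    if y_pred not in centers:
        return None
    center = centers[y_pred]
    return math.sqrt(sum((center[k] - x[k]) ** 2 for k in x))


# An empty dict of centers reproduces the lookup failure from the traceback.
empty_centers = {}
try:
    empty_centers[0]
    raised = False
except KeyError:
    raised = True

# The guarded helper skips the sample instead of raising.
skipped = distance_to_assigned_center(empty_centers, {"x1": 3.0, "x2": 4.0}, 0)

# Once a center exists for the label, the distance is computed as usual.
known = distance_to_assigned_center(
    {0: {"x1": 0.0, "x2": 0.0}}, {"x1": 3.0, "x2": 4.0}, 0
)
```

Applied to the loop in the reproduction, the same idea would be to call `metric.update(...)` only when `y_pred in clusterer.centers`. Whether skipping those early samples is acceptable for the metric is a judgment call.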