Versions
river version: 0.21.2
Python version: 3.11.7
Operating system: macOS 14.4
Describe the bug
The TFIDF feature extractor claims to support both online and mini-batch transformations, but the latter only works when the transformer doesn't specify the on parameter. In other words, batch mode works for pd.Series input, but not for pd.DataFrame input.
Steps/code to reproduce
import pandas as pd
import river.feature_extraction
model = river.feature_extraction.TFIDF()
X = pd.Series(["foo bar bat baz", "foo bar spam eggs"])
for rec in X:
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# WORKS
#    foo  bar  bat  baz  spam  eggs
# 0    1    1    1    1     0     0
# 1    1    1    0    0     1     1
model = river.feature_extraction.TFIDF(on="text")
X = pd.DataFrame([{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}])
for rec in X.to_dict(orient="records"):
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# DOES NOT WORK
That last call produces the following traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[95], line 1
----> 1 print(model.clone().transform_many(X))

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:349, in BagOfWords.transform_many(self, X)
    347 for d in X:
    348     t: int
--> 349     for t, f in collections.Counter(self.process_text(d)).items():
    350         indices.append(index.setdefault(t, len(index)))
    351         data.append(f)

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:220, in VectorizerMixin.process_text(self, x)
    218 def process_text(self, x):
    219     for step in self.processing_steps:
--> 220         x = step(x)
    221     return x

TypeError: string indices must be integers, not 'str'
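
The traceback suggests a likely cause: `for d in X` over a DataFrame yields column labels rather than rows, so the first processing step receives the string "text" and tries to index it with "text". The sketch below reproduces that mechanism in plain Python without river or pandas. It assumes the on parameter adds a step that behaves like operator.itemgetter(on), which is not confirmed here; `steps`, `frame`, and the local `process_text` are illustrative stand-ins, not river code.

```python
import operator

# Assumption: on="text" adds a step equivalent to itemgetter("text").
get_text = operator.itemgetter("text")


def process_text(x, steps):
    """Apply each processing step in order, like the loop in the traceback."""
    for step in steps:
        x = step(x)
    return x


# Hypothetical pipeline: extract the field, then tokenize on whitespace.
steps = [get_text, str.split]

# Row-wise records, as passed to transform_one: each item is a dict.
records = [{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}]
tokens_per_record = [process_text(rec, steps) for rec in records]

# Column-oriented stand-in for the DataFrame. Like a DataFrame,
# iterating over it yields the column labels, not the rows.
frame = {"text": ["foo bar bat baz", "foo bar spam eggs"]}
items_seen = list(frame)

# The first step now receives the label "text" and indexes a str with a str.
try:
    process_text(items_seen[0], steps)
    error = None
except TypeError as exc:
    error = exc

# Selecting the column first and skipping the field-extraction step works.
tokens_per_row = [process_text(doc, [str.split]) for doc in frame["text"]]

print(items_seen)
print(type(error).__name__)
print(tokens_per_row == tokens_per_record)
```

If this reading is right, a possible interim workaround is to pass the column itself (X["text"]) to a transformer constructed without on, which is the pd.Series path shown working above.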