TFIDF.transform_many() fails on DataFrame input #1576

Open
@bdewilde

Description

Versions

river version: 0.21.2
Python version: 3.11.7
Operating system: macOS 14.4

Describe the bug

The TFIDF feature extractor claims to support both online and mini-batch transformations, but mini-batch mode only works when the transformer doesn't specify the on parameter. In other words, transform_many() works for pd.Series input, but not for pd.DataFrame input.

Steps/code to reproduce

import pandas as pd
import river.feature_extraction

model = river.feature_extraction.TFIDF()
X = pd.Series(["foo bar bat baz", "foo bar spam eggs"])
for rec in X:
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# WORKS
#    foo  bar  bat  baz  spam  eggs
# 0    1    1    1    1     0     0
# 1    1    1    0    0     1     1

model = river.feature_extraction.TFIDF(on="text")
X = pd.DataFrame([{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}])
for rec in X.to_dict(orient="records"):
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# DOES NOT WORK

That last call produces the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[95], line 1
----> 1 print(model.clone().transform_many(X))

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:349, in BagOfWords.transform_many(self, X)
    347 for d in X:
    348     t: int
--> 349     for t, f in collections.Counter(self.process_text(d)).items():
    350         indices.append(index.setdefault(t, len(index)))
    351         data.append(f)

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:220, in VectorizerMixin.process_text(self, x)
    218 def process_text(self, x):
    219     for step in self.processing_steps:
--> 220         x = step(x)
    221     return x

TypeError: string indices must be integers, not 'str'
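
My guess at the cause, sketched with pandas only (no river code involved): the traceback shows transform_many() running `for d in X`, and iterating a DataFrame yields column labels rather than rows. Each `d` would then be the string "text", and whatever step implements on="text" ends up indexing a string with a string. The `select_on` helper below is a hypothetical stand-in for that step, not river's actual implementation.

```python
import pandas as pd

X = pd.DataFrame([{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}])

# Iterating a DataFrame yields its column labels, not its rows.
items = list(X)


def select_on(x, on="text"):
    """Hypothetical stand-in for the step that on="text" adds (an assumption)."""
    return x[on]


# Works on a row dict, which is what transform_one() receives.
row_value = select_on(X.to_dict(orient="records")[0])

# Fails on a column label, which is what `for d in X` hands over.
try:
    select_on(items[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(items)
print(row_value)
print(error_message)
```

If that reading is right, passing the column as a Series (X["text"]) to a TFIDF constructed without on should sidestep the error, since the Series case above already works.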
