add TF-IDF Transformer #1014

Open · wants to merge 8 commits into main

Conversation

@pr38 pr38 commented Mar 11, 2025

Hi,

I am trying to add TfidfTransformer to dask-ml. I am going to assume that adding TfidfVectorizer can easily be done later down the line by stacking CountVectorizer and TfidfTransformer (see the sketch at the end of this comment).

This is not the first attempt to add TF-IDF to dask-ml (https://github.com/dask/dask-ml/pull/869/files#diff-421c76129d7900a3ae83237eab5785858c0652d9f550d92e0202e45ebdcb2977). I can't comment on the differences between my attempt and previous attempts, but I can guarantee that my implementation only uses the dask array API and no other API.
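For concreteness, here is a minimal sketch of the stacking idea using a plain scikit-learn Pipeline. The TfidfTransformer import path is an assumption based on where this PR adds the class, and the toy corpus is made up:

```python
import dask.bag as db
from sklearn.pipeline import Pipeline

from dask_ml.feature_extraction.text import CountVectorizer
# Assumed import path for the class added in this PR; it may differ once merged.
from dask_ml.feature_extraction.text import TfidfTransformer

# Toy corpus as a dask bag of documents, the input CountVectorizer expects.
corpus = db.from_sequence(
    ["the quick brown fox", "the lazy dog", "the quick dog"],
    npartitions=2,
)

# A TfidfVectorizer equivalent built by stacking the two estimators.
tfidf = Pipeline([("counts", CountVectorizer()), ("tfidf", TfidfTransformer())])
X = tfidf.fit_transform(corpus)  # dask array of TF-IDF weights
```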

@pr38 pr38 mentioned this pull request Mar 12, 2025
Member

@stsievert stsievert left a comment

Thanks for the contribution! At first glance, this looks pretty good. I'd appreciate some more review.

All these comments and linting errors are minor. If you want to run the linting yourself, look at https://ml.dask.org/contributing.html#style

Author

pr38 commented Mar 14, 2025

ok, I am ready for more feedback.

I should mention that I am running compute_chunk_sizes when the number of rows is not known. I have seen maintainers complain about others doing that here. If you don't want me to run compute_chunk_sizes, maybe we can make sure that CountVectorizer keeps chunk sizes, to avoid foot-guns.
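For context, a minimal illustration (with a made-up array and mask) of the unknown-chunk-size situation and what compute_chunk_sizes does:

```python
import dask.array as da
import numpy as np

x = da.from_array(np.arange(20).reshape(10, 2), chunks=(5, 2))

# Boolean indexing leaves the row chunk sizes unknown (reported as nan).
y = x[x[:, 0] % 4 == 0]
print(y.chunks)  # ((nan, nan), (2,))

# compute_chunk_sizes() runs a computation just to materialize those sizes,
# which is the extra cost being discussed above.
y.compute_chunk_sizes()
print(y.chunks)  # ((3, 2), (2,))
```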

@pr38 pr38 requested a review from stsievert March 17, 2025 22:00
@pr38 pr38 requested a review from stsievert March 24, 2025 04:13
Author

pr38 commented Mar 30, 2025

There are a couple of extra things I want to point out. While scikit-learn returns a scipy.sparse.coo_array, I return a sparse.COO. I don't know how well the rest of dask-ml plays with sparse.COO, and I am not sure how committed dask-ml is to matching scikit-learn one-to-one.
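To illustrate the difference with a toy example (not code from this PR): the per-chunk type here is pydata/sparse COO, and converting to and from scipy.sparse is possible if downstream code needs it:

```python
import numpy as np
import sparse  # pydata/sparse, the type backing the chunks here

dense = np.array([[0.0, 1.2], [3.4, 0.0]])
coo = sparse.COO.from_numpy(dense)             # what the chunks look like
as_scipy = coo.to_scipy_sparse()               # scipy.sparse COO matrix
back = sparse.COO.from_scipy_sparse(as_scipy)  # round-trip back to pydata/sparse
```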

I have also seen two other implementations of '_handle_zeros_in_scale' in the dask-ml code base. I would have used one of the existing implementations, but neither uses an epsilon to catch near-zero values.
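For reference, a sketch of what such a helper might look like (illustrative only, not the PR's actual code; the epsilon default is an assumption):

```python
import numpy as np
import dask.array as da

def _handle_zeros_in_scale(scale, eps=10 * np.finfo(float).eps):
    # Replace zero and near-zero entries with 1.0 so that dividing by the
    # scale becomes a no-op instead of producing infs/NaNs.
    return da.where(abs(scale) < eps, 1.0, scale)

scale = da.from_array(np.array([0.0, 1e-20, 2.5]), chunks=2)
print(_handle_zeros_in_scale(scale).compute())  # [1.  1.  2.5]
```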

@pr38 pr38 requested a review from stsievert March 30, 2025 23:08
Member

@stsievert stsievert left a comment

You've also got some style errors. Could you run the linting commands at https://ml.dask.org/contributing.html#style?

@pr38 pr38 requested a review from stsievert April 14, 2025 20:20