-
-
Notifications
You must be signed in to change notification settings - Fork 262
add TF-IDF Transformer #1014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add TF-IDF Transformer #1014
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! At first glance, this looks pretty good. I'd appreciate some more review.
All these comments and linting errors are minor. If you want to run the linting yourself, look at https://ml.dask.org/contributing.html#style
Co-authored-by: Scott Sievert <[email protected]>
Co-authored-by: Scott Sievert <[email protected]>
ok, I am ready for more feedback. I should mention that I am running |
There are a couple of extra things I want to point out. While scikit-learn returns a scipy.sparse.coo_array,I return a sparse.COO. I don't know how well the rest of dask-ml plays with sparse.COO, and I am not sure how committed to 1 to 1 matching scikit-learn dask-ml is. I have also seen two other implementations of '_handle_zeros_in_scale' in the dask-ml code base. I would have used the existing implementations but none used epsilon to catch near zero values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've also got some style errors. Could you look at run the linting commands at https://ml.dask.org/contributing.html#style ?
Hi,
I am trying to add TfidfTransformer to dask. I am going to assume that adding TfidfVectorizer can be easily done later down the line, by stacking CountVectorizer and TfidfTransformer .
This is not the first attempt to add tfidf to dask-ml(https://github.com/dask/dask-ml/pull/869/files#diff-421c76129d7900a3ae83237eab5785858c0652d9f550d92e0202e45ebdcb2977). I can't comment on the differences between my attempt and previous attempts; but I can guaranty that my implementation only use's the dask array api, no other api.