add TF-IDF Transformer #1014

Open · wants to merge 8 commits into main

Conversation

@pr38 pr38 commented Mar 11, 2025

Hi,

I am trying to add TfidfTransformer to dask-ml. I am going to assume that adding TfidfVectorizer can easily be done later down the line by stacking CountVectorizer and TfidfTransformer (see the sketch at the end of this comment).

This is not the first attempt to add TF-IDF to dask-ml (https://github.com/dask/dask-ml/pull/869/files#diff-421c76129d7900a3ae83237eab5785858c0652d9f550d92e0202e45ebdcb2977). I can't comment on the differences between my attempt and previous attempts, but I can guarantee that my implementation only uses the dask array API and no other API.
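For concreteness, here is a minimal sketch of the stacking idea using a plain scikit-learn Pipeline. The TfidfTransformer import path is an assumption based on where this PR adds the class, and the toy corpus is made up:

```python
import dask.bag as db
from sklearn.pipeline import Pipeline

from dask_ml.feature_extraction.text import CountVectorizer
# Assumed import path for the class added in this PR; it may differ once merged.
from dask_ml.feature_extraction.text import TfidfTransformer

# Toy corpus as a dask bag of documents, the input CountVectorizer expects.
corpus = db.from_sequence(
    ["the quick brown fox", "the lazy dog", "the quick dog"],
    npartitions=2,
)

# A TfidfVectorizer equivalent built by stacking the two estimators.
tfidf = Pipeline([("counts", CountVectorizer()), ("tfidf", TfidfTransformer())])
X = tfidf.fit_transform(corpus)  # dask array of TF-IDF weights
```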

@pr38 pr38 mentioned this pull request Mar 12, 2025
Member

@stsievert stsievert left a comment

Thanks for the contribution! At first glance, this looks pretty good. I'd appreciate some more review.

All these comments and linting errors are minor. If you want to run the linting yourself, look at https://ml.dask.org/contributing.html#style

Author

pr38 commented Mar 14, 2025

ok, I am ready for more feedback.

I should mention that I am running compute_chunk_sizes when the number of rows is not known. I have seen maintainers complain about others doing that here. If you don't want me to run compute_chunk_sizes, maybe we can make sure that CountVectorizer keeps chunk sizes, to avoid foot-guns.
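For context, a minimal illustration (with a made-up array and mask) of the unknown-chunk-size situation and what compute_chunk_sizes does:

```python
import dask.array as da
import numpy as np

x = da.from_array(np.arange(20).reshape(10, 2), chunks=(5, 2))

# Boolean indexing leaves the row chunk sizes unknown (reported as nan).
y = x[x[:, 0] % 4 == 0]
print(y.chunks)  # ((nan, nan), (2,))

# compute_chunk_sizes() runs a computation just to materialize those sizes,
# which is the extra cost being discussed above.
y.compute_chunk_sizes()
print(y.chunks)  # ((3, 2), (2,))
```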

@pr38 pr38 requested a review from stsievert March 17, 2025 22:00
@pr38 pr38 requested a review from stsievert March 24, 2025 04:13
Author

pr38 commented Mar 30, 2025

There are a couple of extra things I want to point out. While scikit-learn returns a scipy.sparse.coo_array, I return a sparse.COO. I don't know how well the rest of dask-ml plays with sparse.COO, and I am not sure how committed dask-ml is to matching scikit-learn one-to-one.
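To illustrate the difference with a toy example (not code from this PR): the per-chunk type here is pydata/sparse COO, and converting to and from scipy.sparse is possible if downstream code needs it:

```python
import numpy as np
import sparse  # pydata/sparse, the type backing the chunks here

dense = np.array([[0.0, 1.2], [3.4, 0.0]])
coo = sparse.COO.from_numpy(dense)             # what the chunks look like
as_scipy = coo.to_scipy_sparse()               # scipy.sparse COO matrix
back = sparse.COO.from_scipy_sparse(as_scipy)  # round-trip back to pydata/sparse
```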

I have also seen two other implementations of '_handle_zeros_in_scale' in the dask-ml code base. I would have used one of the existing implementations, but neither uses an epsilon to catch near-zero values.
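For reference, a sketch of what such a helper might look like (illustrative only, not the PR's actual code; the epsilon default is an assumption):

```python
import numpy as np
import dask.array as da

def _handle_zeros_in_scale(scale, eps=10 * np.finfo(float).eps):
    # Replace zero and near-zero entries with 1.0 so that dividing by the
    # scale becomes a no-op instead of producing infs/NaNs.
    return da.where(abs(scale) < eps, 1.0, scale)

scale = da.from_array(np.array([0.0, 1e-20, 2.5]), chunks=2)
print(_handle_zeros_in_scale(scale).compute())  # [1.  1.  2.5]
```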

@pr38 pr38 requested a review from stsievert March 30, 2025 23:08
Member

@stsievert stsievert left a comment

You've also got some style errors. Could you run the linting commands at https://ml.dask.org/contributing.html#style?

@pr38 pr38 requested a review from stsievert April 14, 2025 20:20