Description
Hi! I came across this package because I have a dataset of ~2 million text sequences (each under 500 characters), and I wanted faster performance than sklearn's TfidfVectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.
Fitting and transforming in sklearn takes about 5 minutes:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True,
                        norm='l2',
                        encoding='latin-1',
                        ngram_range=(1, 2),
                        stop_words=None)
%time X = tfidf.fit_transform(dataset.text)
CPU times: user 4min 27s, sys: 14.1 s, total: 4min 41s
Wall time: 4min 49s
I can see in top that this only uses a single thread.
With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):
dt = fread('dataset.csv.tar.gz')
setkey(dt, id)
prep_fun = tolower
tok_fun = word_tokenizer
my_iterator = itoken_parallel(dt$text,
                              preprocessor = prep_fun,
                              tokenizer = tok_fun,
                              ids = dt$id,
                              progressbar = TRUE)
t10 = Sys.time()
vocab = create_vocabulary(my_iterator, ngram=c(1L, 2L))
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(my_iterator, vectorizer)
# define tfidf model
tfidf = TfIdf$new(norm = 'l2', sublinear_tf = TRUE)
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!
paste('Time to build tfidf:', difftime(Sys.time(), t10, units = 'sec'))
I left it running on an AWS instance. I can see in top that 4 threads are active, but they've been running for much, much longer than 5 minutes, and eventually I had to kill the process. If I work on a smaller subset of a few thousand articles, it works fine (roughly the sketch below).
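For reference, the smaller run was roughly the sketch below; the 10,000-row cutoff is just an arbitrary value for illustration, and otherwise it's the same pipeline as above:

# same pipeline as above, restricted to an arbitrary 10k-row subset
dt_small = dt[1:10000]
it_small = itoken_parallel(dt_small$text,
                           preprocessor = prep_fun,
                           tokenizer = tok_fun,
                           ids = dt_small$id,
                           progressbar = TRUE)
vocab_small = create_vocabulary(it_small, ngram = c(1L, 2L))
dtm_small = create_dtm(it_small, vocab_vectorizer(vocab_small))
dtm_small_tfidf = fit_transform(dtm_small, TfIdf$new(norm = 'l2', sublinear_tf = TRUE))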
Am I missing something? Or do I just lack patience? Thanks for your help.