tfidf fitting much slower than expected #335

Open
@bogedy

Hi! I came across this package because I have a dataset of ~2 million text sequences (each under 500 characters long) and I wanted faster performance than sklearn's TfidfVectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.

It takes about 5 minutes to fit and transform with sklearn in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True,
                        norm='l2',
                        encoding='latin-1', ngram_range=(1, 2),
                        stop_words=None)

%time X = tfidf.fit_transform(dataset.text)
CPU times: user 4min 27s, sys: 14.1 s, total: 4min 41s
Wall time: 4min 49s

I can see in top that this only uses a single thread.
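One thing that may matter for the comparison is how many distinct terms `ngram_range=(1, 2)` produces, since bigrams usually enlarge the vocabulary well beyond the unigram count. The following is a minimal sketch of that counting, not sklearn's implementation: the helper name `count_ngram_vocab` is hypothetical, and it assumes lowercasing plus plain whitespace tokenization, which is simpler than sklearn's default token pattern.

```python
def count_ngram_vocab(docs, ngram_range=(1, 2)):
    """Count distinct word n-grams across docs.

    Assumption: tokens come from lowercasing and splitting on whitespace,
    which only approximates what a real vectorizer does.
    """
    lo, hi = ngram_range
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(lo, hi + 1):
            # Slide a window of n tokens over the document.
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return len(vocab)


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat"]
    print("unigrams only:", count_ngram_vocab(sample, (1, 1)))
    print("unigrams + bigrams:", count_ngram_vocab(sample, (1, 2)))
```

Running this on a sample of the real data gives a rough idea of how much larger the feature space gets once bigrams are included.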

With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):

library(data.table)  # fread, setkey
library(text2vec)    # itoken_parallel, create_vocabulary, TfIdf

dt = fread('dataset.csv.tar.gz')

setkey(dt, id)

prep_fun = tolower
tok_fun = word_tokenizer

my_iterator = itoken_parallel(dt$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = dt$id,
                  progressbar = TRUE)

t10 = Sys.time()
vocab = create_vocabulary(my_iterator, ngram=c(1L, 2L))
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(my_iterator, vectorizer)

# define tfidf model
tfidf = TfIdf$new(norm = 'l2', sublinear_tf = TRUE)
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!

paste('Time to build tfidf:', difftime(Sys.time(), t10, units = 'sec'))

I've left it running on an AWS instance. I can see in top that 4 threads are busy, but they ran for much longer than 5 minutes, and I eventually had to kill the process. If I work on a smaller subset of a few thousand articles it works fine.
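Since a few thousand articles finish fine but the full set does not, timing the same step on growing subsets could show whether the runtime grows roughly linearly or much faster. Below is a minimal sketch of such a harness using only the standard library. The names `time_on_subsets` and `fake_fit` are hypothetical, and `fake_fit` is only a stand-in for the real fitting call.

```python
import time


def time_on_subsets(fit_fn, docs, sizes):
    """Run fit_fn on growing prefixes of docs.

    Returns a list of (subset_size, elapsed_seconds) pairs. Sizes larger
    than len(docs) are clipped to the full dataset by the slice.
    """
    results = []
    for n in sizes:
        subset = docs[:n]
        start = time.perf_counter()
        fit_fn(subset)
        elapsed = time.perf_counter() - start
        results.append((len(subset), elapsed))
    return results


def fake_fit(docs):
    """Hypothetical stand-in for a real fit step: collects distinct tokens."""
    vocab = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab


if __name__ == "__main__":
    corpus = ["some short text number %d" % i for i in range(20000)]
    for size, seconds in time_on_subsets(fake_fit, corpus, [1000, 5000, 20000]):
        print(size, "docs:", round(seconds, 4), "s")
```

With the sklearn setup above, `fit_fn` could be `tfidf.fit_transform` and `docs` a list of the texts. If the seconds per document stay roughly flat as the subset grows, the full run time can be extrapolated; if they climb, something scales worse than linearly.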

Am I missing something? Or do I just lack patience? Thanks for your help.
