Description
Hi! I came across this package because I have a dataset of ~2 million text sequences (each under 500 characters), and I wanted faster performance than sklearn's TfidfVectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.
Fitting and transforming in sklearn takes about 5 minutes:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True,
                        norm='l2',
                        encoding='latin-1',
                        ngram_range=(1, 2),
                        stop_words=None)
%time X = tfidf.fit_transform(dataset.text)
CPU times: user 4min 27s, sys: 14.1 s, total: 4min 41s
Wall time: 4min 49s
I can see in top that this only uses a single thread.
With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):
dt = fread('dataset.csv.tar.gz')
setkey(dt, id)
prep_fun = tolower
tok_fun = word_tokenizer
my_iterator = itoken_parallel(dt$text,
                              preprocessor = prep_fun,
                              tokenizer = tok_fun,
                              ids = dt$id,
                              progressbar = TRUE)
t10 = Sys.time()
vocab = create_vocabulary(my_iterator, ngram=c(1L, 2L))
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(my_iterator, vectorizer)
# define tfidf model
tfidf = TfIdf$new(norm = 'l2', sublinear_tf = TRUE)
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!
paste('Time to build tfidf:', difftime(Sys.time(), t10, units = 'sec'))
I left it running on an AWS instance. I can see in top that 4 threads are active, but they've been running for much, much longer than 5 minutes, and eventually I had to kill the process. If I work on a smaller subset of a few thousand articles, it works fine (roughly the sketch below).
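For reference, the smaller run was roughly the sketch below; the 10,000-row cutoff is just an arbitrary value for illustration, and otherwise it's the same pipeline as above:

# same pipeline as above, restricted to an arbitrary 10k-row subset
dt_small = dt[1:10000]
it_small = itoken_parallel(dt_small$text,
                           preprocessor = prep_fun,
                           tokenizer = tok_fun,
                           ids = dt_small$id,
                           progressbar = TRUE)
vocab_small = create_vocabulary(it_small, ngram = c(1L, 2L))
dtm_small = create_dtm(it_small, vocab_vectorizer(vocab_small))
dtm_small_tfidf = fit_transform(dtm_small, TfIdf$new(norm = 'l2', sublinear_tf = TRUE))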
Am I missing something? Or do I just lack patience? Thanks for your help.