Multiple Errors Adapting GloVe Example to Project - Quanteda Related #334

Open
@sellociompi

Hello there,

I am having what I believe are multiple issues adapting the GloVe word embeddings tutorial to my project. I am starting with a tokens object created in Quanteda (TOK.Debates.2020.Full.Clean) to create the iterator. However, when I run that first line, I am greeted with this error:

Tokenizer_Debates_2020 = space_tokenizer(TOK.Debates.2020.Full.Clean)

_Warning message: In stringi::stri_split_fixed(strings, pattern = sep, ...) :
  argument is not an atomic vector; coercing_

The tokenizer is created and looks like this:

[Image: screenshot of the tokenizer output]
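For reference, the warning indicates that `space_tokenizer()` was handed something other than a plain character vector. A minimal sketch of one possible alternative follows, assuming the tokens object can be turned into a list with one character vector per document (for example via `as.list()`); `toy_tokens` below is hypothetical stand-in data, not the debates corpus. The list is passed straight to `itoken()`, skipping the tokenizer step:

```r
library(text2vec)

# Hypothetical stand-in for as.list(TOK.Debates.2020.Full.Clean):
# a named list holding one character vector of tokens per document.
toy_tokens <- list(
  doc1 = c("the", "economy", "is", "strong"),
  doc2 = c("the", "economy", "needs", "help")
)

# itoken() accepts a list of character vectors, so the input is
# treated as already tokenized and no tokenizer function is applied.
it <- itoken(toy_tokens, n_chunks = 1L, progressbar = FALSE)

vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

dim(tcm)
```

Whether this resolves the later fitting error is not established here; it only avoids the coercion step that triggers the warning.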

I continue the example with no errors:

Iterator_Debates_2020 = itoken(Tokenizer_Debates_2020)
Vocab_Debates_2020 = create_vocabulary(Iterator_Debates_2020)
Vocab_Debates_2020 = prune_vocabulary(Vocab_Debates_2020, term_count_min = 10L)
Vectorizer_Debates_2020 = vocab_vectorizer(Vocab_Debates_2020)
TCM_Debates_2020 = create_tcm(Iterator_Debates_2020, Vectorizer_Debates_2020, skip_grams_window = 5L)

I check the dimensions of the TCM and see that I have rows and columns:
dim(TCM_Debates_2020)

_[1] 9277 9277_

I start to fit the model, creating the glove environment with no issue, but when I try to do the actual fitting I obtain the following error:

glove = GlobalVectors$new(rank = 50, x_max = 10)
WV_Debates_2020 = glove$fit_transform(TCM_Debates_2020, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

_Error in if (cost/n_nnz > 1) stop("Cost is too big, probably something goes wrong... try smaller learning rate") : 
  missing value where TRUE/FALSE needed_
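For reference, "missing value where TRUE/FALSE needed" is what R reports when the condition of an `if` evaluates to `NA`, which happens if the cost has become `NaN` or `NA`. The sketch below reproduces only that mechanism in base R; `cost` and `n_nnz` are hypothetical stand-in values, not numbers taken from the failing run:

```r
# Hypothetical stand-ins: a cost that has become NaN and a count of
# non-zero entries in the co-occurrence matrix.
cost <- NaN
n_nnz <- 1000

# NaN / n_nnz > 1 evaluates to NA, and if (NA) raises an error
# before the stop("Cost is too big...") branch can be reached.
res <- tryCatch(
  if (cost / n_nnz > 1) "cost too big" else "ok",
  error = function(e) conditionMessage(e)
)
print(res)
```

This suggests the cost was not merely large but undefined, so inspecting the values stored in the TCM (for example with `summary(TCM_Debates_2020@x)`) may be more informative than adjusting the learning rate.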

In order to troubleshoot this error, I have tried to do the following:

  • Lowered the learning rate in the glove environment to 0.001; I still receive the same cost error message
  • Converted the initial tokens object to a text file to mirror the example more closely; I still receive the same coercion warning
  • Used a Quanteda FCM in place of the TCM, but received the following error:

WV_Debates_2020 = glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

_Error in glove$fit_transform(Debates2020.FCM, n_iter = 10, convergence_tol = 0.01,  : 
  all(x@x > 0) is not TRUE_
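For reference, the condition quoted in this message tests the stored values of a sparse matrix, so it fails if any stored entry is zero or negative. The sketch below shows the check on a toy matrix built with the Matrix package and one way to drop explicitly stored zeros with `drop0()`. It is an assumption that the FCM exposes an `x` slot in the same way and that stored zeros are the cause here; `m` is hypothetical data:

```r
library(Matrix)

# Toy sparse matrix with one explicitly stored zero (hypothetical data).
m <- sparseMatrix(i = c(1, 2, 3), j = c(2, 3, 1), x = c(2, 0, 5), dims = c(3, 3))

# The condition from the error message: FALSE while a zero is stored.
all(m@x > 0)

# drop0() removes explicitly stored zeros from the sparse representation.
m_clean <- drop0(m)
all(m_clean@x > 0)
```

Running `summary(Debates2020.FCM@x)` on the real object would show whether it actually holds zero or negative entries before trying this.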

I have been unable to proceed further and obviously one or more of these errors must be the culprit, but I have been unable to find documentation on these errors elsewhere, including past issues catalogued here.

Thank you in advance for any help in taking out this gremlin.
-Sello
