In vectorizer.fit_transform() function, when tf_type="log" we get UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float64') to dtype('int32') with casting rule 'same_kind' #288

@rohetoric

Description

steps to reproduce

  1. Read a text file.

  2. Set the value of the following parameters one by one
    tf_type=["linear", "sqrt", "log", "binary"]
    idf_type = ["standard", "smooth", "bm25"]
    dl_type= ["linear", "sqrt", "log"]
    norm =["l1", "l2"]
    models= ["lsa","lda","nmf"]

  3. Iterate with a nested loop over the values of all 5 parameters and compute doc_term_matrix,
    i.e.
    for t in tf_type:
        for i in idf_type:
            for d in dl_type:
                for n in norm:
                    for mo in models:
                        vectorizer = textacy.vsm.Vectorizer(
                            tf_type=t, apply_idf=True, idf_type=i, dl_type=d,
                            norm=n, min_df=2, max_df=0.95)
                        doc_term_matrix = vectorizer.fit_transform(
                            (doc._.to_terms_list(ngrams=3, entities=True, as_strings=True)
                             for doc in spacy_gram))

  4. When tf_type="log", we receive the error quoted in the title.
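
For reference, the five-level loop in step 3 can also be written flat with itertools.product. This is only a sketch of the parameter grid itself: it does not call textacy, and the names grid and log_combos are made up for illustration.

```python
from itertools import product

# Parameter values copied from step 2 above.
tf_type = ["linear", "sqrt", "log", "binary"]
idf_type = ["standard", "smooth", "bm25"]
dl_type = ["linear", "sqrt", "log"]
norm = ["l1", "l2"]
models = ["lsa", "lda", "nmf"]

# One dict per combination; equivalent to the five nested for-loops.
grid = [
    {"tf_type": t, "idf_type": i, "dl_type": d, "norm": n, "model": mo}
    for t, i, d, n, mo in product(tf_type, idf_type, dl_type, norm, models)
]

# The combinations that hit the error are the ones with tf_type="log".
log_combos = [params for params in grid if params["tf_type"] == "log"]

print(len(grid), len(log_combos))
```

Each dict in grid could then be unpacked into the Vectorizer call from step 3, minus the "model" key.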

expected vs. actual behavior

possible solution?

I saw that vectorizer.fit_transform calls a function _reweight_values(self, doc_term_matrix). When tf_type="log", it runs np.log(doc_term_matrix.data, doc_term_matrix.data, casting="unsafe"). Even though the casting is declared as "unsafe" there, the error is raised on the next line, i.e. doc_term_matrix.data += 1.0. I think it should be written as doc_term_matrix.data = doc_term_matrix.data + 1.0, according to https://stackoverflow.com/questions/38673531/multiply-numpy-int-and-float-arrays-cannot-cast-ufunc-multiply-output-from-dtyp
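
The behavior can be reproduced with plain NumPy, without textacy. The sketch below assumes that doc_term_matrix.data is an int32 array of term counts (the dtype named in the error message); the array values and the names data, counts and fixed are made up for illustration. It also suggests that the in-place np.log with casting="unsafe" truncates the logarithms back to integers, so converting to float before taking the log may be the safer fix.

```python
import numpy as np

# Stand-in for doc_term_matrix.data: integer term counts (assumed int32).
data = np.array([1, 2, 5], dtype=np.int32)

# In-place log with casting="unsafe" runs, but the float results are
# truncated when written back into the int32 buffer.
np.log(data, data, casting="unsafe")

try:
    # The in-place add has to cast a float64 result into int32, which the
    # default 'same_kind' casting rule does not allow.
    data += 1.0
    raised = False
except TypeError as exc:  # UFuncTypeError is a subclass of TypeError
    raised = True
    print(exc)

# Possible workaround: convert to float first, then build a new array.
counts = np.array([1, 2, 5], dtype=np.int32)
fixed = np.log(counts.astype(np.float64)) + 1.0
print(fixed.dtype)
```

The out-of-place form proposed above (data = data + 1.0) avoids the exception as well, but by that point the log values have already been truncated.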

context

I am trying to get clusters with similar intent from my dataset, and for that I need the document-term matrix. I am using a brute-force search: tweaking the parameters of the Vectorizer in a loop to find the combination that gives the best silhouette score for the clusters.

environment

Receiving a TypeError here in print_markdown(items), i.e. TypeError: s must be (<class 'str'>, <class 'bytes'>), not <class 'list'>, inside the to_unicode(s, encoding, errors) function.

  • operating system: Ubuntu 18.04
  • python version: Python 3.7.4
  • spacy version: 2.2.3
  • installed spacy models: en_core_web_sm, en_core_web_md
  • textacy version: 0.9.1
