steps to reproduce
- Read a text file.
- Set the value of the following parameters one by one:
  tf_type = ["linear", "sqrt", "log", "binary"]
  idf_type = ["standard", "smooth", "bm25"]
  dl_type = ["linear", "sqrt", "log"]
  norm = ["l1", "l2"]
  models = ["lsa", "lda", "nmf"]
- Iterate with a nested loop over the values of all 5 parameters and compute doc_term_matrix, i.e.

    for t in tf_type:
        for i in idf_type:
            for d in dl_type:
                for n in norm:
                    for mo in models:
                        vectorizer = textacy.vsm.Vectorizer(
                            tf_type=t, apply_idf=True, idf_type=i,
                            dl_type=d, norm=n, min_df=2, max_df=0.95)
                        doc_term_matrix = vectorizer.fit_transform(
                            (doc._.to_terms_list(ngrams=3, entities=True, as_strings=True)
                             for doc in spacy_gram))

- When tf_type="log", we receive the above error.
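For reference, the same parameter sweep can be enumerated without five levels of nesting. This is only a minimal sketch: `param_grid` and `iter_param_combinations` are hypothetical names introduced here, and the textacy call is left out so the snippet runs on its own. It also makes it easy to isolate the combinations with tf_type="log", which are the ones that fail.

```python
from itertools import product

# Parameter values copied from the steps above.
param_grid = {
    "tf_type": ["linear", "sqrt", "log", "binary"],
    "idf_type": ["standard", "smooth", "bm25"],
    "dl_type": ["linear", "sqrt", "log"],
    "norm": ["l1", "l2"],
    "model": ["lsa", "lda", "nmf"],
}


def iter_param_combinations(grid):
    """Yield one dict per combination, in the same order as the nested loops."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(iter_param_combinations(param_grid))

# Only the combinations with tf_type="log" trigger the reported error.
log_combos = [combo for combo in combos if combo["tf_type"] == "log"]

print(len(combos), len(log_combos))
```

Each yielded dict could then be passed on to the vectorizer inside a try/except, so that one failing combination does not abort the whole sweep.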
expected vs. actual behavior
possible solution?
I saw that inside vectorizer.fit_transform there is a call to the function _reweight_values(self, doc_term_matrix). When tf_type="log", it runs np.log(doc_term_matrix.data, doc_term_matrix.data, casting="unsafe"). Even though the casting has been declared as "unsafe" there, the error is raised on the next line, i.e. doc_term_matrix.data += 1.0. I think it should instead be assigned as doc_term_matrix.data = doc_term_matrix.data + 1.0, according to https://stackoverflow.com/questions/38673531/multiply-numpy-int-and-float-arrays-cannot-cast-ufunc-multiply-output-from-dtyp
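The behavior described above can be reproduced with plain NumPy, independent of textacy. The snippet below is a minimal sketch, not textacy's actual code: the two helper functions are hypothetical and only mimic the two statements quoted above on an integer array, which is my assumption about the dtype of doc_term_matrix.data at that point.

```python
import numpy as np


def log_tf_in_place(counts):
    """Mimic the two in-place statements quoted above (hypothetical helper)."""
    data = counts.copy()
    # Allowed: casting="unsafe" lets the float result be written back as ints.
    np.log(data, data, casting="unsafe")
    # In-place add falls back to the default 'same_kind' casting rule,
    # so float64 -> int64 is rejected here.
    data += 1.0
    return data


def log_tf_out_of_place(counts):
    """Convert to float first, then build a new array instead of adding in place."""
    data = counts.astype(np.float64)
    return np.log(data) + 1.0


counts = np.array([1, 2, 3], dtype=np.int64)

try:
    log_tf_in_place(counts)
    in_place_error = None
except TypeError as exc:  # NumPy's casting error is a TypeError subclass
    in_place_error = exc

weights = log_tf_out_of_place(counts)
print(type(in_place_error).__name__, weights.dtype)
```

Note that the unsafe cast in the in-place version also truncates the log values to integers before the addition, so converting to float first may be preferable to only changing the += line.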
context
I am trying to find clusters with similar intent in my dataset, and for that I need the document-term matrix. I am using a brute-force approach: tweaking the parameters of the vectorizer in a loop to see which combination gives the best silhouette score for the clusters.
environment
Receiving a TypeError here in print_markdown(items), i.e. TypeError: s must be (<class 'str'>, <class 'bytes'>), not <class 'list'>, inside the to_unicode(s, encoding, errors) function.
- operating system: Ubuntu 18.04
- python version: Python 3.7.4
- spacy version: 2.2.3
- installed spacy models: en_core_web_sm, en_core_web_md
- textacy version: 0.9.1