Description
Intro
I am getting TypeError: can not serialize 'BaseTextRank' object when trying to use spaCy's multiprocessing in nlp.pipe with a textrank pipeline component.
Sorry if this is a known/expected feature/limitation; I couldn't find anything by searching the repo. I generally find spaCy's multiprocessing a bit temperamental anyhow, but this seems to just not work.
PS. thanks for all the great work on the package!
Environment
Ubuntu 18.X (AWS DL AMI), Python 3.8 (via conda/mamba), pytextrank installed via pip, through conda. Do let me know if you need more info.
Reproducible example (hopefully)
import spacy
import pytextrank
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True)
txt = """
The Old Testament of the King James Bible
The First Book of Moses: Called Genesis
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.
1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.
...
"""
data = []
for i in range(50):
    data.append((txt, {"doc_id": i}))
keys = []
# NOTE: n_process=-1 raises the error but then hangs; works with n_process=1
for doc, context in nlp.pipe(data, as_tuples=True, n_process=-1):
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    keys.append(out)
# pd.DataFrame(keys).head()
keys
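In case it helps narrow things down, here is a rough diagnostic sketch. It assumes the error comes from something that cannot be serialized when it crosses a process boundary, which I have not confirmed. The helper `find_unpicklable` and the stand-in class `HoldsLambda` are hypothetical names made up for illustration, and the check uses the standard-library pickle module only, which may not be the same serialization path spaCy uses internally.

```python
import pickle


def find_unpicklable(named_objects):
    """Return the names of objects for which pickle.dumps raises."""
    failures = []
    for name, obj in named_objects:
        try:
            pickle.dumps(obj)
        except Exception:  # the exception type varies across Python versions
            failures.append(name)
    return failures


class HoldsLambda:
    """Hypothetical stand-in for a component holding an unpicklable attribute."""

    def __init__(self):
        self.scorer = lambda x: x  # lambdas cannot be pickled


if __name__ == "__main__":
    candidates = [("plain_dict", {"a": 1}), ("holds_lambda", HoldsLambda())]
    print(find_unpicklable(candidates))
```

With the pipeline above, the same helper could be called as `find_unpicklable(nlp.pipeline)`, since `nlp.pipeline` is a list of `(name, component)` tuples. A clean result would not rule out the error, because the failing object might instead be attached to the Doc rather than to a component.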