
Error when trying to use nlp.pipe with n_process > 1 #179

Open

Description

@DayalStrub

Intro

I am getting TypeError: can not serialize 'BaseTextRank' object when trying to use spaCy's multiprocessing in nlp.pipe with a textrank pipeline component.

Sorry if this is a known/expected limitation - I couldn't find anything by searching the repo. I generally find (spaCy's) multiprocessing a bit temperamental anyhow, but this seems to just not work.

PS. thanks for all the great work on the package!

Environment

Ubuntu 18.X (AWS DL AMI), Python 3.8 (via conda/mamba), pytextrank installed via pip through conda - do let me know if you need more info.

Reproducible example - hopefully

import spacy
import pytextrank  # registers the "textrank" pipeline factory

import en_core_web_sm

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);

txt = """
The Old Testament of the King James Bible
The First Book of Moses:  Called Genesis
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.
1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.
...
"""

data = []
for i in range(50):
    data.append((txt, {"doc_id": i}))

keys = []

for doc, context in nlp.pipe(data, as_tuples=True, n_process=-1):  # NOTE: raises the TypeError above and then hangs; works with n_process=1
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    keys.append(out)
# pd.DataFrame(keys).head()

keys
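
For what it's worth, one general workaround for pipeline components that can't be pickled (a rough sketch of the usual pattern, not an official fix, and not specific to pytextrank) is to skip spaCy's n_process entirely and load the pipeline once per worker process, so the component never has to cross the process boundary. The helper names below (init_worker, extract_keyphrases) are just illustrative.

import multiprocessing as mp

import spacy
import pytextrank  # registers the "textrank" pipeline factory

nlp = None  # one pipeline per worker process


def init_worker():
    # Load the full pipeline (including textrank) inside each worker,
    # so nothing needs to be serialized across processes.
    global nlp
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank", last=True)


def extract_keyphrases(item):
    text, context = item
    doc = nlp(text)
    return {
        "doc_id": context["doc_id"],
        "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases],
    }


if __name__ == "__main__":
    txt = "1:1 In the beginning God created the heaven and the earth."  # or the longer passage above
    data = [(txt, {"doc_id": i}) for i in range(50)]
    with mp.Pool(initializer=init_worker) as pool:
        keys = pool.map(extract_keyphrases, data)

This trades the convenience of nlp.pipe batching for plain multiprocessing.Pool, but it sidesteps the serialization of the component altogether.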
