Description
For the following code:
```python
import classla

nlp = classla.Pipeline(lang='sl', processors="tokenize,ner,pos,lemma")
nlp(["Geografsko leži Slovenija v srednji Evropi oziroma v jugovzhodni Evropi",
     "V severozahodnem delu države prevladujejo Alpe"])
```
calling the pipeline raises an error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 168, in __call__
    doc = self.process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 162, in process
    doc = self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/tokenize_processor.py", line 74, in process
    assert isinstance(document, str) or (self.config.get('pretokenized') or self.config.get('no_ssplit', False)), \
AssertionError: If neither 'pretokenized' or 'no_ssplit' option is enabled, the input to the TokenizerProcessor must be a string.
```
It would be nice if Pipeline supported the use case above and accepted a list of documents as input. I believe there is already partial support for batched inputs in classla (some processors can process a batch of documents in parallel), but because of the AssertionError above we are often forced to run our data one sentence at a time, as in the workaround sketched below.
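For reference, this is roughly the workaround we use today, feeding the documents to the pipeline one at a time (variable names are just illustrative):

```python
import classla

nlp = classla.Pipeline(lang='sl', processors="tokenize,ner,pos,lemma")

documents = [
    "Geografsko leži Slovenija v srednji Evropi oziroma v jugovzhodni Evropi",
    "V severozahodnem delu države prevladujejo Alpe",
]

# Workaround: run each document through the pipeline separately.
# This avoids the AssertionError, but gives up any batching benefit
# that later processors (tagger, lemmatizer, NER) could offer.
results = [nlp(text) for text in documents]
```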
Please note that even if TokenizerProcessor cannot be made to process documents in parallel, it could still iterate over them in a for-loop. Although that would not speed up the tokenization step itself, a later processor in the pipeline could benefit from the batched input (e.g. by running on a GPU) and do its part much faster, speeding up the whole pipeline.
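A rough sketch of the kind of loop we have in mind, assuming a hypothetical `process_single()` helper that wraps the existing single-string path (these names are illustrative, not the actual classla internals):

```python
# Hypothetical sketch of list support in the tokenize processor.
# process_single() stands in for the existing single-string code path;
# it is an illustrative name, not part of the real classla API.
def process(self, document):
    if isinstance(document, str):
        return self.process_single(document)
    # Tokenize each document sequentially in a plain for-loop; this does
    # not speed up tokenization itself, but it lets downstream processors
    # receive the whole batch at once and exploit GPU batching.
    return [self.process_single(text) for text in document]
```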