Description
For the following code:
```python
import classla

nlp = classla.Pipeline(lang='sl', processors="tokenize,ner,pos,lemma")
nlp(["Geografsko leži Slovenija v srednji Evropi oziroma v jugovzhodni Evropi",
     "V severozahodnem delu države prevladujejo Alpe"])
```
calling the pipeline raises an error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 168, in __call__
    doc = self.process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 162, in process
    doc = self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/tokenize_processor.py", line 74, in process
    assert isinstance(document, str) or (self.config.get('pretokenized') or self.config.get('no_ssplit', False)), \
AssertionError: If neither 'pretokenized' or 'no_ssplit' option is enabled, the input to the TokenizerProcessor must be a string.
```
It would be nice if Pipeline supported the use case above and accepted a list of documents as input. I believe there is already partial support for batched inputs in classla (some processors can process a batch of documents in parallel), but because of the AssertionError above we are often forced to run our data one sentence at a time, as in the workaround sketched below.
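For reference, this is roughly the workaround we use today, feeding the documents to the pipeline one at a time (variable names are just illustrative):

```python
import classla

nlp = classla.Pipeline(lang='sl', processors="tokenize,ner,pos,lemma")

documents = [
    "Geografsko leži Slovenija v srednji Evropi oziroma v jugovzhodni Evropi",
    "V severozahodnem delu države prevladujejo Alpe",
]

# Workaround: run each document through the pipeline separately.
# This avoids the AssertionError, but gives up any batching benefit
# that later processors (tagger, lemmatizer, NER) could offer.
results = [nlp(text) for text in documents]
```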
Please note that even if TokenizerProcessor cannot be made to process documents in parallel, it could still iterate over them in a for-loop. Although that would not speed up the tokenization step itself, a later processor in the pipeline could benefit from the batched input (e.g. by running on a GPU) and do its part much faster, speeding up the whole pipeline.
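A rough sketch of the kind of loop we have in mind, assuming a hypothetical `process_single()` helper that wraps the existing single-string path (these names are illustrative, not the actual classla internals):

```python
# Hypothetical sketch of list support in the tokenize processor.
# process_single() stands in for the existing single-string code path;
# it is an illustrative name, not part of the real classla API.
def process(self, document):
    if isinstance(document, str):
        return self.process_single(document)
    # Tokenize each document sequentially in a plain for-loop; this does
    # not speed up tokenization itself, but it lets downstream processors
    # receive the whole batch at once and exploit GPU batching.
    return [self.process_single(text) for text in document]
```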