Please add support for a list of documents as input to all Pipeline processors #47

@sharpsy

For the following code:

import classla
nlp = classla.Pipeline(lang='sl', processors="tokenize,ner,pos,lemma")
nlp(["Geografsko leži Slovenija v srednji Evropi oziroma v jugovzhodni Evropi",
       "V severozahodnem delu države prevladujejo Alpe"])

classla.Pipeline raises an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 168, in __call__
    doc = self.process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/core.py", line 162, in process
    doc = self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.10/site-packages/classla/pipeline/tokenize_processor.py", line 74, in process
    assert isinstance(document, str) or (self.config.get('pretokenized') or self.config.get('no_ssplit', False)), \
AssertionError: If neither 'pretokenized' or 'no_ssplit' option is enabled, the input to the TokenizerProcessor must be a string.

It would be nice if Pipeline supported the above use case and accepted a list of documents as input. I believe classla already has partial support for batched input (some processors can process a batch of documents in parallel), but because of the AssertionError above we are often forced to run our data one sentence at a time.

Please note that even if TokenizerProcessor cannot be implemented to process documents in parallel, it can still run over all documents in a for-loop. Although this would not speed up the tokenization step, a later processor in the pipeline can benefit from batched input (e.g. by running on a GPU) and do its part much faster, speeding up the whole Pipeline.
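
For reference, the one-document-at-a-time workaround mentioned above can be sketched roughly as follows. This is only an illustration on the caller's side, not a proposal for how classla should implement it: run_on_documents is a hypothetical helper name, and the only thing it assumes about nlp is that it can be called with a single string, as the Pipeline in the report can.

```python
from typing import Callable, List, Sequence, TypeVar, Union

Doc = TypeVar("Doc")


def run_on_documents(
    nlp: Callable[[str], Doc],
    texts: Union[str, Sequence[str]],
) -> List[Doc]:
    """Hypothetical helper (not part of classla).

    Calls a pipeline that only accepts a single string once per input text
    and collects the results in input order.
    """
    # Accept a bare string too, so callers do not need to special-case it.
    if isinstance(texts, str):
        texts = [texts]
    # Plain for-loop over the documents: no batching happens here, every
    # processor in the pipeline still sees one document per call.
    return [nlp(text) for text in texts]
```

With the pipeline from the report this would be called as run_on_documents(nlp, ["Geografsko leži ...", "V severozahodnem delu ..."]). It avoids the AssertionError, but since every processor still receives one document per call, it gives none of the batching benefit requested here.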

Labels: enhancement