All preprocessing functions to receive as input TokenSeries #145

@jbesomi

The aim of this issue is to discuss and understand when tokenize should happen in the pipeline.

The current solution is to apply tokenize once the text has already been cleaned, either with clean or with a custom pipeline. In general, in the cleaning phase, we also remove the punctuation symbols.

The problem with this approach is that, especially for non-Western languages (#18 and #128), the tokenization operation might actually need the punctuation to execute correctly.

The natural question is: wouldn't it be better to have tokenize as the very first operation?

In this scenario, all preprocessing functions would receive a TokenSeries as input. As we care about performance, one question is whether we can develop a remove_punctuation that is efficient enough on a TokenSeries. The current version of the tokenize function is quite efficient because it makes use of regex. The first task would be to develop the new variant and benchmark it against the current one. An advantage of the non-regex approach is that, since the input is a list of lists, we might be able to parallelize the work.
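To make the idea concrete, here is a minimal sketch of a tokenize-first pipeline. It is not texthero's implementation: the `tokenize` and `remove_punctuation` helpers below are hypothetical, and a plain list of token lists stands in for a TokenSeries (assumed here to be a pandas Series whose cells are lists of strings).

```python
import re
import string

# Hypothetical stand-in for a TokenSeries: a plain list of token lists.

# Match either a run of word characters or a single non-space symbol,
# so punctuation survives tokenization as its own token.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
_PUNCTUATION = frozenset(string.punctuation)


def tokenize(texts):
    """Tokenize raw strings without cleaning them first (hypothetical helper)."""
    return [_TOKEN_PATTERN.findall(text) for text in texts]


def remove_punctuation(token_lists):
    """Drop tokens that are a single ASCII punctuation symbol (hypothetical helper)."""
    return [
        [tok for tok in tokens if tok not in _PUNCTUATION]
        for tokens in token_lists
    ]


texts = ["Hello, world!", "Tokenize first; clean later."]
tokens = tokenize(texts)
cleaned = remove_punctuation(tokens)

print(tokens)   # punctuation is still available to later steps
print(cleaned)  # punctuation tokens filtered out, no regex on the token level
```

The filter is a set lookup per token, so each inner list can be processed independently, which is what would make chunked or parallel execution possible.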

Could we move tokenize to the very first step while keeping performance high? Which solution offers the fastest performance?
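A rough way to start answering this is a small `timeit` comparison of the two orderings. The sketch below uses only the standard library; both functions and the synthetic corpus are assumptions made for illustration, not texthero code, and the numbers will depend on the machine and on the real data.

```python
import re
import string
import timeit

_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")
_PUNCT = frozenset(string.punctuation)


def clean_then_tokenize(texts):
    # Current order: strip punctuation from the raw string, then split on whitespace.
    return [_PUNCT_RE.sub(" ", text).split() for text in texts]


def tokenize_then_clean(texts):
    # Proposed order: tokenize first, then filter punctuation tokens.
    return [
        [tok for tok in _TOKEN_RE.findall(text) if tok not in _PUNCT]
        for text in texts
    ]


# Synthetic corpus (assumption); replace with real documents for a fair comparison.
corpus = ["Hello, world! This is a test; nothing more."] * 10_000

# The two orderings agree on this sample, but not on every input
# (for example, "a_b" is split by the first and kept whole by the second).
assert clean_then_tokenize(corpus[:1]) == tokenize_then_clean(corpus[:1])

for fn in (clean_then_tokenize, tokenize_then_clean):
    # Take the best of three runs to reduce noise.
    seconds = min(timeit.repeat(lambda: fn(corpus), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f}s")
```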

The other question is: is there a scenario where preprocessing functions should deal with TextSeries rather than TokenSeries?


Extra crunch:

The current tokenize version uses a very naive regex-based approach that works only for Western languages. Its main advantage is that it's quite fast compared to NLTK or other solutions. An alternative we should seriously consider is to replace the regex version with the spaCy tokenizer (#131). The question is: how can we tokenize with spaCy efficiently?
