Support for sentence splitting #8

@xhluca

Description

Right now TranslationModel.translate will translate each input string as-is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The currently recommended workaround is to split sentences with nltk:

import nltk
import dl_translate as dlt

nltk.download("punkt")

model = dlt.TranslationModel()

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH here
" ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

This works well but doesn't cover all possible languages. It would be interesting to train the punkt model on each of the languages made available (though we'd need a very large dataset for that). Once that's done, the snippet above could be replaced by a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also accept an integer n between 0 and 512 representing the maximum token length. Moreover, rather than truncating at that length, we could break the input text down into sequences of length n or less, aggregating whole sentences together; see the sketch below.
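For the integer variant, here's a minimal sketch of that chunking logic. It assumes a Hugging Face-style tokenizer whose .tokenize() returns a list of subword tokens; the chunk_sentences helper is hypothetical, not part of the current API:

import nltk

def chunk_sentences(text, tokenizer, n, language="english"):
    # Greedily pack consecutive sentences into chunks of at most n tokens.
    # Hypothetical helper; `tokenizer` is assumed to be a Hugging Face-style
    # tokenizer with a .tokenize() method used only to count subword tokens.
    chunks, current, current_len = [], [], 0
    for sent in nltk.tokenize.sent_tokenize(text, language):
        sent_len = len(tokenizer.tokenize(sent))
        # Close the current chunk if adding this sentence would exceed n.
        if current and current_len + sent_len > n:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += sent_len
    if current:
        chunks.append(" ".join(current))
    return chunks

translate could then call something like this internally when max_length is an integer, translate each chunk, and join the results. Note that a single sentence longer than n tokens would still end up in its own oversized chunk, so it would need truncation or further splitting.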
