Train a language identifier model that works well on ingredient lists

### Problem

We currently use [fasttext for language identification](https://fasttext.cc/docs/en/language-identification.html).
It is mainly used to detect the language of an ingredient list, whether the list was extracted automatically by an ML model or entered by a contributor.

However, fasttext was trained on data that is quite different from ingredient lists ([Wikipedia](https://www.wikipedia.org/), [Tatoeba](https://tatoeba.org/eng/) and [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/)).

Sometimes the model fails on obvious cases, such as this French ingredient list:

```
text: fraise (12%), framboise (10%)

predictions:
en, confidence=0.4291181
it, confidence=0.13040087
fr, confidence=0.0435654
ro, confidence=0.026255628
no, confidence=0.019594753
de, confidence=0.017750196
es, confidence=0.01671417
tr, confidence=0.015862297
sco, confidence=0.01577331
ms, confidence=0.015433003
```
This failure mode occurs mostly on short ingredient lists.
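
A minimal sketch of how such a ranking could be produced with the `fasttext` Python package. The helper names `top_languages` and `load_lid_model` are hypothetical, and the model path `lid.176.bin` is an assumption (one of the pretrained files offered on the fasttext language identification page, downloaded locally); the issue does not say which model file was used.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def load_lid_model(path: str = "lid.176.bin"):
    """Load a pretrained language identification model (path is an assumption)."""
    import fasttext  # imported lazily so the helper below works without it

    return fasttext.load_model(path)


def top_languages(model, text: str, k: int = 10) -> List[Tuple[str, float]]:
    """Return the k most likely (language, confidence) pairs for `text`."""
    # fasttext's predict() handles one line at a time, so flatten newlines first.
    labels, scores = model.predict(text.replace("\n", " "), k=k)
    return [
        # Labels come back as "__label__fr"; keep only the language code.
        (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label,
         float(score))
        for label, score in zip(labels, scores)
    ]
```

With the model file available, `top_languages(load_lid_model(), "fraise (12%), framboise (10%)")` should give a ranking comparable to the one above.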

We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).
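
As a starting point, here is a rough sketch of how Open Food Facts products might be turned into training lines in the `__label__<lang> <text>` format that fasttext's supervised mode expects. It assumes each product is a dict carrying per-language fields named `ingredients_text_<lang>` and that the language suffix is a trustworthy label, which will not always hold for contributor-entered data. The function name `to_fasttext_lines` is hypothetical.

```python
import re
from typing import Dict, Iterator

# Assumption: per-language ingredient fields look like "ingredients_text_fr".
INGREDIENTS_FIELD = re.compile(r"^ingredients_text_([a-z]{2,3})$")


def to_fasttext_lines(product: Dict[str, object]) -> Iterator[str]:
    """Yield one '__label__<lang> <text>' line per non-empty ingredient list."""
    for key, value in sorted(product.items()):
        match = INGREDIENTS_FIELD.match(key)
        if match is None or not isinstance(value, str):
            continue
        # Collapse newlines and repeated spaces: one training example per line.
        text = " ".join(value.split())
        if text:
            yield f"__label__{match.group(1)} {text}"
```

The resulting lines could be written to a text file and passed to `fasttext.train_supervised(input=...)`, or reused as labelled examples for a PyTorch or TensorFlow model.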

### Requirements

Using fasttext is not a requirement. We can either train a new fasttext model, or train a model with PyTorch or TensorFlow and export it to ONNX format.

Train a language identifier model that works well on ingredient lists #349