Skip to content

Train a language identifier model that works well on ingredient lists #349

@raphael0202

Description

@raphael0202

Problem

We're currently using fasttext for language identification.
This is useful especially to detect the language of an ingredient list extracted automatically using a ML model, or added by a contributor.

However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).

Sometimes the model fails for obvious cases, such as this one (french ingredient list):

text: fraise (12%), framboise (10%)

predictions:
en, confidence=0.4291181
it, confidence=0.13040087
fr, confidence=0.0435654
ro, confidence=0.026255628
no, confidence=0.019594753
de, confidence=0.017750196
es, confidence=0.01671417
tr, confidence=0.015862297
sco, confidence=0.01577331
ms, confidence=0.015433003

This behaviour is mostly present for short ingredient lists.

We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).

Requirements

Using fasttext is not a requirement. We can either train a new fasttext model, or train it with pytorch/tensorflow and export it to ONNX format.

Metadata

Metadata

Assignees

Projects

Status

To triage

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions