Problem
We're currently using fasttext for language identification.
This is especially useful for detecting the language of an ingredient list that was extracted automatically by an ML model or added by a contributor.
However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).
The model sometimes fails on obvious cases, such as this French ingredient list:
text: fraise (12%), framboise (10%)
predictions:
en, confidence=0.4291181
it, confidence=0.13040087
fr, confidence=0.0435654
ro, confidence=0.026255628
no, confidence=0.019594753
de, confidence=0.017750196
es, confidence=0.01671417
tr, confidence=0.015862297
sco, confidence=0.01577331
ms, confidence=0.015433003
This behaviour occurs mostly with short ingredient lists.
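For anyone who wants to check other short lists, a ranked output like the one shown earlier could be produced with a small helper along these lines. This is only a sketch: `top_k_languages` is a hypothetical name, and it assumes a model object exposing fastText's `predict(text, k=...)` method, which returns labels prefixed with `__label__` together with their probabilities.

```python
from typing import List, Protocol, Sequence, Tuple

LABEL_PREFIX = "__label__"


class SupportsPredict(Protocol):
    """Anything exposing a fastText-style `predict(text, k=...)` method."""

    def predict(self, text: str, k: int = 1) -> Tuple[Sequence[str], Sequence[float]]: ...


def top_k_languages(model: SupportsPredict, text: str, k: int = 10) -> List[Tuple[str, float]]:
    """Return up to k (language code, confidence) pairs in the order the model gives them."""
    # fastText's predict() rejects text containing newlines, so collapse them first.
    cleaned = text.replace("\n", " ").strip()
    labels, probs = model.predict(cleaned, k=k)
    results = []
    for label, prob in zip(labels, probs):
        # Labels come back as "__label__fr"; keep only the language code.
        lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        results.append((lang, float(prob)))
    return results


# Usage (assumes the `fasttext` package is installed; the model path is hypothetical):
#   import fasttext
#   model = fasttext.load_model("path/to/language-id-model.bin")
#   for lang, conf in top_k_languages(model, "fraise (12%), framboise (10%)"):
#       print(f"{lang}, confidence={conf}")
```

Running this over a sample of short ingredient lists with known languages would give a rough measure of how often the correct language is missing from the top prediction.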
We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).
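As a starting point for that exploration, here is a minimal sketch of turning (language, ingredient list) pairs into fastText's supervised training format, where each line is `__label__<lang>` followed by the text. The function name `to_fasttext_lines`, the lowercasing step, and the `min_chars` filter are illustrative assumptions, not decisions made in this issue; how the pairs are pulled from Open Food Facts data is left out.

```python
import re
from typing import Iterable, Iterator, Tuple


def to_fasttext_lines(samples: Iterable[Tuple[str, str]], min_chars: int = 3) -> Iterator[str]:
    """Yield one `__label__<lang> <text>` line per usable (lang, ingredients) pair."""
    for lang, text in samples:
        # Collapse all whitespace, including newlines: fastText reads one example per line.
        # Lowercasing is an assumed normalization step, not a requirement.
        cleaned = re.sub(r"\s+", " ", text).strip().lower()
        # Skip pairs with no language tag or with (almost) empty text.
        if lang and len(cleaned) >= min_chars:
            yield f"__label__{lang} {cleaned}"


if __name__ == "__main__":
    # Hypothetical samples; real ones would come from Open Food Facts ingredient lists.
    samples = [
        ("fr", "Fraise (12%),\n framboise (10%)"),
        ("en", "sugar, cocoa butter, milk powder"),
    ]
    for line in to_fasttext_lines(samples):
        print(line)
```

If the fastText route is chosen, the lines written to a file could then be passed to `fasttext.train_supervised(input=...)`; the same pairs would also work as input for a PyTorch or TensorFlow pipeline.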
Requirements
Using fasttext is not a requirement. We can either train a new fasttext model, or train a model with PyTorch or TensorFlow and export it to ONNX format.