This project implements a multimodal neural network for predicting the total calorie content of dishes from both images and ingredient lists.
The model combines:
- DistilBERT as a text encoder for ingredient descriptions.
- ConvNeXt-Tiny as an image encoder (pretrained on ImageNet).
- A fusion MLP regressor that integrates text, image, and dish mass into a single prediction.
The dataset consists of:
ingredients.csv: ingredient IDs and their names.dish.csv: dish IDs, ingredient lists, dish mass, total calories, and train/test split labels.images/: photo of each dish (dish_id/rgb.png).
The target variable is the total calories per dish.
data/ # test, train, val, dish, ingredients
models/ # saved best weights
src/ # Python scripts
solution.ipynb # Jupyter notebook with learning and evaluation
| Epoch | Train Loss | Train MAE | Val Loss | Val MAE | R² |
|---|---|---|---|---|---|
| 1 | 52487.58 | 166.85 | 22876.51 | 107.38 | 0.512 |
| 2 | 17341.38 | 92.02 | 14321.95 | 79.56 | 0.695 |
| 3 | 11470.07 | 74.06 | 10737.25 | 70.32 | 0.771 |
| 4 | 9188.95 | 65.79 | 9822.49 | 64.71 | 0.791 |
| 5 | 7020.94 | 57.87 | 8103.12 | 60.30 | 0.827 |
| 7 | 5832.44 | 53.39 | 7245.34 | 55.27 | 0.845 |
| 9 | 3836.40 | 43.75 | 6097.49 | 50.52 | 0.870 |
| 12 | 3205.79 | 40.24 | 5385.90 | 47.52 | 0.885 |
| 14 | 2893.54 | 37.94 | 5271.95 | 46.76 | 0.888 |
| 15 | 2615.69 | 35.78 | 5568.53 | 48.29 | 0.881 |
| Metric | Score |
|---|---|
| Test MAE | 56.90 |
| Test RMSE | 82.93 |
| Test R² | 0.847 |
Evaluation performed on 507 test samples after 15 training epochs.
The multimodal model successfully learned to integrate visual information (images), semantic information (ingredients), and numerical features (dish mass) to predict dish calories.
Key takeaways:
- The model achieved MAE < 60 kcal on the test set, reaching 47.5 kcal at validation set.
- Final test MAE was 56.9 kcal, with a strong R² of 0.85, showing good generalization.
- Adding dish mass as a feature significantly improved performance, confirming its strong correlation with calorie content.
- The experiment demonstrates the effectiveness of combining pretrained transformers and CNNs for real-world multimodal regression tasks.
