Calorie Prediction (Multimodal Model)

This project implements a multimodal neural network for predicting the total calorie content of dishes from both images and ingredient lists.

The model combines:

DistilBERT as a text encoder for ingredient descriptions.
ConvNeXt-Tiny as an image encoder (pretrained on ImageNet).
A fusion MLP regressor that integrates text, image, and dish mass into a single prediction.
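
The fusion step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `FusionRegressor`, the hidden size, the dropout rate, and the choice of plain concatenation are all assumptions. The 768-dimensional inputs match the standard output widths of DistilBERT-base and ConvNeXt-Tiny, and random tensors stand in for the encoder outputs so the example runs without downloading weights.

```python
import torch
from torch import nn


class FusionRegressor(nn.Module):
    """Hypothetical fusion head: concatenates text, image, and mass features."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=256, dropout=0.1):
        super().__init__()
        # +1 input for the scalar dish mass appended to the two embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_emb, image_emb, mass):
        # text_emb: (B, text_dim), image_emb: (B, image_dim), mass: (B,)
        fused = torch.cat([text_emb, image_emb, mass.unsqueeze(1)], dim=1)
        return self.mlp(fused).squeeze(1)  # (B,) predicted calories


# Random tensors stand in for the DistilBERT and ConvNeXt-Tiny outputs.
torch.manual_seed(0)
model = FusionRegressor().eval()
text_emb = torch.randn(4, 768)
image_emb = torch.randn(4, 768)
mass = torch.tensor([250.0, 120.0, 410.0, 90.0])  # made-up dish masses

with torch.no_grad():
    pred = model(text_emb, image_emb, mass)
print(pred.shape)  # torch.Size([4])
```

In practice the raw mass would likely be normalized before concatenation so that its scale is comparable to the embedding values; whether the project does this is not stated.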

Dataset

The dataset consists of:

ingredients.csv: ingredient IDs and their names.
dish.csv: dish IDs, ingredient lists, dish mass, total calories, and train/test split labels.
images/: one photo per dish, stored as images/<dish_id>/rgb.png.

The target variable is the total calories per dish.
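
One way the two CSV files and the image folder could be joined into training samples is sketched below. The column names (`id`, `name`, `ingredient_ids`, `mass`, `calories`, `split`), the semicolon separator, and the toy rows are all invented for illustration; the real files may be laid out differently.

```python
import csv
import io

# Toy stand-ins for ingredients.csv and dish.csv.
# Column names and rows are hypothetical, not taken from the real dataset.
INGREDIENTS_CSV = """id,name
1,rice
2,chicken
3,olive oil
"""

DISH_CSV = """dish_id,ingredient_ids,mass,calories,split
dish_a,1;2,300,420,train
dish_b,2;3,150,310,test
"""


def load_samples(ingredients_file, dish_file):
    """Resolve ingredient IDs to names and attach the image path per dish."""
    id_to_name = {row["id"]: row["name"] for row in csv.DictReader(ingredients_file)}
    samples = []
    for row in csv.DictReader(dish_file):
        names = [id_to_name[i] for i in row["ingredient_ids"].split(";")]
        samples.append({
            "image_path": f"images/{row['dish_id']}/rgb.png",
            "text": ", ".join(names),          # input for the text encoder
            "mass": float(row["mass"]),        # numeric feature
            "calories": float(row["calories"]),  # regression target
            "split": row["split"],
        })
    return samples


samples = load_samples(io.StringIO(INGREDIENTS_CSV), io.StringIO(DISH_CSV))
train = [s for s in samples if s["split"] == "train"]
print(train[0]["text"])  # rice, chicken
```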

Project structure

data/                # test, train, val, dish, ingredients
models/              # saved best weights
src/                 # Python scripts
solution.ipynb       # Jupyter notebook with training and evaluation

Training results

Epoch	Train Loss	Train MAE	Val Loss	Val MAE	R²
1	52487.58	166.85	22876.51	107.38	0.512
2	17341.38	92.02	14321.95	79.56	0.695
3	11470.07	74.06	10737.25	70.32	0.771
4	9188.95	65.79	9822.49	64.71	0.791
5	7020.94	57.87	8103.12	60.30	0.827
7	5832.44	53.39	7245.34	55.27	0.845
9	3836.40	43.75	6097.49	50.52	0.870
12	3205.79	40.24	5385.90	47.52	0.885
14	2893.54	37.94	5271.95	46.76	0.888
15	2615.69	35.78	5568.53	48.29	0.881
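
The README does not state how the "best weights" are chosen. Assuming the criterion is the lowest validation MAE, the selection over the epochs listed above would look like this:

```python
# (epoch, validation MAE) pairs copied from the table above.
val_mae_by_epoch = [
    (1, 107.38), (2, 79.56), (3, 70.32), (4, 64.71), (5, 60.30),
    (7, 55.27), (9, 50.52), (12, 47.52), (14, 46.76), (15, 48.29),
]

# Assumed rule: keep the checkpoint with the lowest validation MAE.
best_epoch, best_mae = min(val_mae_by_epoch, key=lambda pair: pair[1])
print(best_epoch, best_mae)  # 14 46.76
```

Under that assumption the checkpoint from epoch 14 would be kept, since validation MAE rises again at epoch 15 while training loss keeps falling.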

Final evaluation

Metric	Score
Test MAE	56.90
Test RMSE	82.93
Test R²	0.847

Evaluation was performed on 507 test samples after 15 training epochs.
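
For reference, the three reported metrics can be computed as in the sketch below. The function name `regression_metrics` and the sample values are illustrative only; the notebook may use a library implementation instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for two equal-length sequences of floats."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up calorie values, only to exercise the function.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, round(rmse, 2), round(r2, 2))  # 15.0 15.81 0.98
```

RMSE weights large errors more heavily than MAE, which is consistent with the reported test RMSE (82.93) being well above the test MAE (56.90).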

Summary

The multimodal model successfully learned to integrate visual information (images), semantic information (ingredients), and numerical features (dish mass) to predict dish calories.

Key takeaways:

The model achieved MAE < 60 kcal on the test set; the best validation MAE was 46.8 kcal (epoch 14).
Final test MAE was 56.9 kcal with an R² of 0.85; the gap of roughly 10 kcal to the best validation MAE suggests reasonable generalization.
Adding dish mass as a feature significantly improved performance, confirming its strong correlation with calorie content.
The experiment demonstrates the effectiveness of combining pretrained transformers and CNNs for real-world multimodal regression tasks.
