Commit 2d68c54

Add 08 content
1 parent 8a2314b commit 2d68c54

5 files changed

Lines changed: 109 additions & 1 deletion

08-multimodal-multimodel/README.md

Lines changed: 107 additions & 1 deletion
@@ -1 +1,107 @@
1-
Doodle summary
1+
# 🧠 Multimodal and Multimodel in AI
2+
3+
Let's learn about two concepts you'll hear a lot in modern AI — **multimodal** and **multi-model**.
4+
They sound almost the same, but they work in very different ways.
5+
6+
![Multimodal and Multi-model AI](../images/multimodal-multimodel.png)
7+
8+
## 🧿 What Is Multimodality?
9+
10+
**Multimodality** means an AI model can understand and generate *multiple types of data* and combine them to reason about the world.
11+
12+
A **multimodal model** is a *single AI model* that works across:
13+
14+
- 📝 Text
15+
- 🖼️ Images
16+
- 🔊 Audio
17+
- 🎥 Video
18+
- 💻 Code
19+
- ➕ And more
20+
21+
Think of it as **one brain with multiple senses** — all working together.
22+
The model can read an image, interpret the text inside it, listen to accompanying audio, and respond in natural language, all in one flow.
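As a minimal sketch of what "one flow" looks like from the application's side, the request below bundles several modalities into a single call. The `Part` type and `ask_multimodal_model` function are hypothetical names invented for illustration, and the model itself is a stub that only reports which modalities it received.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Part:
    """One piece of input, tagged with its modality (hypothetical type)."""
    modality: str                # e.g. "text", "image", "audio"
    data: Union[str, bytes]


def ask_multimodal_model(parts: List[Part]) -> str:
    """Stand-in for a single multimodal model call.

    A real model would reason over all parts jointly. This stub only
    lists the modalities it was given, to show the shape of the call.
    """
    seen = sorted({part.modality for part in parts})
    return "one model, one call, modalities: " + ", ".join(seen)


# Everything goes to the same model in the same request.
request = [
    Part("image", b"<menu photo bytes>"),
    Part("audio", b"<voice question bytes>"),
    Part("text", "Which dishes are gluten-free?"),
]
reply = ask_multimodal_model(request)
print(reply)
```

The point of the sketch is the call shape: the application does no routing of its own and hands every input to the same model.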
23+
24+
### When multimodal models shine
25+
- Simple, end-to-end tasks
26+
- Scenarios where reasoning across multiple data types matters
27+
- Apps where you want **one model** and minimal engineering overhead
28+
- Fast prototyping or lightweight workflows
29+
30+
31+
## 👯 What Is the Multi-model Approach?
32+
33+
A **multi-model** system uses **multiple specialized AI models**, each designed for a specific task.
34+
35+
Examples:
36+
- 👁️ A vision model for image understanding
37+
- 🌐 A translation model for languages
38+
- 💬 A large language model for reasoning
39+
- 🧩 A classifier or embedding model for structured tasks
40+
41+
Your application becomes the **orchestrator**, passing outputs from one model to another as a workflow.
42+
43+
This is like having a **team of experts**, each doing what they're best at.
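The orchestrator role can be sketched as a small function that runs models in order and keeps each intermediate result. This is an illustrative sketch only: `run_pipeline` and the three placeholder stages are hypothetical, and plain functions stand in for real model calls.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, model call) pair. Both are assumptions for this sketch.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], payload: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Pass the payload through each stage in order.

    Returns the final output plus a trace of every intermediate result,
    so the application can inspect or validate each step.
    """
    trace: List[Tuple[str, Any]] = []
    for name, model in stages:
        try:
            payload = model(payload)
        except Exception as exc:
            # Report which model in the chain broke.
            raise RuntimeError(f"stage '{name}' failed") from exc
        trace.append((name, payload))
    return payload, trace


# Placeholder "models": simple functions, not real model calls.
stages: List[Stage] = [
    ("vision", lambda image: "text found in image"),
    ("translation", lambda text: text.upper()),
    ("llm", lambda text: f"answer based on: {text}"),
]

result, trace = run_pipeline(stages, b"<image bytes>")
print(result)
for name, output in trace:
    print(name, "->", output)
```

Keeping the trace is a design choice: because the application owns every hand-off, it can log, validate, or swap out any single stage without touching the others.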
44+
45+
### When multi-model systems shine
46+
- High-accuracy, domain-specific requirements
47+
- Workflows that need fine control at each step
48+
- Combining best-in-class models for each modality
49+
- Large-scale pipelines where cost efficiency matters
50+
51+
## ⚖️ Which Approach Should You Choose?
52+
53+
**Multimodal**
54+
- ✔ Fewer moving parts
55+
- ✔ Easy to build with
56+
- ✔ Great for general use
57+
- ❌ Can be more expensive per inference
58+
- ❌ Not always the best for specialized tasks
59+
60+
**Multi-model**
61+
- ✔ Higher accuracy through specialization
62+
- ✔ More cost-efficient at scale
63+
- ✔ Fine-grained control
64+
- ❌ Requires more engineering
65+
- ❌ More points of failure
66+
67+
## 🧜‍♀️ Hybrid Approaches (Often the Sweet Spot)
68+
69+
In many real applications, you'll mix both:
70+
71+
- Use **specialized models** for tasks like OCR or transcription
72+
- Use a **multimodal model** or LLM on top to reason and produce a final answer
73+
74+
This gives you a balance of accuracy, cost efficiency, and flexibility.
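One way to picture the hybrid pattern: a specialized model does the extraction, and a multimodal model receives both the extracted text and the original image. The sketch below assumes hypothetical `ocr_model` and `multimodal_model` functions with canned outputs. It shows the wiring, not a real integration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    prompt: str       # what the top-level model was asked
    used_image: bool  # whether the raw image was passed along too
    text: str         # the model's reply (canned in this stub)


def ocr_model(image: bytes) -> str:
    """Hypothetical specialized OCR model (stub with canned output)."""
    return "Tempura set 980 yen"


def multimodal_model(prompt: str, image: Optional[bytes] = None) -> Answer:
    """Hypothetical multimodal model or LLM (stub with a canned reply)."""
    return Answer(prompt=prompt, used_image=image is not None, text="<model reply>")


def hybrid_answer(image: bytes, question: str) -> Answer:
    # Step 1: the specialized model handles the narrow task.
    extracted = ocr_model(image)
    # Step 2: the general model reasons over the extracted text and the image.
    prompt = f"Text extracted from the image:\n{extracted}\n\nQuestion: {question}"
    return multimodal_model(prompt, image=image)


answer = hybrid_answer(b"<menu photo bytes>", "Is this dish gluten-free?")
print(answer.prompt)
```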
75+
76+
77+
## 🧪 Example in the video: Multi-model Approach
78+
79+
**Scenario:**
80+
You're traveling in Japan, sitting at a restaurant for lunch.
81+
You take a photo of the menu and ask your app:
82+
83+
> "Can you suggest gluten-free meals from this menu?"
84+
85+
![Diner in Japan](../images/japan-diner.png)
86+
87+
**How the app handles it:**
88+
89+
The app orchestrates three specialized models, each doing what it's best at:
90+
91+
1. **OCR Model** extracts the text from the menu image — including Japanese characters, prices, dish names, and descriptions.
92+
1. **Translation Model** translates the extracted text into English (or the user's preferred language) with high linguistic accuracy.
93+
1. **LLM for Reasoning** analyzes the translated menu, identifies ingredients, checks for gluten-containing items, and returns a clear recommendation of safe dishes.
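The three steps above can be sketched as a short pipeline. Everything here is a stand-in: the function names, the canned menu lines, and the keyword list are assumptions made for illustration. In particular, the "reasoning" step is replaced by a simple keyword filter, which is far weaker than an LLM and is not dietary advice.

```python
from typing import Dict, List


def ocr_model(image: bytes) -> List[str]:
    """Hypothetical OCR model: returns canned menu lines for this sketch."""
    return ["天ぷらうどん", "刺身定食", "醤油ラーメン"]


# Canned lookup table standing in for a real translation model.
_TRANSLATIONS: Dict[str, str] = {
    "天ぷらうどん": "Tempura udon",
    "刺身定食": "Sashimi set meal",
    "醤油ラーメン": "Soy sauce ramen",
}


def translation_model(lines: List[str]) -> List[str]:
    """Hypothetical translation model (dictionary lookup in this stub)."""
    return [_TRANSLATIONS[line] for line in lines]


# Illustrative keywords only; a real LLM would reason about ingredients.
_GLUTEN_HINTS = ("udon", "ramen", "tempura", "soy sauce")


def reasoning_llm(menu: List[str]) -> List[str]:
    """Stand-in for the LLM: drops dishes whose name matches a keyword."""
    return [
        dish for dish in menu
        if not any(hint in dish.lower() for hint in _GLUTEN_HINTS)
    ]


def suggest_gluten_free(image: bytes) -> List[str]:
    # The app is the orchestrator: OCR, then translation, then reasoning.
    japanese_lines = ocr_model(image)
    english_lines = translation_model(japanese_lines)
    return reasoning_llm(english_lines)


suggestions = suggest_gluten_free(b"<menu photo bytes>")
print(suggestions)
```

Each stub can be replaced independently, which is the main appeal of this layout: a better OCR or translation model slots in without changing the other two steps.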
94+
95+
---
96+
97+
## 🚀 Try the Example App
98+
99+
**Now [try the example app yourself](sample/README.md)!**
100+
101+
## 📺 Watch on YouTube (available soon)
102+
103+
Watch the video, **Multimodal and Multi-model AI**, on YouTube:
104+
105+
[![YouTube: Multimodal and Multi-model AI](https://img.youtube.com/vi/0000/0.jpg)](https://www.youtube.com/watch?v=0000)
106+
107+
[Subscribe to our channel!](https://www.youtube.com/channel/UCV_6HOhwxYLXAGd-JOqKPoQ?sub_confirmation=1)

08-multimodal-multimodel/sample/README.md

Lines changed: 2 additions & 0 deletions
@@ -97,6 +97,8 @@ Access the app in your browser at:
97 97

98 98
[http://localhost:3001](http://localhost:3001)
99 99

100+
Download [this Japanese diner menu](../../images/contoso-lunch.png) and try it with the app!
101+
100 102
## 🧠 How It Works
101 103

102 104
The app employs a tiered router-based architecture to handle multimodal and multimodel tasks:

images/contoso-lunch.png

2.25 MB

images/japan-diner.png

4 MB

images/multimodal-multimodel.png

4.28 MB
