08-multimodal-multimodel/README.md
Lines changed: 22 additions & 18 deletions
@@ -3,17 +3,17 @@
 Let's learn about two concepts you'll hear a lot in modern AI — **multimodal** and **multi-model**.
 They sound almost the same, so let's not get confused!

-
+

 ## 🧿 Multimodal

 ### What Is Multimodality?

-**Multimodality** means an AI model can understand and generate *multiple types of data* and combine them to reason about the world.
+**Multimodality** means an AI system can understand and generate *multiple types of data* and combine them to reason about the world.

-### Multimodal model
+### Multimodal Model

-When you hear a **multimodal model**, it means a *single AI model* that works across:
+A **multimodal model** is a *single AI model* that works across:

 - 📝 Text
 - 🖼️ Images
@@ -22,8 +22,7 @@ When you hear a **multimodal model**, it means a *single AI model* that works ac
 - 💻 Code
 etc.

-Think of it as **one brain with multiple senses** — all working together.
-The model can read an image, interpret the text inside it, listen to accompanying audio, and respond in natural language, all in one flow.
+Think of it as **one brain with multiple senses** — all working together. The model can read an image, interpret the text inside it, listen to accompanying audio, and respond in natural language, all in one flow.

 ### When multimodal model approaches shine
 - Simple, end-to-end tasks
@@ -33,9 +32,9 @@ The model can read an image, interpret the text inside it, listen to accompanyin

 ## 👯 Multi-model

-## What Is the Multi-model Approach?
+### What Is the Multi-model Approach?

-A **multimodel** system uses **multiple specialized AI models**, each designed for a specific task.
+A **multi-model** system uses *multiple specialized AI models*, each designed for a specific task.

 Examples:
 - 👁️ A vision model for image understanding
@@ -69,7 +68,7 @@ This is like having a **team of experts**, each doing what they're best at.
 - ❌ Requires more engineering
 - ❌ More points of failure

-## 🧜‍♀️ Hybrid Approaches (Often the Sweet Spot)
+**Hybrid Approaches** (Often the Sweet Spot 🧜‍♀️)

 In many real applications, you'll mix both:

@@ -79,19 +78,23 @@ In many real applications, you'll mix both:
 This gives you a balance of accuracy, cost efficiency, and flexibility.


-## 🧪 Example in the video: Multi-model Approach
+## 🧪 Example in the video

-**Scenario:**
-You're traveling in Japan, sitting at a restaurant for lunch.
-You take a photo of the menu and ask your app:
+**👩 User scenario:**

-> "Can you suggest gluten-free meals from this menu?"
+A user is traveling in Japan and sitting at a local diner for lunch. There is no menu in English, so the user takes a photo of the menu, uploads it to the AI-powered app, and asks:
+
+> 👩 "Can you suggest gluten-free meals from this menu?"



-**How the app handles it:**
+The app suggests *Yasai-itame teishoku* (stir-fried vegetable set) from the menu.
+
+**📱 App scenario:**
+
+The app needs to handle multimodality.

-The app orchestrates three specialized models, each doing what it's best at:
+In this case, the app orchestrates three specialized models, each doing what it's best at:

 1. **OCR Model** extracts the text from the menu image — including Japanese characters, prices, dish names, and descriptions.
 1. **Translation Model** translates the extracted text into English (or the user's preferred language) with high linguistic accuracy.
@@ -107,6 +110,7 @@ The app orchestrates three specialized models, each doing what it's best at:

 Watch the video, **Multimodal and Multi-model AI** on YouTube:

-[]([https://www.youtube.com/watch?v=zkZYeYvBy60](https://www.youtube.com/watch?v=zkZYeYvBy60))
+[](https://www.youtube.com/watch?v=zkZYeYvBy60)