| 1 | | -Doodle summary |
| 1 | +# 🧠 Multimodal and Multi-model in AI |
| 2 | + |
| 3 | +Let's learn about two concepts you'll hear a lot in modern AI — **multimodal** and **multi-model**. |
| 4 | +They sound almost the same, but they work in very different ways. |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | +## 🧿 What Is Multimodality? |
| 9 | + |
| 10 | +**Multimodality** means an AI model can understand, and in some cases generate, *multiple types of data* and combine them to reason about the world. |
| 11 | + |
| 12 | +A **multimodal model** is a *single AI model* that works across: |
| 13 | + |
| 14 | +- 📝 Text |
| 15 | +- 🖼️ Images |
| 16 | +- 🔊 Audio |
| 17 | +- 🎥 Video |
| 18 | +- 💻 Code |
| 19 | +- … and other data types |
| 20 | + |
| 21 | +Think of it as **one brain with multiple senses** — all working together. |
| 22 | +The model can read an image, interpret the text inside it, listen to accompanying audio, and respond in natural language, all in one flow. |
| 23 | + |
| 24 | +### When multimodal models shine |
| 25 | +- Simple, end-to-end tasks |
| 26 | +- Scenarios where reasoning across multiple data types matters |
| 27 | +- Apps where you want **one model** and minimal engineering overhead |
| 28 | +- Fast prototyping or lightweight workflows |
| 29 | + |
| 30 | + |
| 31 | +## 👯 What Is the Multi-model Approach? |
| 32 | + |
| 33 | +A **multi-model** system uses **multiple specialized AI models**, each designed for a specific task. |
| 34 | + |
| 35 | +Examples: |
| 36 | +- 👁️ A vision model for image understanding |
| 37 | +- 🌐 A translation model for languages |
| 38 | +- 💬 A large language model for reasoning |
| 39 | +- 🧩 A classifier or embedding model for structured tasks |
| 40 | + |
| 41 | +Your application becomes the **orchestrator**, passing outputs from one model to another as a workflow. |
| 42 | + |
| 43 | +This is like having a **team of experts**, each doing what they're best at. |
| 44 | + |
| 45 | +### When multi-model systems shine |
| 46 | +- High-accuracy, domain-specific requirements |
| 47 | +- Workflows that need fine control at each step |
| 48 | +- Combining best-in-class models for each modality |
| 49 | +- Large-scale pipelines where cost efficiency matters |
| 50 | + |
| 51 | +## ⚖️ Which Approach Should You Choose? |
| 52 | + |
| 53 | +**Multimodal** |
| 54 | +- ✔ Fewer moving parts |
| 55 | +- ✔ Easy to build with |
| 56 | +- ✔ Great for general use |
| 57 | +- ❌ Can be more expensive per inference |
| 58 | +- ❌ Not always the best for specialized tasks |
| 59 | + |
| 60 | +**Multi-model** |
| 61 | +- ✔ Higher accuracy through specialization |
| 62 | +- ✔ More cost-efficient at scale |
| 63 | +- ✔ Fine-grained control |
| 64 | +- ❌ Requires more engineering |
| 65 | +- ❌ More points of failure |
| 66 | + |
| 67 | +## 🧜‍♀️ Hybrid Approaches (Often the Sweet Spot) |
| 68 | + |
| 69 | +In many real applications, you'll mix both: |
| 70 | + |
| 71 | +- Use **specialized models** for tasks like OCR or transcription |
| 72 | +- Use a **multimodal model** or LLM on top to reason and produce a final answer |
| 73 | + |
| 74 | +This gives you a balance of accuracy, cost efficiency, and flexibility. |
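One way to picture the hybrid setup is a small dispatcher that picks a specialized extractor by input type and then hands the extracted text to a single reasoning model. This is only a minimal sketch: `ocr_stub`, `transcribe_stub`, `reasoning_stub`, and `answer` are hypothetical placeholders invented for illustration, not real library calls, and they return canned strings so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for specialized models (assumptions, not real APIs).
def ocr_stub(data: bytes) -> str:
    return "text read from image"

def transcribe_stub(data: bytes) -> str:
    return "text heard in audio"

# Map each input kind to the specialized model that handles it.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "image": ocr_stub,
    "audio": transcribe_stub,
}

# Hypothetical stand-in for the multimodal model or LLM that reasons on top.
def reasoning_stub(context: str, question: str) -> str:
    return f"Answer to {question!r} based on: {context}"

def answer(kind: str, data: bytes, question: str) -> str:
    """Run the specialized extractor first, then the reasoning model."""
    extractor = EXTRACTORS.get(kind)
    if extractor is None:
        raise ValueError(f"no specialized model registered for {kind!r}")
    return reasoning_stub(extractor(data), question)

if __name__ == "__main__":
    print(answer("image", b"<photo bytes>", "What does the sign say?"))
```

Keeping the extractors in a lookup table means a specialized model can be swapped or added without touching the reasoning step.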
| 75 | + |
| 76 | + |
| 77 | +## 🧪 Example from the Video: Multi-model Approach |
| 78 | + |
| 79 | +**Scenario:** |
| 80 | +You're traveling in Japan, sitting at a restaurant for lunch. |
| 81 | +You take a photo of the menu and ask your app: |
| 82 | + |
| 83 | +> "Can you suggest gluten-free meals from this menu?" |
| 84 | + |
| 85 | + |
| 86 | + |
| 87 | +**How the app handles it:** |
| 88 | + |
| 89 | +The app orchestrates three specialized models, each doing what it's best at: |
| 90 | + |
| 91 | +1. **OCR Model** extracts the text from the menu image — including Japanese characters, prices, dish names, and descriptions. |
| 92 | +1. **Translation Model** translates the extracted text into English (or the user's preferred language) with high linguistic accuracy. |
| 93 | +1. **LLM for Reasoning** analyzes the translated menu, identifies ingredients, checks for gluten-containing items, and returns a clear recommendation of safe dishes. |
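The three steps above can be sketched as a small orchestrator that passes each model's output to the next. Everything here is an assumption made for illustration: `ocr_model`, `translation_model`, and `reasoning_model` are hypothetical stubs (a canned OCR result, a toy phrase table, and a keyword check) standing in for real services, and the sample app may be structured differently.

```python
from typing import Callable, List

# Hypothetical stub: pretend the photo contained these two menu lines.
def ocr_model(image_bytes: bytes) -> str:
    return "天ぷらうどん 900円\n刺身定食 1500円"

# Toy phrase table standing in for a real translation model.
_PHRASES = {
    "天ぷらうどん": "Tempura udon",
    "刺身定食": "Sashimi set meal",
    "円": " yen",
}

def translation_model(text: str) -> str:
    for source, target in _PHRASES.items():
        text = text.replace(source, target)
    return text

# Toy keyword list standing in for an LLM's ingredient reasoning.
_GLUTEN_HINTS = ("udon", "tempura", "ramen")

def reasoning_model(menu_text: str, question: str) -> List[str]:
    """Keep only menu lines whose names contain no gluten hint."""
    return [
        line
        for line in menu_text.splitlines()
        if not any(hint in line.lower() for hint in _GLUTEN_HINTS)
    ]

def suggest_dishes(
    image_bytes: bytes,
    question: str,
    ocr: Callable[[bytes], str] = ocr_model,
    translate: Callable[[str], str] = translation_model,
    reason: Callable[[str, str], List[str]] = reasoning_model,
) -> List[str]:
    """Orchestrate the pipeline: OCR, then translation, then reasoning."""
    raw_text = ocr(image_bytes)
    english_text = translate(raw_text)
    return reason(english_text, question)

if __name__ == "__main__":
    question = "Can you suggest gluten-free meals from this menu?"
    print(suggest_dishes(b"<menu photo>", question))
    # prints ['Sashimi set meal 1500 yen']
```

Because each stage is passed in as a plain function, any one of them can be replaced with a real model client while the orchestration logic stays the same.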
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## 🚀 Try the Example App |
| 98 | + |
| 99 | +**Now [try the example app yourself](sample/README.md)!** |
| 100 | + |
| 101 | +## 📺 Watch on YouTube (Coming Soon) |
| 102 | + |
| 103 | +Watch the video, **Multimodal and Multi-model AI**, on YouTube: |
| 104 | + |
| 105 | +[Multimodal and Multi-model AI](https://www.youtube.com/watch?v=0000) |
| 106 | + |
| 107 | +[Subscribe to our channel!](https://www.youtube.com/channel/UCV_6HOhwxYLXAGd-JOqKPoQ?sub_confirmation=1) |