Commit 8a2314b

Merge pull request #30 from microsoft/08-multimodal-multimodel
readme
2 parents 38fda23 + 574e712 commit 8a2314b

1 file changed

Lines changed: 86 additions & 144 deletions

@@ -1,190 +1,132 @@
1-
# Multimodal + Multimodel AI Demo
1+
# AI Decision Engine Explorer
22

3-
A practical demo showing how MULTIMODAL inputs and MULTIMODEL architecture work together. Upload a Japanese menu image with a question and watch three specialized AI models collaborate to answer it.
3+
**Explore how AI combines multimodal inputs and multimodel processing to solve real-world problems.** This app demonstrates how AI systems analyze diverse inputs such as images, audio, and text, then select the best model for each task. For example, upload a photo of a menu and ask, "What are the vegetarian options?" The app detects the text in the image, translates it if needed, and reasons over the content to provide an answer.
44

5-
## What This Demonstrates
5+
You can upload an image, audio, or text and pose a question to see how the app analyzes each input type individually, selects the best models for the task, and orchestrates them to provide a comprehensive and accurate response.
66

7-
This demo clarifies two concepts that sound similar but mean different things:
7+
## 🌟 Why This App Matters
88

9-
**MULTIMODAL** = Using different types of input data (image + text, audio + text, etc.)
9+
Two related concepts, **Multimodal** and **Multimodel**, do different jobs and work best together:
1010

11-
**MULTIMODEL** = Using multiple specialized AI models in sequence or parallel
11+
- **Multimodal**: By combining different input types (e.g., text, images, audio), AI systems can process information holistically, capturing the richness of real-world data. For example, the app can analyze an image of a menu and a text-based question together to provide a meaningful answer.
12+
- **Multimodel**: Leveraging multiple specialized AI models ensures that each task is handled by the most capable model, leading to more accurate and reliable outcomes. For instance, text recognition, translation, and reasoning are performed by different models, each optimized for its specific task.
1213

13-
Most AI applications need both concepts working together.
14+
When used together, these concepts transform how AI systems operate. By combining diverse data types and selecting the most suitable model for each task, AI can tackle complex, real-world scenarios with a nuanced understanding. This synergy creates systems that are not only more robust but also better equipped to handle the challenges of diverse and dynamic environments.
1415

15-
## Real Example: Japanese Menu Translation
16+
## 🚀 Quick Start
1617

17-
Upload a photo of a Japanese restaurant menu and ask "anything gluten free?"
18+
### 1. Requirements
1819

19-
The system:
20-
1. **Detects multimodal input** - Image (menu photo) + Text (your question)
21-
2. **Selects three specialized models**:
22-
- Llama 4 Maverick (Meta) - Extracts Japanese text from image
23-
- Cohere Command R+ - Translates Japanese to English
24-
- DeepSeek R1 - Answers your question based on translated menu
25-
3. **Chains outputs** - Each model passes results to the next one
20+
Ensure you have the following installed and ready:
2621

27-
This shows why both concepts matter: handling different input types (multimodal) AND using the right model for each task (multimodel).
22+
- **Node.js**: Version 18 or higher
23+
- **Credentials**: Either a GitHub Personal Access Token or Azure OpenAI credentials
2824

29-
## Architecture
25+
### 2. Installation
3026

31-
The app has two main parts:
27+
Follow these steps to set up the app:
3228

33-
**Frontend** (Next.js + TypeScript)
34-
- Upload component for images
35-
- Two visualization boxes showing MULTIMODAL and MULTIMODEL concepts
36-
- Pipeline display showing each model's output
29+
1. Clone the repository:
3730

38-
**Backend** (TypeScript + GitHub Models API)
39-
- Decision engine that analyzes input type
40-
- Model orchestration that chains three models
41-
- Streaming responses for real-time updates
31+
```bash
32+
git clone https://github.com/microsoft/doodle-to-code.git
33+
```
4234

43-
**Models Used**:
44-
- Llama 4 Maverick 17B (Meta) - Vision and OCR
45-
- Cohere Command R+ - Translation (100+ languages)
46-
- DeepSeek R1 - Reasoning and question answering
47-
- Phi-4 Multimodal (Microsoft) - Backup for vision tasks
35+
2. Navigate to the project directory:
4836

49-
All models accessed via GitHub Models API (free tier available).
37+
```bash
38+
cd doodle-to-code/08-multimodal-multimodel/sample
39+
```
5040

51-
## Setup
41+
3. Install dependencies:
5242

53-
### Requirements
43+
```bash
44+
npm install
45+
```
5446

55-
- Node.js 18+
56-
- GitHub Personal Access Token (for GitHub Models API)
47+
4. Copy the example environment file:
5748

58-
### Installation
49+
```bash
50+
cp .env.local.example .env.local
51+
```
5952

60-
1. Clone and install:
61-
```bash
62-
npm install
63-
```
64-
65-
2. Set up environment variables:
66-
```bash
67-
cp .env.local.example .env.local
68-
```
53+
### 3. Configuration
6954

70-
Edit `.env.local`:
71-
```
72-
GITHUB_TOKEN=your_github_token_here
73-
PHI4_MULTIMODAL_MODEL=Phi-4-multimodal-instruct
74-
LLAMA4_MAVERICK_MODEL=Llama-4-Maverick-17B-128E-Instruct-FP8
75-
COHERE_MODEL=Cohere-command-r-plus-08-2024
76-
DEEPSEEK_MODEL=DeepSeek-R1-0528
77-
PHI4_REASONING_MODEL=Phi-4-reasoning-plus
78-
```
79-
80-
3. Run the dev server:
81-
```bash
82-
npm run dev
83-
```
55+
Set up your environment variables based on your preferred model provider:
8456

85-
4. Open http://localhost:3001
57+
#### GitHub Models (Free Tier)
8658

87-
### Getting a GitHub Token
59+
Add the following to your `.env.local` file:
8860

89-
1. Go to https://github.com/settings/tokens
90-
2. Generate new token (classic)
91-
3. Select scope: `read:packages`
92-
4. Copy token to `.env.local`
93-
94-
## How It Works
61+
```env
62+
GITHUB_TOKEN=your_github_token
63+
MODEL_ENDPOINT=https://models.inference.ai.azure.com
64+
```
9565

96-
**Step 1: Input Detection**
97-
- System checks what input types you provided
98-
- Detects: Image, Text, or both
99-
- Displays results in MULTIMODAL box
66+
#### Azure OpenAI
10067

101-
**Step 2: Model Selection**
102-
- Based on input type, system chooses specialized models
103-
- For IMAGE_WITH_TEXT: Llama → Cohere → DeepSeek
104-
- Displays selected models in MULTIMODEL box
68+
Add the following to your `.env.local` file:
10569

106-
**Step 3: Pipeline Execution**
107-
- Models run sequentially:
108-
- Llama extracts Japanese text from image
109-
- Cohere translates Japanese to English
110-
- DeepSeek answers your question using translated menu
111-
- Each model only gets what it needs:
112-
- First model: Gets original image
113-
- Middle models: Get previous model's output
114-
- Last model: Gets previous output + your question
70+
```env
71+
AZURE_OPENAI_API_KEY=your_key
72+
AZURE_OPENAI_ENDPOINT=your_endpoint
73+
MODEL_ENDPOINT=your_endpoint
74+
```
11575

116-
**Step 4: Results**
117-
- See each model's individual output
118-
- Final answer appears at the bottom
119-
- Performance stats: time (~20s) and cost (~$0.01)
76+
#### Model Names (Required)
12077

121-
## Project Structure
78+
Specify the models to use in your `.env.local` file:
12279

80+
```env
81+
PHI4_MULTIMODAL_MODEL=Phi-4-multimodal-instruct
82+
LLAMA4_MAVERICK_MODEL=Llama-4-Maverick-17B-128E-Instruct-FP8
83+
COHERE_MODEL=Cohere-command-r-plus-08-2024
84+
DEEPSEEK_MODEL=DeepSeek-R1-0528
85+
PHI_REASONING_MODEL=Phi-4-reasoning
12386
```
124-
├── src/
125-
│ ├── app/
126-
│ │ ├── api/upload/stream/ # Streaming upload endpoint
127-
│ │ └── page.tsx # Main demo page
128-
│ ├── components/
129-
│ │ ├── DecisionViewer.tsx # Shows MULTIMODAL + MULTIMODEL
130-
│ │ └── UploadComponent.tsx # File upload UI
131-
│ ├── lib/
132-
│ │ └── decision-engine.ts # Model selection and pipeline logic
133-
│ └── types/
134-
│ └── index.ts # TypeScript types
135-
├── .env.local # API keys (not in git)
136-
└── VIDEO_RECORDING_GUIDE.md # Guide for recording demo
137-
```
138-
139-
## Key Files
14087

141-
**decision-engine.ts** - Core logic
142-
- `analyzeStreaming()` - Detects input modality
143-
- `selectModels()` - Chooses which models to use
144-
- `executeModel()` - Runs individual models
145-
- Pipeline flow at lines 920-935
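The environment variables from the configuration step could be read and validated at startup roughly as follows. This is a minimal sketch, not the app's actual `config.ts`: `loadConfig`, `AppConfig`, and the rule that a GitHub token takes priority over Azure credentials are all assumptions made for illustration. Only the variable names come from this README.

```typescript
// Hypothetical sketch: the real src/lib/config.ts may be organized differently.
type Provider = "github" | "azure";

interface AppConfig {
  provider: Provider;
  endpoint: string;
  apiKey: string;
  models: Record<string, string>;
}

// Model variables listed in the configuration section of this README.
const MODEL_VARS = [
  "PHI4_MULTIMODAL_MODEL",
  "LLAMA4_MAVERICK_MODEL",
  "COHERE_MODEL",
  "DEEPSEEK_MODEL",
  "PHI_REASONING_MODEL",
] as const;

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing: string[] = [];

  // Read one variable, recording its name if it is absent or blank.
  const need = (name: string): string => {
    const value = (env[name] ?? "").trim();
    if (value === "") missing.push(name);
    return value;
  };

  // Assumption: a GitHub token, when present, wins over Azure credentials.
  const provider: Provider = env["GITHUB_TOKEN"] ? "github" : "azure";
  const apiKey =
    provider === "github" ? need("GITHUB_TOKEN") : need("AZURE_OPENAI_API_KEY");
  const endpoint = need("MODEL_ENDPOINT");

  const models: Record<string, string> = {};
  for (const name of MODEL_VARS) {
    models[name] = need(name);
  }

  // Report every missing variable at once instead of failing on the first.
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { provider, endpoint, apiKey, models };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test with a plain object.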
88+
### 4. Run the App
14689

147-
**DecisionViewer.tsx** - UI components
148-
- MULTIMODAL box (shows detected inputs)
149-
- MULTIMODEL box (shows selected models)
150-
- Pipeline display (shows execution flow)
90+
Start the development server:
15191

152-
## Performance
153-
154-
Typical request:
155-
- **Time**: ~20 seconds
156-
- **Cost**: < $0.01 (using GitHub Models free tier)
157-
- **Models**: 3 different providers (Meta, Cohere, DeepSeek)
158-
- **Token usage**: ~750 tokens total
92+
```bash
93+
npm run dev
94+
```
15995

160-
## Why This Approach?
96+
Access the app in your browser at:
16197

162-
**Specialization** - Each model does what it's best at:
163-
- Vision models for OCR
164-
- Translation models for languages
165-
- Reasoning models for question answering
98+
[http://localhost:3001](http://localhost:3001)
16699

167-
**Transparency** - You see exactly what each model contributes
100+
## 🧠 How It Works
168101

169-
**Cost efficiency** - Only pay for what you need (each model runs once)
102+
The app employs a tiered router-based architecture to handle multimodal and multimodel tasks:
170103

171-
**Flexibility** - Easy to swap models or add new ones
104+
1. **Decision Engine**: Detects input types, decomposes tasks, selects models, and orchestrates their execution. It ensures that the right models are chosen for each task, optimizing for accuracy and efficiency.
105+
2. **Decision Viewer**: Visualizes the decision-making process, showing how inputs are processed and models interact. This transparency helps users understand the reasoning behind each decision.
106+
3. **API Endpoint**: Manages uploads and streams results back to the client in real time, so users see each model's output as it arrives.
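The routing idea behind the Decision Engine can be sketched in a few functions: detect which input types are present, then map that to an ordered list of models. This is an illustrative sketch under stated assumptions, not the code in `decision-engine.ts`. The image-plus-question route (Llama, then Cohere, then DeepSeek) follows the earlier version of this README; the other routes and every type and function name here are hypothetical.

```typescript
// Hypothetical sketch: names and routes are illustrative, not the app's real API.
type InputKind = "IMAGE_WITH_TEXT" | "IMAGE_ONLY" | "TEXT_ONLY";

interface UserInput {
  image?: Uint8Array;
  text?: string;
}

interface Step {
  model: string;
  task: string;
}

// Multimodal part: work out which input types the user supplied.
function detectInputKind(input: UserInput): InputKind {
  const hasImage = input.image !== undefined && input.image.length > 0;
  const hasText = (input.text ?? "").trim().length > 0;
  if (hasImage && hasText) return "IMAGE_WITH_TEXT";
  if (hasImage) return "IMAGE_ONLY";
  if (hasText) return "TEXT_ONLY";
  throw new Error("No usable input provided");
}

// Multimodel part: pick an ordered chain of specialized models.
function selectPipeline(kind: InputKind): Step[] {
  const ocr: Step = {
    model: "Llama-4-Maverick-17B-128E-Instruct-FP8",
    task: "extract text from the image",
  };
  const translate: Step = {
    model: "Cohere-command-r-plus-08-2024",
    task: "translate the extracted text",
  };
  const reason: Step = {
    model: "DeepSeek-R1-0528",
    task: "answer the question",
  };
  switch (kind) {
    case "IMAGE_WITH_TEXT":
      return [ocr, translate, reason]; // route described in the earlier README
    case "IMAGE_ONLY":
      return [ocr, translate]; // assumed route
    case "TEXT_ONLY":
      return [reason]; // assumed route
  }
}

// Each later step receives the previous output; only the last also sees the question.
function textInputForStep(
  index: number,
  total: number,
  previousOutput: string,
  question: string,
): string {
  const isLast = index === total - 1;
  return isLast && question !== ""
    ? `${previousOutput}\n\nQuestion: ${question}`
    : previousOutput;
}
```

Keeping detection and selection as pure functions means a route can be changed or a model swapped without touching the code that actually calls the models.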
172107

173-
## Limitations
108+
## 📂 Project Structure
174109

175-
- Image size: Keep under 500KB for fast processing
176-
- Language support: Currently optimized for Japanese menus
177-
- Model availability: Requires GitHub Models API access
178-
- Temperature settings: Fixed at 0.3-0.7 (not configurable in UI)
179-
- Token limits: 250 tokens per model (may truncate long outputs)
110+
The project is organized as follows:
180111

181-
## Learn More
112+
```plaintext
113+
src/
114+
├── app/
115+
│ ├── api/upload/stream/ # Streaming API that orchestrates execution
116+
│ └── page.tsx # Main UI with upload + visualizations
117+
├── components/
118+
│ ├── DecisionViewer.tsx # Renders multimodal/multimodel boxes + pipeline
119+
│ └── UploadComponent.tsx # Handles file upload + streaming events
120+
└── lib/
121+
    ├── decision-engine.ts # Core router logic (1041 lines of orchestration)
122+
    ├── config.ts # Environment configuration
123+
    └── types/ # TypeScript definitions
124+
```
182125

183-
- [GitHub Models](https://github.com/marketplace/models) - Free AI model access
184-
- [Llama 4 Maverick](https://ai.meta.com/llama/) - Vision model from Meta
185-
- [Cohere Command R+](https://cohere.com/command) - Translation model
186-
- [DeepSeek R1](https://www.deepseek.com/) - Reasoning model
126+
## 📖 Learn More
187127

188-
## License
128+
Explore these resources to deepen your understanding:
189129

190-
MIT
130+
- [GitHub Models](https://github.com/marketplace/models) — Multi-provider AI model API
131+
- [Azure OpenAI Service](https://azure.microsoft.com/products/ai-services/openai-service) — Enterprise AI platform
132+
- [Router Pattern](https://www.anthropic.com/research/building-effective-agents) — Architectural pattern for AI systems
