Commit 5f78b27

feat(multimodal): add audio/video support
Audio/Video Support:
- Add AudioContentBlock and VideoContentBlock types
- Implement OpenAI audio support (wav/mp3 via input_audio)
- Implement Gemini audio/video native support
- Add multimodal resolve callbacks (customTranscriber, customFrameExtractor)

Provider Fixes:
- Fix Anthropic file block to support base64/url
- Fix OpenAI Responses API audio handling
- Fix provider-env.ts to read OPENAI_API correctly

Tests & Docs:
- Add multimodal integration tests with error handling
- Update skill management tests for new API
- Update zh-CN/en multimodal documentation
1 parent f163078 commit 5f78b27

23 files changed

Lines changed: 1788 additions & 766 deletions

docs/en/guides/multimodal.md

Lines changed: 164 additions & 37 deletions
@@ -1,6 +1,6 @@
11
# Multimodal Content Guide
22

3-
KODE SDK supports multimodal input including images, audio, and files (PDF). This guide covers how to send multimodal content to LLM models and manage multimodal history.
3+
KODE SDK supports multimodal input including images, audio, video, and files (PDF). This guide covers how to send multimodal content to LLMs and manage multimodal history.
44

55
---
66

@@ -10,7 +10,8 @@ KODE SDK supports multimodal input including images, audio, and files (PDF). Thi
1010
|------|------------|---------------------|
1111
| Images | `image` | Anthropic, OpenAI, Gemini, GLM, Minimax |
1212
| PDF Files | `file` | Anthropic, OpenAI (Responses API), Gemini |
13-
| Audio | `audio` | OpenAI, Gemini |
13+
| Audio | `audio` | OpenAI (wav/mp3), Gemini |
14+
| Video | `video` | Gemini |
1415

1516
---
1617

@@ -65,6 +66,34 @@ const content: ContentBlock[] = [
6566
const response = await agent.send(content);
6667
```
6768

69+
### Audio Input
70+
71+
```typescript
72+
const audioBuffer = fs.readFileSync('./audio.wav');
73+
const base64 = audioBuffer.toString('base64');
74+
75+
const content: ContentBlock[] = [
76+
{ type: 'text', text: 'Please transcribe this audio.' },
77+
{ type: 'audio', base64, mime_type: 'audio/wav' }
78+
];
79+
80+
const response = await agent.send(content);
81+
```
82+
83+
### Video Input
84+
85+
```typescript
86+
const videoBuffer = fs.readFileSync('./video.mp4');
87+
const base64 = videoBuffer.toString('base64');
88+
89+
const content: ContentBlock[] = [
90+
{ type: 'text', text: 'Describe what is happening in this video.' },
91+
{ type: 'video', base64, mime_type: 'video/mp4' }
92+
];
93+
94+
const response = await agent.send(content);
95+
```
96+
6897
---
6998

7099
## Multimodal Configuration
@@ -93,9 +122,9 @@ const agent = await Agent.create({
93122
Configure multimodal options in the model configuration:
94123

95124
```typescript
96-
const provider = new AnthropicProvider(
97-
process.env.ANTHROPIC_API_KEY!,
98-
'claude-sonnet-4-20250514',
125+
const provider = new GeminiProvider(
126+
process.env.GEMINI_API_KEY!,
127+
'gemini-2.0-flash-exp',
99128
undefined, // baseUrl
100129
undefined, // proxyUrl
101130
{
@@ -105,9 +134,12 @@ const provider = new AnthropicProvider(
105134
allowMimeTypes: [ // Allowed MIME types
106135
'image/jpeg',
107136
'image/png',
108-
'image/gif',
109137
'image/webp',
110138
'application/pdf',
139+
'audio/wav',
140+
'audio/mp3',
141+
'video/mp4',
142+
'video/webm',
111143
],
112144
},
113145
}
@@ -139,21 +171,37 @@ const provider = new AnthropicProvider(
139171
|-----------|-----------|-------|
140172
| `application/pdf` | `.pdf` | Anthropic, OpenAI (Responses API), Gemini |
141173

174+
### Audio
175+
176+
| MIME Type | Extension | Notes |
177+
|-----------|-----------|-------|
178+
| `audio/wav` | `.wav` | OpenAI, Gemini |
179+
| `audio/mp3` | `.mp3` | OpenAI, Gemini |
180+
| `audio/mpeg` | `.mp3` | OpenAI, Gemini |
181+
| `audio/ogg` | `.ogg` | Gemini only |
182+
| `audio/flac` | `.flac` | Gemini only |
183+
184+
### Video
185+
186+
| MIME Type | Extension | Notes |
187+
|-----------|-----------|-------|
188+
| `video/mp4` | `.mp4` | Gemini only |
189+
| `video/webm` | `.webm` | Gemini only |
190+
| `video/quicktime` | `.mov` | Gemini only |
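
The audio and video tables above can be folded into a small lookup that answers "which providers accept this MIME type?" before a request is built. This is a minimal sketch: `AV_SUPPORT` and `providersFor` are hypothetical names for illustration, not part of KODE SDK, and the data is copied only from the two tables above.

```typescript
// Hypothetical lookup built from the Audio and Video tables in this guide.
type Provider = 'openai' | 'gemini';

const AV_SUPPORT: Record<string, Provider[]> = {
  'audio/wav': ['openai', 'gemini'],
  'audio/mp3': ['openai', 'gemini'],
  'audio/mpeg': ['openai', 'gemini'],
  'audio/ogg': ['gemini'],
  'audio/flac': ['gemini'],
  'video/mp4': ['gemini'],
  'video/webm': ['gemini'],
  'video/quicktime': ['gemini'],
};

// Returns the providers listed for a MIME type, or an empty array if the
// type does not appear in the tables.
function providersFor(mimeType: string): Provider[] {
  return AV_SUPPORT[mimeType.toLowerCase()] ?? [];
}
```

An empty result is a signal to either switch providers or fall back to a resolve callback such as `customFrameExtractor`.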
191+
142192
---
143193

144194
## Provider-Specific Notes
145195

146196
### Anthropic
147197

148198
- Supports images and PDF files
149-
- Use `files-api-2025-04-14` beta for file uploads
199+
- Files API beta header is automatically added when file blocks are detected
150200
- Base64 images embedded directly in messages
201+
- **Audio and video are not supported**
151202

152203
```typescript
153204
const provider = new AnthropicProvider(apiKey, model, baseUrl, proxyUrl, {
154-
beta: {
155-
filesApi: true, // Enable Files API
156-
},
157205
multimodal: {
158206
mode: 'url+base64',
159207
},
@@ -163,58 +211,98 @@ const provider = new AnthropicProvider(apiKey, model, baseUrl, proxyUrl, {
163211
### OpenAI
164212

165213
- Images: Supported in Chat Completions API
166-
- PDF/Files: Requires Responses API (`openaiApi: 'responses'`)
214+
- PDF/Files: Requires Responses API (`api: 'responses'`)
215+
- Audio: Supports wav/mp3 formats via Chat Completions API `input_audio` type
216+
- **Video is not supported** (use a `customFrameExtractor` callback to extract frames as images)
167217

168218
```typescript
169219
const provider = new OpenAIProvider(apiKey, model, baseUrl, proxyUrl, {
170220
api: 'responses', // Required for PDF support
171221
multimodal: {
172222
mode: 'url+base64',
223+
allowMimeTypes: [
224+
'image/jpeg', 'image/png', 'image/webp',
225+
'audio/wav', 'audio/mp3',
226+
'application/pdf',
227+
],
173228
},
174229
});
175230
```
176231

177232
### Gemini
178233

179-
- Supports images and PDF files
234+
- Supports images, PDF, audio, and video
180235
- GIF format not supported
181-
- Use `mediaResolution` option for image quality
236+
- Audio and video natively supported without special configuration
182237

183238
```typescript
184239
const provider = new GeminiProvider(apiKey, model, baseUrl, proxyUrl, {
185-
mediaResolution: 'high', // 'low' | 'medium' | 'high'
186240
multimodal: {
187241
mode: 'url+base64',
242+
allowMimeTypes: [
243+
'image/jpeg', 'image/png', 'image/webp',
244+
'application/pdf',
245+
'audio/wav', 'audio/mp3', 'audio/ogg',
246+
'video/mp4', 'video/webm',
247+
],
248+
},
249+
});
250+
```
251+
252+
---
253+
254+
## Video Fallback Handling
255+
256+
For providers that don't support video (such as OpenAI), you can configure a `customFrameExtractor` callback that extracts video frames as images:
257+
258+
```typescript
259+
const multimodalConfig = {
260+
mode: 'url+base64',
261+
maxBase64Bytes: 20_000_000,
262+
video: {
263+
// Extract key frames when provider doesn't support video
264+
customFrameExtractor: async (video: { base64?: string; url?: string; mimeType?: string }) => {
265+
// Use ffmpeg or other tools to extract key frames
266+
// Return array of images
267+
return [
268+
{ base64: '...', mimeType: 'image/jpeg' },
269+
{ base64: '...', mimeType: 'image/jpeg' },
270+
];
271+
},
188272
},
273+
};
274+
275+
const provider = new OpenAIProvider(apiKey, model, baseUrl, proxyUrl, {
276+
multimodal: multimodalConfig,
189277
});
190278
```
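
The extractor body above is left as a placeholder. One possible way to fill it in, assuming `ffmpeg` is installed on the host, is to sample frames at a fixed rate and cap the total count. The sketch below only builds the argument list. `buildFrameExtractionArgs` and `FrameOptions` are hypothetical names, not part of KODE SDK, while `-i`, `-vf fps=`, `-frames:v`, and `-q:v` are standard ffmpeg options.

```typescript
// Hypothetical helper: builds ffmpeg arguments for key-frame sampling.
interface FrameOptions {
  fps: number;       // frames sampled per second of video (may be fractional)
  maxFrames: number; // hard cap on the number of extracted frames
}

function buildFrameExtractionArgs(
  inputPath: string,
  outputPattern: string, // e.g. 'frame_%03d.jpg'
  options: FrameOptions,
): string[] {
  if (!(options.fps > 0) || !Number.isInteger(options.maxFrames) || options.maxFrames < 1) {
    throw new Error('fps must be positive and maxFrames a positive integer');
  }
  return [
    '-i', inputPath,                        // input video
    '-vf', `fps=${options.fps}`,            // sample at a fixed frame rate
    '-frames:v', String(options.maxFrames), // stop after maxFrames frames
    '-q:v', '2',                            // high JPEG quality
    outputPattern,
  ];
}
```

The resulting array can be passed to Node's `child_process.execFile('ffmpeg', args, ...)`. The written JPEG files would then be read and base64-encoded to produce the image array that the callback returns.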
191279

192280
---
193281

194282
## Best Practices
195283

196-
### 1. Use Appropriate Image Sizes
284+
### 1. Use Appropriate File Sizes
197285

198-
Large images increase token usage and latency. Resize images before sending:
286+
Large files increase token usage and latency. Resize or compress them before sending:
199287

200288
```typescript
201-
// Recommendation: Keep images under 1MB for optimal performance
289+
// Recommendation: Keep files under 1MB for optimal performance
202290
const maxBytes = 1024 * 1024; // 1MB
203291

204-
function validateImageSize(base64: string): boolean {
292+
function validateFileSize(base64: string): boolean {
205293
const bytes = Math.ceil(base64.length * 3 / 4);
206294
return bytes <= maxBytes;
207295
}
208296
```
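
The estimate above (`length * 3 / 4`) is an upper bound: it ignores `=` padding, so it can overshoot by one or two bytes. When a payload sits right at a limit such as `maxBase64Bytes`, an exact count may be preferable. A minimal sketch, where `decodedByteLength` is a hypothetical helper name rather than an SDK function:

```typescript
// Hypothetical helper: exact decoded size of a base64 string.
// Works for padded and unpadded input; whitespace is ignored.
function decodedByteLength(base64: string): number {
  const clean = base64.replace(/\s+/g, '');
  if (clean.length === 0) return 0;
  // Each '=' at the end stands for one byte that is not present.
  const padding = clean.slice(-2) === '==' ? 2 : clean.slice(-1) === '=' ? 1 : 0;
  return Math.floor((clean.length * 3) / 4) - padding;
}

function fitsWithin(base64: string, maxBytes: number): boolean {
  return decodedByteLength(base64) <= maxBytes;
}
```

For example, `'TWFu'` decodes to 3 bytes, `'TWE='` to 2, and `'TQ=='` to 1.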
209297

210298
### 2. Handle Multimodal Context Retention
211299

212-
For long conversations with many images, configure retention to avoid context overflow:
300+
For long conversations with many multimedia files, configure retention to avoid context overflow:
213301

214302
```typescript
215303
const agent = await Agent.create({
216304
templateId: 'vision-assistant',
217-
multimodalRetention: { keepRecent: 2 }, // Keep only recent 2 images
305+
multimodalRetention: { keepRecent: 2 }, // Keep only recent 2 multimedia messages
218306
context: {
219307
maxTokens: 100_000,
220308
compressToTokens: 60_000,
@@ -227,19 +315,22 @@ const agent = await Agent.create({
227315
Always validate MIME types before sending:
228316

229317
```typescript
230-
const ALLOWED_IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/webp'];
318+
const ALLOWED_TYPES: Record<string, string[]> = {
319+
image: ['image/jpeg', 'image/png', 'image/webp'],
320+
audio: ['audio/wav', 'audio/mp3', 'audio/mpeg'],
321+
video: ['video/mp4', 'video/webm'],
322+
};
231323

232-
function getImageMimeType(filename: string): string {
324+
function getMimeType(filename: string, category: 'image' | 'audio' | 'video'): string {
233325
const ext = filename.toLowerCase().split('.').pop();
234326
const mimeMap: Record<string, string> = {
235-
jpg: 'image/jpeg',
236-
jpeg: 'image/jpeg',
237-
png: 'image/png',
238-
webp: 'image/webp',
327+
jpg: 'image/jpeg', jpeg: 'image/jpeg', png: 'image/png', webp: 'image/webp',
328+
wav: 'audio/wav', mp3: 'audio/mp3',
329+
mp4: 'video/mp4', webm: 'video/webm',
239330
};
240331
const mimeType = mimeMap[ext!];
241-
if (!mimeType || !ALLOWED_IMAGE_TYPES.includes(mimeType)) {
242-
throw new Error(`Unsupported image type: ${ext}`);
332+
if (!mimeType || !ALLOWED_TYPES[category].includes(mimeType)) {
333+
throw new Error(`Unsupported ${category} type: ${ext}`);
243334
}
244335
return mimeType;
245336
}
@@ -254,22 +345,25 @@ Common multimodal errors:
254345
| Error | Cause | Solution |
255346
|-------|-------|----------|
256347
| `MultimodalValidationError: Base64 is not allowed` | `mode` set to `'url'` only | Set `mode: 'url+base64'` |
257-
| `MultimodalValidationError: base64 payload too large` | Exceeds `maxBase64Bytes` | Resize image or increase limit |
348+
| `MultimodalValidationError: base64 payload too large` | Exceeds `maxBase64Bytes` | Resize file or increase limit |
258349
| `MultimodalValidationError: mime_type not allowed` | MIME type not in allowlist | Add to `allowMimeTypes` |
259350
| `MultimodalValidationError: Missing url/file_id/base64` | No content source provided | Provide `url`, `file_id`, or `base64` |
351+
| `UnsupportedContentBlockError: Unsupported content block type: video` | Provider doesn't support video | Use Gemini or configure `customFrameExtractor` |
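
The table above can also be applied programmatically, for instance to print a hint when a request fails. The sketch below matches on the message fragments listed in the table. It assumes the thrown error's `message` contains that text, which the table suggests but does not guarantee, and `suggestRemedy` is a hypothetical helper, not an SDK export.

```typescript
// Hypothetical helper: maps known multimodal error messages to the
// solutions listed in the table above.
const REMEDIES: Array<[string, string]> = [
  ['Base64 is not allowed', "Set mode: 'url+base64'"],
  ['base64 payload too large', 'Resize the file or increase maxBase64Bytes'],
  ['mime_type not allowed', 'Add the MIME type to allowMimeTypes'],
  ['Missing url/file_id/base64', 'Provide url, file_id, or base64'],
  ['Unsupported content block type: video', 'Use Gemini or configure customFrameExtractor'],
];

// Returns a suggested fix, or undefined when the error is not recognized.
function suggestRemedy(err: unknown): string | undefined {
  const message = err instanceof Error ? err.message : String(err);
  const hit = REMEDIES.find(([needle]) => message.indexOf(needle) !== -1);
  return hit ? hit[1] : undefined;
}
```

A caller might wrap the send call in `try`/`catch` and log `suggestRemedy(error)` alongside the original error.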
260352

261353
---
262354

263-
## Complete Example
355+
## Complete Examples
356+
357+
### Image Analysis Example
264358

265359
```typescript
266-
import { Agent, AnthropicProvider, JSONStore, ContentBlock } from '@shareai-lab/kode-sdk';
360+
import { Agent, GeminiProvider, JSONStore, ContentBlock } from '@shareai-lab/kode-sdk';
267361
import * as fs from 'fs';
268362

269363
async function analyzeImage() {
270-
const provider = new AnthropicProvider(
271-
process.env.ANTHROPIC_API_KEY!,
272-
'claude-sonnet-4-20250514',
364+
const provider = new GeminiProvider(
365+
process.env.GEMINI_API_KEY!,
366+
'gemini-2.0-flash-exp',
273367
undefined,
274368
undefined,
275369
{
@@ -294,7 +388,6 @@ async function analyzeImage() {
294388
modelFactory: () => provider,
295389
});
296390

297-
// Read and send image
298391
const imageBuffer = fs.readFileSync('./photo.jpg');
299392
const base64 = imageBuffer.toString('base64');
300393

@@ -303,14 +396,48 @@ async function analyzeImage() {
303396
{ type: 'image', base64, mime_type: 'image/jpeg' }
304397
];
305398

306-
for await (const envelope of agent.subscribe(['progress'])) {
399+
// Use chatStream for streaming responses
400+
for await (const envelope of agent.chatStream(content)) {
307401
if (envelope.event.type === 'text_chunk') {
308402
process.stdout.write(envelope.event.delta);
309403
}
310404
if (envelope.event.type === 'done') break;
311405
}
406+
}
407+
```
408+
409+
### Audio Transcription Example
410+
411+
```typescript
412+
async function transcribeAudio() {
413+
const audioBuffer = fs.readFileSync('./speech.wav');
414+
const base64 = audioBuffer.toString('base64');
415+
416+
const content: ContentBlock[] = [
417+
{ type: 'text', text: 'Please transcribe this audio and identify the speaker\'s emotion.' },
418+
{ type: 'audio', base64, mime_type: 'audio/wav' }
419+
];
420+
421+
const response = await agent.chat(content);
422+
console.log(response.text);
423+
}
424+
```
425+
426+
### Video Analysis Example
427+
428+
```typescript
429+
async function analyzeVideo() {
430+
const videoBuffer = fs.readFileSync('./clip.mp4');
431+
const base64 = videoBuffer.toString('base64');
432+
433+
const content: ContentBlock[] = [
434+
{ type: 'text', text: 'What is happening in this video? Please describe in detail.' },
435+
{ type: 'video', base64, mime_type: 'video/mp4' }
436+
];
312437

313-
await agent.send(content);
438+
// Note: Only Gemini supports video
439+
const response = await agent.chat(content);
440+
console.log(response.text);
314441
}
315442
```
316443
