 # Multimodal Content Guide
 
-KODE SDK supports multimodal input including images, audio, and files (PDF). This guide covers how to send multimodal content to LLM models and manage multimodal history.
+KODE SDK supports multimodal input including images, audio, video, and files (PDF). This guide covers how to send multimodal content to LLM models and manage multimodal history.
 
 ---
 
@@ -10,7 +10,8 @@ KODE SDK supports multimodal input including images, audio, and files (PDF). Thi
 |------|------------|---------------------|
 | Images | `image` | Anthropic, OpenAI, Gemini, GLM, Minimax |
 | PDF Files | `file` | Anthropic, OpenAI (Responses API), Gemini |
-| Audio | `audio` | OpenAI, Gemini |
+| Audio | `audio` | OpenAI (wav/mp3), Gemini |
+| Video | `video` | Gemini |
 
 ---
 
@@ -65,6 +66,34 @@ const content: ContentBlock[] = [
 const response = await agent.send(content);
 ```
 
+### Audio Input
+
+```typescript
+const audioBuffer = fs.readFileSync('./audio.wav');
+const base64 = audioBuffer.toString('base64');
+
+const content: ContentBlock[] = [
+  { type: 'text', text: 'Please transcribe this audio.' },
+  { type: 'audio', base64, mime_type: 'audio/wav' }
+];
+
+const response = await agent.send(content);
+```
+
+### Video Input
+
+```typescript
+const videoBuffer = fs.readFileSync('./video.mp4');
+const base64 = videoBuffer.toString('base64');
+
+const content: ContentBlock[] = [
+  { type: 'text', text: 'Describe what is happening in this video.' },
+  { type: 'video', base64, mime_type: 'video/mp4' }
+];
+
+const response = await agent.send(content);
+```
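
The audio and video snippets above build the same block shape and differ only in `type` and `mime_type`. A minimal sketch of a shared helper is shown here. `MediaBlock` is a local stand-in that mirrors only the fields this guide shows (`type`, `base64`, `mime_type`), and `kindFromMime` and `toMediaBlock` are hypothetical names, not part of the SDK.

```typescript
// Local stand-in for the SDK's ContentBlock, limited to the fields shown in this guide.
type MediaKind = 'image' | 'audio' | 'video' | 'file';

interface MediaBlock {
  type: MediaKind;
  base64: string;
  mime_type: string;
}

// Hypothetical helper: derive the block type from the MIME type prefix.
// Anything that is not image/audio/video (e.g. application/pdf) maps to 'file'.
function kindFromMime(mimeType: string): MediaKind {
  if (mimeType.indexOf('image/') === 0) return 'image';
  if (mimeType.indexOf('audio/') === 0) return 'audio';
  if (mimeType.indexOf('video/') === 0) return 'video';
  return 'file';
}

// Hypothetical helper: wrap an already base64-encoded payload in a media block.
function toMediaBlock(base64: string, mimeType: string): MediaBlock {
  return { type: kindFromMime(mimeType), base64: base64, mime_type: mimeType };
}
```

With a helper like this, the call site reduces to encoding the file and calling `toMediaBlock(base64, 'video/mp4')`, which keeps `type` and `mime_type` from drifting apart.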
+
 ---
 
 ## Multimodal Configuration
@@ -93,9 +122,9 @@ const agent = await Agent.create({
 Configure multimodal options in the model configuration:
 
 ```typescript
-const provider = new AnthropicProvider(
-  process.env.ANTHROPIC_API_KEY!,
-  'claude-sonnet-4-20250514',
+const provider = new GeminiProvider(
+  process.env.GEMINI_API_KEY!,
+  'gemini-2.0-flash-exp',
   undefined, // baseUrl
   undefined, // proxyUrl
   {
@@ -105,9 +134,12 @@ const provider = new AnthropicProvider(
       allowMimeTypes: [ // Allowed MIME types
         'image/jpeg',
         'image/png',
-        'image/gif',
         'image/webp',
         'application/pdf',
+        'audio/wav',
+        'audio/mp3',
+        'video/mp4',
+        'video/webm',
       ],
     },
   }
@@ -139,21 +171,37 @@ const provider = new AnthropicProvider(
 |-----------|-----------|-------|
 | `application/pdf` | `.pdf` | Anthropic, OpenAI (Responses API), Gemini |
 
+### Audio
+
+| MIME Type | Extension | Notes |
+|-----------|-----------|-------|
+| `audio/wav` | `.wav` | OpenAI, Gemini |
+| `audio/mp3` | `.mp3` | OpenAI, Gemini |
+| `audio/mpeg` | `.mp3` | OpenAI, Gemini |
+| `audio/ogg` | `.ogg` | Gemini only |
+| `audio/flac` | `.flac` | Gemini only |
+
+### Video
+
+| MIME Type | Extension | Notes |
+|-----------|-----------|-------|
+| `video/mp4` | `.mp4` | Gemini only |
+| `video/webm` | `.webm` | Gemini only |
+| `video/quicktime` | `.mov` | Gemini only |
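
The audio and video tables above can be turned into a lookup that is checked before a block is sent. This is only a sketch: the map is transcribed by hand from the tables in this guide rather than read from the SDK, and `Provider`, `AUDIO_VIDEO_SUPPORT`, and `supportsMime` are hypothetical names.

```typescript
type Provider = 'anthropic' | 'openai' | 'gemini';

// Transcribed from the Audio and Video tables in this guide (assumption: the
// tables are complete; the SDK itself is not consulted here).
const AUDIO_VIDEO_SUPPORT: Record<string, Provider[]> = {
  'audio/wav': ['openai', 'gemini'],
  'audio/mp3': ['openai', 'gemini'],
  'audio/mpeg': ['openai', 'gemini'],
  'audio/ogg': ['gemini'],
  'audio/flac': ['gemini'],
  'video/mp4': ['gemini'],
  'video/webm': ['gemini'],
  'video/quicktime': ['gemini'],
};

// Returns true only when the MIME type is listed and the provider appears in its row.
function supportsMime(provider: Provider, mimeType: string): boolean {
  const providers = AUDIO_VIDEO_SUPPORT[mimeType];
  return providers !== undefined && providers.indexOf(provider) !== -1;
}
```

A check like `supportsMime('openai', 'video/mp4')` returning `false` is a natural place to switch to the frame-extraction fallback instead of letting the request fail.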
+
 ---
 
 ## Provider-Specific Notes
 
 ### Anthropic
 
 - Supports images and PDF files
-- Use `files-api-2025-04-14` beta for file uploads
+- Files API beta header is automatically added when file blocks are detected
 - Base64 images embedded directly in messages
+- **Audio and video are not supported**
 
 ```typescript
 const provider = new AnthropicProvider(apiKey, model, baseUrl, proxyUrl, {
-  beta: {
-    filesApi: true, // Enable Files API
-  },
   multimodal: {
     mode: 'url+base64',
   },
@@ -163,58 +211,98 @@ const provider = new AnthropicProvider(apiKey, model, baseUrl, proxyUrl, {
 ### OpenAI
 
 - Images: Supported in Chat Completions API
-- PDF/Files: Requires Responses API (`openaiApi: 'responses'`)
+- PDF/Files: Requires Responses API (`api: 'responses'`)
+- Audio: Supports wav/mp3 formats via the Chat Completions API `input_audio` type
+- **Video is not supported** (use a `customFrameExtractor` callback to extract frames as images)
 
 ```typescript
 const provider = new OpenAIProvider(apiKey, model, baseUrl, proxyUrl, {
   api: 'responses', // Required for PDF support
   multimodal: {
     mode: 'url+base64',
+    allowMimeTypes: [
+      'image/jpeg', 'image/png', 'image/webp',
+      'audio/wav', 'audio/mp3',
+      'application/pdf',
+    ],
   },
 });
 ```
176231
177232### Gemini
178233
179- - Supports images and PDF files
234+ - Supports images, PDF, audio, and video
180235- GIF format not supported
181- - Use ` mediaResolution ` option for image quality
236+ - Audio and video natively supported without special configuration
182237
183238``` typescript
184239const provider = new GeminiProvider (apiKey , model , baseUrl , proxyUrl , {
185- mediaResolution: ' high' , // 'low' | 'medium' | 'high'
186240 multimodal: {
187241 mode: ' url+base64' ,
242+ allowMimeTypes: [
243+ ' image/jpeg' , ' image/png' , ' image/webp' ,
244+ ' application/pdf' ,
245+ ' audio/wav' , ' audio/mp3' , ' audio/ogg' ,
246+ ' video/mp4' , ' video/webm' ,
247+ ],
248+ },
249+ });
250+ ```
251+
252+ ---
253+
+## Video Fallback Handling
+
+For providers that don't support video (like OpenAI), you can configure a `customFrameExtractor` callback to extract video frames as images:
+
+```typescript
+const multimodalConfig = {
+  mode: 'url+base64',
+  maxBase64Bytes: 20_000_000,
+  video: {
+    // Extract key frames when provider doesn't support video
+    customFrameExtractor: async (video: { base64?: string; url?: string; mimeType?: string }) => {
+      // Use ffmpeg or other tools to extract key frames
+      // Return array of images
+      return [
+        { base64: '...', mimeType: 'image/jpeg' },
+        { base64: '...', mimeType: 'image/jpeg' },
+      ];
+    },
   },
+};
+
+const provider = new OpenAIProvider(apiKey, model, baseUrl, proxyUrl, {
+  multimodal: multimodalConfig,
 });
 ```
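
The extractor above leaves the frame selection to "ffmpeg or other tools". One possible approach, sketched here under assumptions, is to sample evenly spaced timestamps and ask ffmpeg for one JPEG per timestamp. `frameTimestamps` and `ffmpegFrameArgs` are hypothetical helpers; the video duration has to be obtained separately (for example with ffprobe), and actually spawning ffmpeg and base64-encoding its output is left out.

```typescript
// Hypothetical helper: pick `count` evenly spaced timestamps (in seconds) by
// sampling the midpoint of each equal-length segment of the video.
function frameTimestamps(durationSec: number, count: number): number[] {
  if (durationSec <= 0 || count <= 0) return [];
  const step = durationSec / count;
  const stamps: number[] = [];
  for (let i = 0; i < count; i++) {
    stamps.push((i + 0.5) * step);
  }
  return stamps;
}

// Hypothetical helper: ffmpeg arguments that seek to `timestampSec`, write a
// single video frame as a JPEG, and overwrite `output` if it already exists.
function ffmpegFrameArgs(input: string, timestampSec: number, output: string): string[] {
  return [
    '-ss', timestampSec.toFixed(3), // seek before decoding the input
    '-i', input,
    '-frames:v', '1',               // emit exactly one video frame
    '-q:v', '2',                    // JPEG quality (lower is better)
    '-y', output,
  ];
}
```

Midpoint sampling avoids the first and last instants of a clip, which are often black or transitional frames. True key-frame or scene-change detection would need a different ffmpeg filter setup than this sketch shows.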
 
 ---
 
 ## Best Practices
 
-### 1. Use Appropriate Image Sizes
+### 1. Use Appropriate File Sizes
 
-Large images increase token usage and latency. Resize images before sending:
+Large files increase token usage and latency. Resize or compress them before sending:
 
 ```typescript
-// Recommendation: Keep images under 1MB for optimal performance
+// Recommendation: Keep files under 1MB for optimal performance
 const maxBytes = 1024 * 1024; // 1MB
 
-function validateImageSize(base64: string): boolean {
+function validateFileSize(base64: string): boolean {
   const bytes = Math.ceil(base64.length * 3 / 4);
   return bytes <= maxBytes;
 }
 ```
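
The `length * 3 / 4` estimate above ignores `=` padding, so it can overcount by one or two bytes. That is harmless for a 1MB guideline but can matter when comparing against a hard limit such as `maxBase64Bytes`. A sketch of an exact count follows, assuming the input is plain base64 with no whitespace and no `data:` prefix; `base64DecodedBytes` and `fitsWithin` are hypothetical names.

```typescript
// Exact decoded size of a base64 string, with or without trailing '=' padding.
// Assumes no line breaks, whitespace, or data-URL prefix in the input.
function base64DecodedBytes(base64: string): number {
  const len = base64.length;
  if (len === 0) return 0;
  let padding = 0;
  if (base64.charAt(len - 1) === '=') padding++;
  if (len > 1 && base64.charAt(len - 2) === '=') padding++;
  // Every 4 characters carry 3 bytes; each padding character stands for one missing byte.
  return Math.floor((len * 3) / 4) - padding;
}

function fitsWithin(base64: string, maxBytes: number): boolean {
  return base64DecodedBytes(base64) <= maxBytes;
}
```

For example, `'aGVsbG8='` (the encoding of `hello`) is 8 characters long: the simple estimate gives 6 bytes, while the padded count gives the actual 5.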
 
 ### 2. Handle Multimodal Context Retention
 
-For long conversations with many images, configure retention to avoid context overflow:
+For long conversations with many multimedia files, configure retention to avoid context overflow:
 
 ```typescript
 const agent = await Agent.create({
   templateId: 'vision-assistant',
-  multimodalRetention: { keepRecent: 2 }, // Keep only recent 2 images
+  multimodalRetention: { keepRecent: 2 }, // Keep only recent 2 multimedia messages
   context: {
     maxTokens: 100_000,
     compressToTokens: 60_000,
@@ -227,19 +315,22 @@ const agent = await Agent.create({
 Always validate MIME types before sending:
 
 ```typescript
-const ALLOWED_IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/webp'];
+const ALLOWED_TYPES: Record<string, string[]> = {
+  image: ['image/jpeg', 'image/png', 'image/webp'],
+  audio: ['audio/wav', 'audio/mp3', 'audio/mpeg'],
+  video: ['video/mp4', 'video/webm'],
+};
 
-function getImageMimeType(filename: string): string {
+function getMimeType(filename: string, category: 'image' | 'audio' | 'video'): string {
   const ext = filename.toLowerCase().split('.').pop();
   const mimeMap: Record<string, string> = {
-    jpg: 'image/jpeg',
-    jpeg: 'image/jpeg',
-    png: 'image/png',
-    webp: 'image/webp',
+    jpg: 'image/jpeg', jpeg: 'image/jpeg', png: 'image/png', webp: 'image/webp',
+    wav: 'audio/wav', mp3: 'audio/mp3',
+    mp4: 'video/mp4', webm: 'video/webm',
   };
   const mimeType = mimeMap[ext!];
-  if (!mimeType || !ALLOWED_IMAGE_TYPES.includes(mimeType)) {
-    throw new Error(`Unsupported image type: ${ext}`);
+  if (!mimeType || !ALLOWED_TYPES[category].includes(mimeType)) {
+    throw new Error(`Unsupported ${category} type: ${ext}`);
   }
   return mimeType;
 }
@@ -254,22 +345,25 @@ Common multimodal errors:
 | Error | Cause | Solution |
 |-------|-------|----------|
 | `MultimodalValidationError: Base64 is not allowed` | `mode` set to `'url'` only | Set `mode: 'url+base64'` |
-| `MultimodalValidationError: base64 payload too large` | Exceeds `maxBase64Bytes` | Resize image or increase limit |
+| `MultimodalValidationError: base64 payload too large` | Exceeds `maxBase64Bytes` | Resize file or increase limit |
 | `MultimodalValidationError: mime_type not allowed` | MIME type not in allowlist | Add to `allowMimeTypes` |
 | `MultimodalValidationError: Missing url/file_id/base64` | No content source provided | Provide `url`, `file_id`, or `base64` |
+| `UnsupportedContentBlockError: Unsupported content block type: video` | Provider doesn't support video | Use Gemini or configure `customFrameExtractor` |
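
The error table above can double as a small lookup for logging or user-facing hints. The sketch below matches on message text only, and it assumes the thrown errors carry the message fragments exactly as the table lists them. `REMEDIES` and `suggestRemedy` are hypothetical names, and if the SDK exports the error classes, an `instanceof` check would likely be sturdier than string matching.

```typescript
// Message fragments and remedies transcribed from the table in this guide.
const REMEDIES: Array<[string, string]> = [
  ['Base64 is not allowed', "Set mode: 'url+base64'"],
  ['base64 payload too large', 'Resize the file or raise maxBase64Bytes'],
  ['mime_type not allowed', 'Add the MIME type to allowMimeTypes'],
  ['Missing url/file_id/base64', 'Provide url, file_id, or base64'],
  ['Unsupported content block type: video', 'Use Gemini or configure customFrameExtractor'],
];

// Returns the remedy for the first matching fragment, or undefined for unknown errors.
function suggestRemedy(err: unknown): string | undefined {
  const message = err instanceof Error ? err.message : String(err);
  for (let i = 0; i < REMEDIES.length; i++) {
    if (message.indexOf(REMEDIES[i][0]) !== -1) return REMEDIES[i][1];
  }
  return undefined;
}
```

A typical use is inside a `catch` block around the send call: log `suggestRemedy(err)` when it is defined and rethrow otherwise.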
 
 ---
 
-## Complete Example
+## Complete Examples
+
+### Image Analysis Example
 
 ```typescript
-import { Agent, AnthropicProvider, JSONStore, ContentBlock } from '@shareai-lab/kode-sdk';
+import { Agent, GeminiProvider, JSONStore, ContentBlock } from '@shareai-lab/kode-sdk';
 import * as fs from 'fs';
 
 async function analyzeImage() {
-  const provider = new AnthropicProvider(
-    process.env.ANTHROPIC_API_KEY!,
-    'claude-sonnet-4-20250514',
+  const provider = new GeminiProvider(
+    process.env.GEMINI_API_KEY!,
+    'gemini-2.0-flash-exp',
     undefined,
     undefined,
     {
@@ -294,7 +388,6 @@ async function analyzeImage() {
     modelFactory: () => provider,
   });
 
-  // Read and send image
   const imageBuffer = fs.readFileSync('./photo.jpg');
   const base64 = imageBuffer.toString('base64');
 
@@ -303,14 +396,48 @@ async function analyzeImage() {
     { type: 'image', base64, mime_type: 'image/jpeg' }
   ];
 
-  for await (const envelope of agent.subscribe(['progress'])) {
+  // Use chatStream for streaming responses
+  for await (const envelope of agent.chatStream(content)) {
     if (envelope.event.type === 'text_chunk') {
       process.stdout.write(envelope.event.delta);
     }
     if (envelope.event.type === 'done') break;
   }
+}
+```
+
+### Audio Transcription Example
+
+```typescript
+async function transcribeAudio() {
+  const audioBuffer = fs.readFileSync('./speech.wav');
+  const base64 = audioBuffer.toString('base64');
+
+  const content: ContentBlock[] = [
+    { type: 'text', text: 'Please transcribe this audio and identify the speaker\'s emotion.' },
+    { type: 'audio', base64, mime_type: 'audio/wav' }
+  ];
+
+  const response = await agent.chat(content);
+  console.log(response.text);
+}
+```
+
+### Video Analysis Example
+
+```typescript
+async function analyzeVideo() {
+  const videoBuffer = fs.readFileSync('./clip.mp4');
+  const base64 = videoBuffer.toString('base64');
+
+  const content: ContentBlock[] = [
+    { type: 'text', text: 'What is happening in this video? Please describe in detail.' },
+    { type: 'video', base64, mime_type: 'video/mp4' }
+  ];
 
-  await agent.send(content);
+  // Note: Only Gemini supports video
+  const response = await agent.chat(content);
+  console.log(response.text);
 }
 ```
 