|
99 | 99 | "source": [ |
100 | 100 | "## Overview\n", |
101 | 101 | "\n", |
102 | | - "The Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. This tutorial demonstrates the following simple examples to help you get started with the Live API in Vertex AI.\n", |
103 | | - "\n", |
104 | | - "- Text-to-text generation\n", |
105 | | - "- Text-to-audio generation\n", |
106 | | - "- Text-to-audio conversation\n", |
107 | | - "- Function calling\n", |
108 | | - "- Code execution\n", |
109 | | - "- Audio transcription\n", |
110 | | - "\n", |
111 | | - "See the [Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more details." |
| 102 | + "The Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. \n", |
| 103 | + "\n", |
| 104 | + "See the [Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more details.\n", |
| 105 | + "\n", |
| 106 | + "This tutorial demonstrates the following simple examples to help you get started with the Live API in Vertex AI using [WebSockets](https://en.wikipedia.org/wiki/WebSocket).\n", |
| 107 | + "\n", |
| 108 | + "- Using Gemini 2.0 Flash\n", |
| 109 | + " - Text-to-text generation\n", |
| 110 | + " - Text-to-audio generation\n", |
| 111 | + " - Text-to-audio conversation\n", |
| 112 | + " - Function calling\n", |
| 113 | + " - Code execution\n", |
| 114 | + " - Audio transcription\n", |
| 115 | + "- Using Gemini 2.5 Flash native audio dialog\n", |
| 116 | + " - Proactive audio\n", |
| 117 | + " - Affective dialog" |
112 | 118 | ] |
113 | 119 | }, |
114 | 120 | { |
|
217 | 223 | "bearer_token = !gcloud auth application-default print-access-token" |
218 | 224 | ] |
219 | 225 | }, |
220 | | - { |
221 | | - "cell_type": "markdown", |
222 | | - "metadata": { |
223 | | - "id": "5M7EKckIYVFy" |
224 | | - }, |
225 | | - "source": [ |
226 | | - "### Use the Gemini 2.0 Flash model\n", |
227 | | - "\n", |
228 | | - "Live API is a new capability introduced with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2)." |
229 | | - ] |
230 | | - }, |
231 | | - { |
232 | | - "cell_type": "code", |
233 | | - "execution_count": 26, |
234 | | - "metadata": { |
235 | | - "id": "-coEslfWPrxo" |
236 | | - }, |
237 | | - "outputs": [], |
238 | | - "source": [ |
239 | | - "MODEL_ID = \"gemini-2.0-flash-live-preview-04-09\" # @param {type: \"string\"}\n", |
240 | | - "\n", |
241 | | - "MODEL = (\n", |
242 | | - " f\"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}\"\n", |
243 | | - ")" |
244 | | - ] |
245 | | - }, |
246 | 226 | { |
247 | 227 | "cell_type": "markdown", |
248 | 228 | "metadata": { |
|
271 | 251 | { |
272 | 252 | "cell_type": "markdown", |
273 | 253 | "metadata": { |
274 | | - "id": "k9jAArxzClXz" |
| 254 | + "id": "5M7EKckIYVFy" |
| 255 | + }, |
| 256 | + "source": [ |
| 257 | + "## Using Gemini 2.0 Flash\n", |
| 258 | + "\n", |
| 259 | + "The Live API is a new capability introduced with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2)." |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "cell_type": "code", |
| 264 | + "execution_count": 26, |
| 265 | + "metadata": { |
| 266 | + "id": "-coEslfWPrxo" |
275 | 267 | }, |
| 268 | + "outputs": [], |
276 | 269 | "source": [ |
277 | | - "## Use the Live API\n", |
| 270 | + "MODEL_ID = \"gemini-2.0-flash-live-preview-04-09\" # @param {type: \"string\"}\n", |
278 | 271 | "\n", |
279 | | - "The Live API is a stateful API that uses [WebSockets](https://en.wikipedia.org/wiki/WebSocket). This section shows some basic examples of how to use the Live API for text-to-text and text-to-audio generation." |
| 272 | + "MODEL = (\n", |
| 273 | + " f\"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}\"\n", |
| 274 | + ")" |
280 | 275 | ] |
281 | 276 | }, |
282 | 277 | { |
|
1290 | 1285 | " )" |
1291 | 1286 | ] |
1292 | 1287 | }, |
| 1288 | + { |
| 1289 | + "cell_type": "markdown", |
| 1290 | + "metadata": { |
| 1291 | + "id": "73abaaf93010" |
| 1292 | + }, |
| 1293 | + "source": [ |
| 1294 | + "## Using Gemini 2.5 Flash native audio dialog\n", |
| 1295 | + "\n", |
| 1296 | + "\n", |
| 1297 | + "Gemini 2.5 Flash with the Live API supports native audio dialog. In addition to the standard Live API features, this model offers:\n", |
| 1298 | + "\n", |
| 1299 | + "- Enhanced voice quality and adaptability\n", |
| 1300 | + "- Proactive audio\n", |
| 1301 | + "- Affective dialog\n", |
| 1302 | + "\n", |
| 1303 | + "\n", |
| 1304 | + "**Note:** These capabilities are currently available only in private preview." |
| 1305 | + ] |
| 1306 | + }, |
| 1307 | + { |
| 1308 | + "cell_type": "code", |
| 1309 | + "execution_count": null, |
| 1310 | + "metadata": { |
| 1311 | + "id": "b75d699bbbb6" |
| 1312 | + }, |
| 1313 | + "outputs": [], |
| 1314 | + "source": [ |
| 1315 | + "MODEL_ID = \"gemini-2.5-flash-preview-native-audio-dialog\" # @param {type: \"string\"}\n", |
| 1316 | + "\n", |
| 1317 | + "MODEL = (\n", |
| 1318 | + " f\"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}\"\n", |
| 1319 | + ")" |
| 1320 | + ] |
| 1321 | + }, |
| 1322 | + { |
| 1323 | + "cell_type": "markdown", |
| 1324 | + "metadata": { |
| 1325 | + "id": "9904f4fbc2ab" |
| 1326 | + }, |
| 1327 | + "source": [ |
| 1328 | + "### **Example 8**: Proactive audio\n", |
| 1329 | + "\n", |
| 1330 | + "\n", |
| 1331 | + "When proactive audio is enabled, the model responds only when a response is relevant. It generates text transcripts and audio responses only for queries directed to the device, and it stays silent for queries that are not directed to the device." |
| 1332 | + ] |
| 1333 | + }, |
| 1334 | + { |
| 1335 | + "cell_type": "code", |
| 1336 | + "execution_count": null, |
| 1337 | + "metadata": { |
| 1338 | + "id": "b86a89dda3ae" |
| 1339 | + }, |
| 1340 | + "outputs": [], |
| 1341 | + "source": [ |
| 1342 | + "# Set model generation_config\n", |
| 1343 | + "GENERATION_CONFIG = {\n", |
| 1344 | + " \"response_modalities\": [\"AUDIO\"],\n", |
| 1345 | + "}\n", |
| 1346 | + "\n", |
| 1347 | + "\n", |
| 1348 | + "headers = {\n", |
| 1349 | + " \"Content-Type\": \"application/json\",\n", |
| 1350 | + " \"Authorization\": f\"Bearer {bearer_token[0]}\",\n", |
| 1351 | + "}\n", |
| 1352 | + "\n", |
| 1353 | + "# Connect to the server\n", |
| 1354 | + "async with connect(SERVICE_URL, additional_headers=headers) as ws:\n", |
| 1355 | + " # Set up the session\n", |
| 1356 | + " await ws.send(\n", |
| 1357 | + " json.dumps(\n", |
| 1358 | + " {\n", |
| 1359 | + " \"setup\": {\n", |
| 1360 | + " \"model\": MODEL,\n", |
| 1361 | + " \"generation_config\": GENERATION_CONFIG,\n", |
| 1362 | + " \"input_audio_transcription\": {},\n", |
| 1363 | + " \"output_audio_transcription\": {},\n", |
| 1364 | + " \"proactivity\": {\"proactive_audio\": True},\n", |
| 1365 | + " }\n", |
| 1366 | + " }\n", |
| 1367 | + " )\n", |
| 1368 | + " )\n", |
| 1369 | + "\n", |
| 1370 | + " # Receive setup response\n", |
| 1371 | + " raw_response = await ws.recv(decode=False)\n", |
| 1372 | + " setup_response = json.loads(raw_response.decode(\"ascii\"))\n", |
| 1373 | + "\n", |
| 1374 | + " # Send text message\n", |
| 1375 | + " text_input = \"Hello? Gemini are you there?\"\n", |
| 1376 | + " display(Markdown(f\"**Input:** {text_input}\"))\n", |
| 1377 | + "\n", |
| 1378 | + " msg = {\n", |
| 1379 | + " \"client_content\": {\n", |
| 1380 | + " \"turns\": [{\"role\": \"user\", \"parts\": [{\"text\": text_input}]}],\n", |
| 1381 | + " \"turn_complete\": True,\n", |
| 1382 | + " }\n", |
| 1383 | + " }\n", |
| 1384 | + "\n", |
| 1385 | + " await ws.send(json.dumps(msg))\n", |
| 1386 | + "\n", |
| 1387 | + " responses = []\n", |
| 1388 | + " input_transcriptions = []\n", |
| 1389 | + " output_transcriptions = []\n", |
| 1390 | + "\n", |
| 1391 | + " # Receive chunks of server response\n", |
| 1392 | + " async for raw_response in ws:\n", |
| 1393 | + " response = json.loads(raw_response.decode())\n", |
| 1394 | + " server_content = response.pop(\"serverContent\", None)\n", |
| 1395 | + " if server_content is None:\n", |
| 1396 | + " break\n", |
| 1397 | + "\n", |
| 1398 | + " if (\n", |
| 1399 | + " input_transcription := server_content.get(\"inputTranscription\")\n", |
| 1400 | + " ) is not None:\n", |
| 1401 | + " if (text := input_transcription.get(\"text\")) is not None:\n", |
| 1402 | + " input_transcriptions.append(text)\n", |
| 1403 | + " if (\n", |
| 1404 | + " output_transcription := server_content.get(\"outputTranscription\")\n", |
| 1405 | + " ) is not None:\n", |
| 1406 | + " if (text := output_transcription.get(\"text\")) is not None:\n", |
| 1407 | + " output_transcriptions.append(text)\n", |
| 1408 | + "\n", |
| 1409 | + " model_turn = server_content.pop(\"modelTurn\", None)\n", |
| 1410 | + " if model_turn is not None:\n", |
| 1411 | + " parts = model_turn.pop(\"parts\", None)\n", |
| 1412 | + " if parts is not None:\n", |
| 1413 | + " for part in parts:\n", |
| 1414 | + " pcm_data = base64.b64decode(part[\"inlineData\"][\"data\"])\n", |
| 1415 | + " responses.append(np.frombuffer(pcm_data, dtype=np.int16))\n", |
| 1416 | + "\n", |
| 1417 | + " # End of turn\n", |
| 1418 | + " turn_complete = server_content.pop(\"turnComplete\", None)\n", |
| 1419 | + " if turn_complete:\n", |
| 1420 | + " break\n", |
| 1421 | + "\n", |
| 1422 | + " if input_transcriptions:\n", |
| 1423 | + " display(Markdown(f\"**Input transcription >** {''.join(input_transcriptions)}\"))\n", |
| 1424 | + "\n", |
| 1425 | + " if responses:\n", |
| 1426 | + " # Play the returned audio message\n", |
| 1427 | + " display(Audio(np.concatenate(responses), rate=24000, autoplay=True))\n", |
| 1428 | + "\n", |
| 1429 | + " if output_transcriptions:\n", |
| 1430 | + " display(\n", |
| 1431 | + " Markdown(f\"**Output transcription >** {''.join(output_transcriptions)}\")\n", |
| 1432 | + " )" |
| 1433 | + ] |
| 1434 | + }, |
| 1435 | + { |
| 1436 | + "cell_type": "markdown", |
| 1437 | + "metadata": { |
| 1438 | + "id": "c25132281c4e" |
| 1439 | + }, |
| 1440 | + "source": [ |
| 1441 | + "### **Example 9**: Affective dialog\n", |
| 1442 | + "\n", |
| 1443 | + "When affective dialog is enabled, the model can understand and respond appropriately to users' emotional expressions for more nuanced conversations." |
| 1444 | + ] |
| 1445 | + }, |
| 1446 | + { |
| 1447 | + "cell_type": "code", |
| 1448 | + "execution_count": null, |
| 1449 | + "metadata": { |
| 1450 | + "id": "ae07905ba242" |
| 1451 | + }, |
| 1452 | + "outputs": [], |
| 1453 | + "source": [ |
| 1454 | + "# Set model generation_config\n", |
| 1455 | + "GENERATION_CONFIG = {\n", |
| 1456 | + " \"response_modalities\": [\"AUDIO\"],\n", |
| 1457 | + " \"enable_affective_dialog\": True,\n", |
| 1458 | + "}\n", |
| 1459 | + "\n", |
| 1460 | + "headers = {\n", |
| 1461 | + " \"Content-Type\": \"application/json\",\n", |
| 1462 | + " \"Authorization\": f\"Bearer {bearer_token[0]}\",\n", |
| 1463 | + "}\n", |
| 1464 | + "\n", |
| 1465 | + "# Connect to the server\n", |
| 1466 | + "async with connect(SERVICE_URL, additional_headers=headers) as ws:\n", |
| 1467 | + " # Set up the session\n", |
| 1468 | + " await ws.send(\n", |
| 1469 | + " json.dumps(\n", |
| 1470 | + " {\n", |
| 1471 | + " \"setup\": {\n", |
| 1472 | + " \"model\": MODEL,\n", |
| 1473 | + " \"generation_config\": GENERATION_CONFIG,\n", |
| 1474 | + " }\n", |
| 1475 | + " }\n", |
| 1476 | + " )\n", |
| 1477 | + " )\n", |
| 1478 | + "\n", |
| 1479 | + " # Receive setup response\n", |
| 1480 | + " raw_response = await ws.recv(decode=False)\n", |
| 1481 | + " setup_response = json.loads(raw_response.decode(\"ascii\"))\n", |
| 1482 | + "\n", |
| 1483 | + " # Send text message\n", |
| 1484 | + " text_input = \"Hello? Gemini are you there? It's really a good day!\"\n", |
| 1485 | + " display(Markdown(f\"**Input:** {text_input}\"))\n", |
| 1486 | + "\n", |
| 1487 | + " msg = {\n", |
| 1488 | + " \"client_content\": {\n", |
| 1489 | + " \"turns\": [{\"role\": \"user\", \"parts\": [{\"text\": text_input}]}],\n", |
| 1490 | + " \"turn_complete\": True,\n", |
| 1491 | + " }\n", |
| 1492 | + " }\n", |
| 1493 | + "\n", |
| 1494 | + " await ws.send(json.dumps(msg))\n", |
| 1495 | + "\n", |
| 1496 | + " responses = []\n", |
| 1497 | + "\n", |
| 1498 | + " # Receive chunks of server response\n", |
| 1499 | + " async for raw_response in ws:\n", |
| 1500 | + " response = json.loads(raw_response.decode())\n", |
| 1501 | + " server_content = response.pop(\"serverContent\", None)\n", |
| 1502 | + " if server_content is None:\n", |
| 1503 | + " break\n", |
| 1504 | + "\n", |
| 1505 | + " model_turn = server_content.pop(\"modelTurn\", None)\n", |
| 1506 | + " if model_turn is not None:\n", |
| 1507 | + " parts = model_turn.pop(\"parts\", None)\n", |
| 1508 | + " if parts is not None:\n", |
| 1509 | + " for part in parts:\n", |
| 1510 | + " pcm_data = base64.b64decode(part[\"inlineData\"][\"data\"])\n", |
| 1511 | + " responses.append(np.frombuffer(pcm_data, dtype=np.int16))\n", |
| 1512 | + "\n", |
| 1513 | + " # End of turn\n", |
| 1514 | + " turn_complete = server_content.pop(\"turnComplete\", None)\n", |
| 1515 | + " if turn_complete:\n", |
| 1516 | + " break\n", |
| 1517 | + "\n", |
| 1518 | + " # Play the returned audio message\n", |
| 1519 | + " display(Audio(np.concatenate(responses), rate=24000, autoplay=True))" |
| 1520 | + ] |
| 1521 | + }, |
1293 | 1522 | { |
1294 | 1523 | "cell_type": "markdown", |
1295 | 1524 | "metadata": { |
|
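Examples 8 and 9 send nearly identical setup messages and differ only in where the feature flag goes: `proactivity.proactive_audio` sits directly under `setup`, while `enable_affective_dialog` sits inside `generation_config`. A minimal sketch of a helper that makes the difference explicit is shown below. The function name `build_setup_message` and the project and location in the sample model path are hypothetical; the field names are the ones used in the two example cells.

```python
import json


def build_setup_message(model, proactive_audio=False, affective_dialog=False):
    """Return the JSON setup message for an audio-only session (hypothetical helper)."""
    generation_config = {"response_modalities": ["AUDIO"]}
    if affective_dialog:
        # Affective dialog is switched on inside generation_config (Example 9).
        generation_config["enable_affective_dialog"] = True

    setup = {"model": model, "generation_config": generation_config}
    if proactive_audio:
        # Proactive audio is switched on at the top level of setup (Example 8).
        setup["proactivity"] = {"proactive_audio": True}

    return json.dumps({"setup": setup})


# Placeholder project and location, for illustration only.
SAMPLE_MODEL = (
    "projects/my-project/locations/us-central1/publishers/google/models/"
    "gemini-2.5-flash-preview-native-audio-dialog"
)
proactive_msg = json.loads(build_setup_message(SAMPLE_MODEL, proactive_audio=True))
affective_msg = json.loads(build_setup_message(SAMPLE_MODEL, affective_dialog=True))
```

With such a helper, each example cell would only need `await ws.send(build_setup_message(MODEL, ...))`; the transcription fields from Example 8 are left out of the sketch to keep it short.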
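The receive loops in Examples 8 and 9 index `part["inlineData"]["data"]` directly, which assumes every part carries inline audio, and Example 9 calls `np.concatenate(responses)` even when no audio arrived. A hedged sketch of a more defensive parser for a single server message follows. The name `parse_server_message` is hypothetical, the message keys are the ones the example cells already read, and the sketch uses only the standard library, so it returns raw PCM bytes rather than a NumPy array.

```python
import base64
import json


def parse_server_message(raw):
    """Extract audio bytes, transcripts, and the end-of-turn flag from one message."""
    response = json.loads(raw)
    server_content = response.get("serverContent") or {}

    audio = b""
    model_turn = server_content.get("modelTurn") or {}
    for part in model_turn.get("parts") or []:
        # Skip parts that carry no inline audio instead of raising KeyError.
        inline = part.get("inlineData")
        if inline and "data" in inline:
            audio += base64.b64decode(inline["data"])

    return {
        "has_content": "serverContent" in response,
        "audio": audio,
        "input_text": (server_content.get("inputTranscription") or {}).get("text", ""),
        "output_text": (server_content.get("outputTranscription") or {}).get("text", ""),
        "turn_complete": bool(server_content.get("turnComplete")),
    }


# Synthetic message, for illustration only: two 16-bit samples plus a transcript.
sample = json.dumps(
    {
        "serverContent": {
            "modelTurn": {
                "parts": [
                    {"text": "ignored"},
                    {"inlineData": {"data": base64.b64encode(b"\x01\x00\x02\x00").decode()}},
                ]
            },
            "outputTranscription": {"text": "Hi"},
            "turnComplete": True,
        }
    }
).encode()
parsed = parse_server_message(sample)
no_content = parse_server_message(b'{"setupComplete": {}}')
```

In a notebook cell, the loop could append `np.frombuffer(parsed["audio"], dtype=np.int16)` only when `parsed["audio"]` is non-empty, and play the result only if at least one chunk was collected.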