diff --git a/notebooks/official/generative_ai/voyage-multimodal-3.5.ipynb b/notebooks/official/generative_ai/voyage-multimodal-3.5.ipynb new file mode 100644 index 000000000..925bfb2f4 --- /dev/null +++ b/notebooks/official/generative_ai/voyage-multimodal-3.5.ipynb @@ -0,0 +1,1324 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6b3283cdfd08" + }, + "outputs": [], + "source": [ + "# Copyright 2026 MongoDB, Inc\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "36b86ddc6744" + }, + "source": [ + "# Voyage Multimodal 3.5\n", + "\n", + "This notebook demonstrates how to deploy and use the Voyage Multimodal 3.5 embedding model.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Google
Open in Colab\n", + "
\n", + "
\n", + " \n", + " \"Google
Open in Colab Enterprise\n", + "
\n", + "
\n", + " \n", + " \"Vertex
Open in Workbench\n", + "
\n", + "
\n", + " \n", + " \"GitHub
View on GitHub\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d3d3e7fcda4f" + }, + "source": [ + "## Overview\n", + "\n", + "**Voyage Multimodal 3.5** is a state-of-the-art multimodal embedding model designed for cross-modal semantic search, retrieval-augmented generation (RAG), and intelligent AI applications. This model provides:\n", + "\n", + "* **Multimodal Understanding**: Vectorize text, images, and video individually or interleaved together\n", + "* **Cross-Modal Search**: Excellent performance for mixed-modality searches involving text and visual content\n", + "* **Flexible Dimensions**: Support for 256, 512, 1024, and 2048 dimensions via Matryoshka learning\n", + "* **Quantization Options**: Multiple quantization formats for optimal storage and performance\n", + "* **Maximum 32K tokens input**: Support for long documents and multiple media items\n", + "\n", + "### What you'll learn\n", + "\n", + "In this notebook, you will:\n", + "\n", + "* Deploy the Voyage Multimodal 3.5 model to a Vertex AI endpoint\n", + "* Generate embeddings for text, images, and video\n", + "* Create multimodal embeddings combining text and images\n", + "* Use embeddings for cross-modal semantic similarity\n", + "* Clean up resources after use\n", + "\n", + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* Vertex AI Model Garden\n", + "* Vertex AI Prediction endpoints\n", + "\n", + "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4d998a5140b2" + }, + "source": [ + "## Get started" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b92cb16aea9c" + }, + "source": [ + "### Install Vertex AI SDK for Python and other required packages\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "030faea19be1" + }, + "outputs": [], + "source": [ + "! pip3 install --upgrade --quiet google-cloud-aiplatform numpy" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "848322ec177e" + }, + "source": [ + "### Restart runtime (Colab only)\n", + "\n", + "To use the newly installed packages, you must restart the runtime on Google Colab." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b8d49bb74a53" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "if \"google.colab\" in sys.modules:\n", + "\n", + " import IPython\n", + "\n", + " app = IPython.Application.instance()\n", + " app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "780490bfb862" + }, + "source": [ + "
\n", + "⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1117fcd212f8" + }, + "source": [ + "### Authenticate your notebook environment (Colab only)\n", + "\n", + "Authenticate your environment on Google Colab.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "015bf6d5da75" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "if \"google.colab\" in sys.modules:\n", + "\n", + " from google.colab import auth\n", + "\n", + " auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "722a10c66085" + }, + "source": [ + "### Set Google Cloud project information and initialize Vertex AI SDK for Python\n", + "\n", + "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ead0f677c004" + }, + "outputs": [], + "source": [ + "# @title Setup Google Cloud project\n", + "\n", + "# Set your Google Cloud project ID and region below:\n", + "\n", + "import os\n", + "\n", + "import vertexai\n", + "\n", + "# @markdown Enter your project ID if not auto-detected:\n", + "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", + "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", + " PROJECT_ID = os.environ.get(\"GOOGLE_CLOUD_PROJECT\")\n", + "\n", + "# @markdown Select your region:\n", + "LOCATION = \"us-central1\" # @param [\"us-central1\", \"us-east1\", \"us-west1\", \"europe-west1\", \"europe-west4\", \"asia-east1\", \"asia-southeast1\"]\n", + "\n", + "print(f\"Project ID: {PROJECT_ID}\")\n", + "print(f\"Location: {LOCATION}\")\n", + "\n", + "vertexai.init(project=PROJECT_ID, location=LOCATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "23acca3ed72b" + }, + "source": [ + "## Deploy model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "790c7cc43b1c" + }, + "source": [ + "### Initialize the Model\n", + "\n", + "Initialize the Voyage Multimodal 3.5 model from Model Garden.\n", + "\n", + "Use the `list_deploy_options()` method to view the verified deployment configurations for your selected model. This helps ensure you have sufficient resources (e.g., GPU quota) available to deploy it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "56d52ebdc7c3" + }, + "outputs": [], + "source": [ + "from vertexai import model_garden\n", + "\n", + "MODEL_NAME = \"mongodb/voyage-multimodal-3.5@latest\"\n", + "model = model_garden.OpenModel(MODEL_NAME)\n", + "\n", + "deploy_options = model.list_deploy_options(concise=True)\n", + "print(deploy_options)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c2f4541829d7" + }, + "source": [ + "### Deploy the Model\n", + "\n", + "Now that you've reviewed the deployment options, use the `deploy()` method to serve the Voyage Multimodal 3.5 model to a Vertex AI endpoint. Deployment time may vary depending on infrastructure requirements.\n", + "\n", + "You can either deploy a new model or use an existing endpoint. Set `use_dedicated_endpoint` to `True` as voyage-multimodal-3.5 requires a [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c797fed2dd9a" + }, + "outputs": [], + "source": [ + "# @title Deploy or connect to endpoint\n", + "# @markdown Choose whether to deploy a new model or use an existing endpoint:\n", + "\n", + "deployment_option = \"deploy_new\" # @param [\"deploy_new\", \"use_existing\"]\n", + "\n", + "# @markdown ---\n", + "# @markdown If using existing endpoint, provide the endpoint ID:\n", + "ENDPOINT_ID = \"\" # @param {type:\"string\"}\n", + "\n", + "if deployment_option == \"deploy_new\":\n", + " print(\"Deploying new model...\")\n", + " endpoint = model.deploy(\n", + " accept_eula=True,\n", + " use_dedicated_endpoint=True,\n", + " )\n", + " print(f\"Endpoint deployed: {endpoint.display_name}\")\n", + " print(f\"Endpoint resource name: {endpoint.resource_name}\")\n", + "else:\n", + " if not ENDPOINT_ID:\n", + " raise ValueError(\"Please provide an ENDPOINT_ID when using existing endpoint\")\n", + "\n", + " from google.cloud import aiplatform\n", + "\n", + " print(f\"Connecting to existing endpoint: {ENDPOINT_ID}\")\n", + " endpoint = aiplatform.Endpoint(\n", + " endpoint_name=f\"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/{ENDPOINT_ID}\"\n", + " )\n", + " print(f\"Using endpoint: {endpoint.display_name}\")\n", + " print(f\"Endpoint resource name: {endpoint.resource_name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5c2a64096489" + }, + "source": [ + "### Advanced Deployment Configuration (Optional)\n", + "\n", + "To further customize your deployment, you can configure:\n", + "\n", + "- **Compute Resources**: Machine type, replica count (min/max), accelerator type and quantity.\n", + "- **Infrastructure**: Use Spot VMs, reservation affinity, or dedicated endpoints.\n", + "- **Serving Container**: Customize container image, ports, health checks, and environment variables.\n", + "\n", + "See the [Model Garden SDK README](https://github.com/googleapis/python-aiplatform/blob/main/vertexai/model_garden/README.md) for advanced configuration options." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f49e74ef5207" + }, + "source": [ + "## Generate embeddings with Voyage Multimodal 3.5\n", + "\n", + "Now that the model is deployed, you can generate embeddings for text, images, video, or any combination of these modalities.\n", + "\n", + "The multimodal API uses a different input format than text-only models. 
Each input is an object with a `content` array containing typed elements:\n", + "\n", + "- **Text**: `{\"type\": \"text\", \"text\": \"your text here\"}`\n", + "- **Image URL**: `{\"type\": \"image_url\", \"image_url\": \"https://...\"}`\n", + "- **Image Base64**: `{\"type\": \"image_base64\", \"image_base64\": \"data:image/jpeg;base64,...\"}`\n", + "- **Video URL**: `{\"type\": \"video_url\", \"video_url\": \"https://...\"}`\n", + "- **Video Base64**: `{\"type\": \"video_base64\", \"video_base64\": \"data:video/mp4;base64,...\"}`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3debc925a049" + }, + "source": [ + "### Text embeddings\n", + "\n", + "Generate embeddings for text inputs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3fb5010b814e" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Text inputs to embed\n", + "texts = [\n", + " \"A photo of a golden retriever playing in the park.\",\n", + " \"Machine learning enables computers to learn from data.\",\n", + " \"A beautiful sunset over the ocean with orange and purple skies.\",\n", + " \"The quarterly financial report shows strong revenue growth.\",\n", + "]\n", + "\n", + "# Format inputs for multimodal API\n", + "inputs = [{\"content\": [{\"type\": \"text\", \"text\": t}]} for t in texts]\n", + "\n", + "# Prepare the request\n", + "body = {\"model\": \"voyage-multimodal-3.5\", \"inputs\": inputs, \"input_type\": \"document\"}\n", + "\n", + "response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "\n", + "# Extract embeddings\n", + "result = response.json()\n", + "embeddings = [item[\"embedding\"] for item in result[\"data\"]]\n", + "\n", + "print(f\"Number of texts embedded: {len(embeddings)}\")\n", + "print(f\"Embedding dimension: {len(embeddings[0])}\")\n", + "print(f\"\\nFirst embedding (first 5 values): {embeddings[0][:5]}\")\n", + "print(f\"\\nUsage: {result.get('usage', {})}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2444542a881d" + }, + "source": [ + "### Image embeddings\n", + "\n", + "Generate embeddings for images. 
You can provide images via URL or base64-encoded data.\n", + "\n", + "#### Using image URLs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f76afde2df11" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Example image from Voyage AI's documentation\n", + "image_url = \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\"\n", + "\n", + "# Format input with image URL\n", + "inputs = [{\"content\": [{\"type\": \"image_url\", \"image_url\": image_url}]}]\n", + "\n", + "body = {\"model\": \"voyage-multimodal-3.5\", \"inputs\": inputs, \"input_type\": \"document\"}\n", + "\n", + "response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "\n", + "result = response.json()\n", + "image_embedding = result[\"data\"][0][\"embedding\"]\n", + "\n", + "print(f\"Embedding dimension: {len(image_embedding)}\")\n", + "print(f\"Embedding (first 5 values): {image_embedding[:5]}\")\n", + "print(f\"\\nUsage: {result.get('usage', {})}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c4c375633532" + }, + "source": [ + "#### Using base64-encoded images\n", + "\n", + "For local images, use Google Colab's file upload interface:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2045aa376a85" + }, + "outputs": [], + "source": [ + "import base64\n", + "import json\n", + "import sys\n", + "\n", + "\n", + "def encode_image_base64(image_bytes: bytes, filename: str) -> str:\n", + " \"\"\"Encode image bytes as a base64 data URI.\"\"\"\n", + " # Determine MIME type from extension\n", + " extension = filename.lower().split(\".\")[-1]\n", + " mime_types = {\n", + " \"jpg\": \"image/jpeg\",\n", + " \"jpeg\": \"image/jpeg\",\n", + " \"png\": \"image/png\",\n", + " \"gif\": \"image/gif\",\n", + " \"webp\": \"image/webp\",\n", + " }\n", + " mime_type = mime_types.get(extension, \"image/jpeg\")\n", + "\n", + " b64_str = base64.b64encode(image_bytes).decode(\"ascii\")\n", + " return f\"data:{mime_type};base64,{b64_str}\"\n", + "\n", + "\n", + "# Upload image file (Colab only)\n", + "if \"google.colab\" in sys.modules:\n", + " from google.colab import files\n", + "\n", + " print(\"Please upload an image file (JPG, PNG, etc.):\")\n", + " uploaded = files.upload()\n", + "\n", + " if uploaded:\n", + " # Get the first uploaded file\n", + " filename = list(uploaded.keys())[0]\n", + " image_bytes = uploaded[filename]\n", + "\n", + " # Encode and generate embedding\n", + " image_base64 = encode_image_base64(image_bytes, filename)\n", + "\n", + " body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\"content\": [{\"type\": \"image_base64\", \"image_base64\": image_base64}]}\n", + " ],\n", + " \"input_type\": \"document\",\n", + " }\n", + "\n", + " response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + " )\n", + "\n", + " result = response.json()\n", + " embedding = result[\"data\"][0][\"embedding\"]\n", + "\n", + " print(f\"\\nEmbedding dimension: {len(embedding)}\")\n", + " print(f\"Embedding (first 5 values): {embedding[:5]}\")\n", + " print(f\"\\nUsage: {result.get('usage', {})}\")\n", + "else:\n", + " print(\"File upload is only available in Google Colab.\")\n", + " 
print(\n", + " \"For other environments, use the encode_image_base64() helper function with file bytes.\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "62f2249de9dd" + }, + "source": [ + "### Video embeddings\n", + "\n", + "Generate embeddings for video content. Videos must be:\n", + "- **Format**: MP4 container\n", + "- **Size**: Maximum 20 MB\n", + "- **Frames**: At least 2 frames\n", + "\n", + "#### Using video URLs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0aca2573178b" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Example video URL (Cooking video, ~500kb)\n", + "video_url = \"https://file.garden/aTiKu4GB_i5vfop6/example_video_01.mp4\"\n", + "\n", + "# Format input with video URL\n", + "inputs = [{\"content\": [{\"type\": \"video_url\", \"video_url\": video_url}]}]\n", + "\n", + "body = {\"model\": \"voyage-multimodal-3.5\", \"inputs\": inputs, \"input_type\": \"document\"}\n", + "\n", + "response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "\n", + "result = response.json()\n", + "video_embedding = result[\"data\"][0][\"embedding\"]\n", + "usage = result.get(\"usage\", {})\n", + "\n", + "print(f\"Embedding dimension: {len(video_embedding)}\")\n", + "print(f\"Embedding (first 5 values): {video_embedding[:5]}\")\n", + "print(\"\\nUsage:\")\n", + "print(f\" Total tokens: {usage.get('total_tokens')}\")\n", + "print(f\" Video pixels: {usage.get('video_pixels')}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bc129e861b42" + }, + "source": [ + "#### Using base64-encoded videos\n", + "\n", + "For local videos, use Google Colab's file upload interface:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c9fa7e05270e" + }, + "outputs": [], + "source": [ + "import json\n", + "import sys\n", + "\n", + "\n", + "def encode_video_base64(video_bytes: bytes) -> str:\n", + " \"\"\"Encode video bytes as a base64 data URI.\"\"\"\n", + " b64_str = base64.b64encode(video_bytes).decode(\"ascii\")\n", + " return f\"data:video/mp4;base64,{b64_str}\"\n", + "\n", + "\n", + "# Upload video file (Colab only)\n", + "if \"google.colab\" in sys.modules:\n", + " from google.colab import files\n", + "\n", + " print(\"Please upload an MP4 video file (max 20 MB):\")\n", + " uploaded = files.upload()\n", + "\n", + " if uploaded:\n", + " # Get the first uploaded file\n", + " filename = list(uploaded.keys())[0]\n", + " video_bytes = uploaded[filename]\n", + "\n", + " file_size_mb = len(video_bytes) / (1024 * 1024)\n", + " print(f\"\\nUploaded: {filename} ({file_size_mb:.2f} MB)\")\n", + "\n", + " if file_size_mb > 20:\n", + " print(\"Warning: File exceeds 20 MB limit and may be rejected by the API.\")\n", + "\n", + " # Encode and generate embedding\n", + " video_base64 = encode_video_base64(video_bytes)\n", + "\n", + " body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\"content\": [{\"type\": \"video_base64\", \"video_base64\": video_base64}]}\n", + " ],\n", + " \"input_type\": \"document\",\n", + " }\n", + "\n", + " response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + " )\n", + "\n", + " result = response.json()\n", + " embedding = 
result[\"data\"][0][\"embedding\"]\n", + " usage = result.get(\"usage\", {})\n", + "\n", + " print(f\"\\nEmbedding dimension: {len(embedding)}\")\n", + " print(f\"Embedding (first 5 values): {embedding[:5]}\")\n", + " print(f\"Total tokens: {usage.get('total_tokens')}\")\n", + " print(f\"Video pixels: {usage.get('video_pixels')}\")\n", + "else:\n", + " print(\"File upload is only available in Google Colab.\")\n", + " print(\n", + " \"For other environments, use the encode_video_base64() helper function with video bytes.\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "43d7d809f46d" + }, + "source": [ + "### Multimodal embeddings (text + images + video)\n", + "\n", + "A key feature of Voyage Multimodal 3.5 is the ability to create embeddings from interleaved text, images, and video. This is useful for rich documents that combine multiple modalities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "062b0a413e05" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Create a multimodal input combining text, image, and video\n", + "multimodal_input = {\n", + " \"content\": [\n", + " {\"type\": \"text\", \"text\": \"This is a banana.\"},\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\",\n", + " },\n", + " {\n", + " \"type\": \"video_url\",\n", + " \"video_url\": \"https://file.garden/aTiKu4GB_i5vfop6/example_video_01.mp4\",\n", + " },\n", + " ]\n", + "}\n", + "\n", + "body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [multimodal_input],\n", + " \"input_type\": \"document\",\n", + "}\n", + "\n", + "response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "\n", + "result = response.json()\n", + "multimodal_embedding = result[\"data\"][0][\"embedding\"]\n", + "usage = result.get(\"usage\", {})\n", + "\n", + "print(f\"Multimodal embedding dimension: {len(multimodal_embedding)}\")\n", + "print(f\"Embedding (first 5 values): {multimodal_embedding[:5]}\")\n", + "print(\"\\nUsage:\")\n", + "print(f\" Text tokens: {usage.get('text_tokens')}\")\n", + "print(f\" Image pixels: {usage.get('image_pixels')}\")\n", + "print(f\" Video pixels: {usage.get('video_pixels')}\")\n", + "print(f\" Total tokens: {usage.get('total_tokens')}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b977262dff32" + }, + "source": [ + "### Cross-modal semantic similarity\n", + "\n", + "One of the most powerful features of multimodal embeddings is the ability to search across modalities. You can use a text query to find relevant images or videos." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0c44772f12f0" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "import numpy as np\n", + "\n", + "\n", + "def cosine_similarity(vec1, vec2):\n", + " \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n", + " vec1 = np.array(vec1)\n", + " vec2 = np.array(vec2)\n", + " return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))\n", + "\n", + "\n", + "# Text queries\n", + "queries = [\"A yellow fruit\", \"A green vegetable\"]\n", + "\n", + "# Documents to search (image and video)\n", + "documents = [\n", + " {\n", + " \"type\": \"image_url\",\n", + " \"image_url\": \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\",\n", + " \"description\": \"Banana image\",\n", + " },\n", + " {\n", + " \"type\": \"video_url\",\n", + " \"video_url\": \"https://file.garden/aTiKu4GB_i5vfop6/example_video_01.mp4\",\n", + " \"description\": \"Cooking video\",\n", + " },\n", + "]\n", + "\n", + "# Get document embeddings once (use input_type=\"document\" for documents to be searched)\n", + "doc_inputs = []\n", + "for doc in documents:\n", + " media_type = doc[\"type\"]\n", + " media_url = doc[media_type]\n", + " content_item = {\"type\": media_type, media_type: media_url}\n", + " doc_inputs.append({\"content\": [content_item]})\n", + "\n", + "doc_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": doc_inputs,\n", + " \"input_type\": \"document\",\n", + "}\n", + "doc_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(doc_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "doc_embeddings = [item[\"embedding\"] for item in doc_response.json()[\"data\"]]\n", + "\n", + "# Test each query against the documents\n", + "for query_text in queries:\n", + " # Get query embedding (use input_type=\"query\" for search queries)\n", + " query_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": query_text}]}],\n", + " \"input_type\": \"query\",\n", + " }\n", + " query_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(query_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + " )\n", + " query_embedding = query_response.json()[\"data\"][0][\"embedding\"]\n", + "\n", + " # Calculate cross-modal similarities\n", + " print(f'Query: \"{query_text}\"')\n", + " print(\"Cross-modal similarity scores:\")\n", + " for doc, embedding in zip(documents, doc_embeddings):\n", + " similarity = cosine_similarity(query_embedding, embedding)\n", + " print(f\" {similarity:.4f} - {doc['description']}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "02eaf30726e8" + }, + "source": [ + "## Advanced parameters\n", + "\n", + "Voyage Multimodal 3.5 supports several parameters to customize embedding generation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3de91676bceb" + }, + "source": [ + "### Understanding input_type: Query vs Document\n", + "\n", + "The `input_type` parameter optimizes embeddings for retrieval tasks:\n", + "\n", + "* **`query`**: Use this when the input represents a search query. 
The model prepends \"Represent the query for retrieving supporting documents: \" to optimize for retrieval.\n", + "* **`document`**: Use this when the input represents content to be indexed. The model prepends \"Represent the document for retrieval: \" to optimize for indexing.\n", + "* **`null`** (default): No special prompt is added. Use for general-purpose embeddings.\n", + "\n", + "**Best Practice**: For retrieval applications, use `input_type=\"query\"` for search queries and `input_type=\"document\"` for the content you're indexing. Embeddings generated with and without the `input_type` argument are compatible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aff33ce1d594" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Example: Different input types for retrieval\n", + "search_query = \"What does a banana look like?\"\n", + "\n", + "# Query embedding (for search)\n", + "query_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": search_query}]}],\n", + " \"input_type\": \"query\", # Optimized for search queries\n", + "}\n", + "query_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(query_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "query_result = query_response.json()\n", + "\n", + "# Document embedding (for indexing)\n", + "doc_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": search_query}]}],\n", + " \"input_type\": \"document\", # Optimized for documents\n", + "}\n", + "doc_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(doc_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "doc_result = doc_response.json()\n", + "\n", + "# General-purpose embedding (no input_type)\n", + "general_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": search_query}]}],\n", + " # input_type defaults to null\n", + "}\n", + "general_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(general_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "general_result = general_response.json()\n", + "\n", + "print(f'Text: \"{search_query}\"\\n')\n", + "print(f\"Query embedding (first 5): {query_result['data'][0]['embedding'][:5]}\")\n", + "print(f\"Document embedding (first 5): {doc_result['data'][0]['embedding'][:5]}\")\n", + "print(f\"General embedding (first 5): {general_result['data'][0]['embedding'][:5]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e50706d6244a" + }, + "source": [ + "### Truncation\n", + "\n", + "The `truncation` parameter controls how the model handles inputs that exceed the context window (32,000 tokens):\n", + "\n", + "* **`true`** (default): Automatically truncate inputs that exceed the context limit. If truncation happens in the middle of an image, the entire image will be discarded.\n", + "* **`false`**: Return an error if any input exceeds the context limit.\n", + "\n", + "When truncation occurs, you may see a warning in the response headers." 
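 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c9d04b2e7f18" + }, + "source": [ + "The cell below demonstrates `truncation: true`. If you would rather have over-length requests fail than be silently truncated, set the flag to `false`. A minimal sketch follows; `oversized_input` is a hypothetical input exceeding the context limit, and the exact failure surface (an error payload versus a raised exception) may vary, so both are handled:\n", + "\n", + "```python\n", + "body = {\n", + "    \"model\": \"voyage-multimodal-3.5\",\n", + "    \"inputs\": [oversized_input],  # hypothetical input exceeding 32K tokens\n", + "    \"input_type\": \"document\",\n", + "    \"truncation\": False,  # reject over-length inputs instead of truncating\n", + "}\n", + "try:\n", + "    response = endpoint.invoke(\n", + "        request_path=\"/multimodalembeddings\",\n", + "        body=json.dumps(body).encode(\"utf-8\"),\n", + "        headers={\"Content-Type\": \"application/json\"},\n", + "    )\n", + "    print(response.json())  # expect an error payload instead of embeddings\n", + "except Exception as err:\n", + "    print(f\"Request rejected: {err}\")\n", + "```"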
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d3710b176d36" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Example: Create input that exceeds context limit to trigger truncation\n", + "# We'll repeat a video URL multiple times to exceed 32k tokens\n", + "video_url = \"https://file.garden/aTiKu4GB_i5vfop6/example_video_01.mp4\"\n", + "\n", + "# Create input with 4 videos (should exceed 32k token limit)\n", + "truncation_input = {\n", + " \"content\": [\n", + " {\"type\": \"video_url\", \"video_url\": video_url},\n", + " {\"type\": \"video_url\", \"video_url\": video_url},\n", + " {\"type\": \"video_url\", \"video_url\": video_url},\n", + " {\"type\": \"video_url\", \"video_url\": video_url},\n", + " ]\n", + "}\n", + "\n", + "body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [truncation_input],\n", + " \"input_type\": \"document\",\n", + " \"truncation\": True, # Enable automatic truncation (this is the default)\n", + "}\n", + "\n", + "response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "\n", + "result = response.json()\n", + "usage = result.get(\"usage\", {})\n", + "\n", + "print(\"Embedding generated with truncation enabled\")\n", + "print(f\"Dimension: {len(result['data'][0]['embedding'])}\")\n", + "print(\"\\nUsage:\")\n", + "print(f\" Total tokens: {usage.get('total_tokens')}\")\n", + "print(f\" Video pixels: {usage.get('video_pixels')}\")\n", + "\n", + "# Check response headers for truncation warning\n", + "if hasattr(response, \"headers\"):\n", + " warning = response.headers.get(\"x-api-warning\", response.headers.get(\"warning\"))\n", + " if warning:\n", + " print(f\"\\nTruncation warning: {warning}\")\n", + " else:\n", + " print(\"\\nNo truncation warning detected (may have fit within limit)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "991a709ced51" + }, + "source": [ + "### Output encoding\n", + "\n", + "The `output_encoding` parameter controls the format of the embedding output:\n", + "\n", + "* **`null`** (default): Embeddings are returned as a list of floating-point numbers.\n", + "* **`base64`**: Embeddings are returned as a Base64-encoded string representing a NumPy array of single-precision floats. This can be more efficient for large batch operations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d1a982c1fd55" + }, + "outputs": [], + "source": [ + "import base64\n", + "import json\n", + "\n", + "import numpy as np\n", + "\n", + "text = \"A beautiful landscape photo.\"\n", + "\n", + "# Default output (list of floats)\n", + "default_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": text}]}],\n", + " \"input_type\": \"document\",\n", + "}\n", + "default_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(default_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "default_embedding = default_response.json()[\"data\"][0][\"embedding\"]\n", + "\n", + "# Base64-encoded output\n", + "base64_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [{\"content\": [{\"type\": \"text\", \"text\": text}]}],\n", + " \"input_type\": \"document\",\n", + " \"output_encoding\": \"base64\",\n", + "}\n", + "base64_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(base64_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "base64_embedding = base64_response.json()[\"data\"][0][\"embedding\"]\n", + "\n", + "# Decode the base64 embedding\n", + "decoded_embedding = np.frombuffer(base64.b64decode(base64_embedding), dtype=np.float32)\n", + "\n", + "print(\"Default output (list of floats):\")\n", + "print(f\" Type: {type(default_embedding)}\")\n", + "print(f\" Length: {len(default_embedding)}\")\n", + "print(f\" First 5 values: {default_embedding[:5]}\")\n", + "\n", + "print(\"\\nBase64 output:\")\n", + "print(f\" Type: {type(base64_embedding)}\")\n", + "print(f\" Length: {len(base64_embedding)} characters\")\n", + "print(f\" Decoded length: {len(decoded_embedding)}\")\n", + "print(f\" Decoded first 5 values: {decoded_embedding[:5].tolist()}\")\n", + "\n", + "# Verify they match\n", + "print(f\"\\nEmbeddings match: {np.allclose(default_embedding, decoded_embedding)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d256ca3e497d" + }, + "source": [ + "### Using different output dimensions\n", + "\n", + "Voyage Multimodal 3.5 supports multiple output dimensions: 256, 512, 1024 (default), and 2048. Smaller dimensions reduce storage and computation costs, while larger dimensions may provide better accuracy."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "edcfb14b5b50" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Use an image URL for testing different dimensions\n", + "image_url = \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\"\n", + "\n", + "# Test different output dimensions\n", + "dimensions = [256, 512, 1024, 2048]\n", + "\n", + "print(\"Comparing different output dimensions:\\n\")\n", + "for dim in dimensions:\n", + " body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"A photo of a banana on a white background.\",\n", + " },\n", + " {\"type\": \"image_url\", \"image_url\": image_url},\n", + " ]\n", + " }\n", + " ],\n", + " \"output_dimension\": dim,\n", + " \"input_type\": \"document\",\n", + " }\n", + " response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + " )\n", + " result = response.json()\n", + " embedding = result[\"data\"][0][\"embedding\"]\n", + "\n", + " print(f\"Dimension {dim}:\")\n", + " print(f\" Length: {len(embedding)}\")\n", + " print(f\" First 5 values: {embedding[:5]}\")\n", + " print(f\" Storage size: ~{len(embedding) * 4} bytes (float32)\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a9898eb27810" + }, + "source": [ + "### Using different output data types\n", + "\n", + "Voyage Multimodal 3.5 supports multiple output data types to optimize for storage and performance:\n", + "\n", + "* **`float`** (default): 32-bit floating-point numbers, highest precision\n", + "* **`int8`**: 8-bit signed integers (-128 to 127), 4x smaller than float\n", + "* **`uint8`**: 8-bit unsigned integers (0 to 255), 4x smaller than float\n", + "* **`binary`**: Bit-packed signed integers (int8), 32x smaller than float\n", + "* **`ubinary`**: Bit-packed unsigned integers (uint8), 32x smaller than float\n", + "\n", + "Quantized formats (int8, uint8, binary, ubinary) trade some precision for significant storage savings." 
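 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7f3a8c21b94d" + }, + "source": [ + "To make the bit-packed formats concrete, the next cell unpacks a few illustrative `ubinary` values back into individual bits with NumPy. The values are made up for illustration rather than real model output; with `output_dimension` 1024 and `output_dtype` `ubinary`, the API returns 1024 / 8 = 128 such integers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8d2b5e40c7a1" + }, + "outputs": [], + "source": [ + "# Illustrative only: unpack bit-packed (ubinary) values into individual bits.\n", + "# Each uint8 value stores 8 embedding dimensions; the numbers below are made up.\n", + "import numpy as np\n", + "\n", + "packed = np.array([176, 12, 255], dtype=np.uint8)  # 3 bytes -> 24 dimensions\n", + "bits = np.unpackbits(packed)\n", + "\n", + "print(\"Packed values:\", packed.tolist())\n", + "print(\"Unpacked bits:\", bits.tolist())\n", + "print(f\"{packed.size} bytes expand to {bits.size} binary dimensions\")"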
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e84d93b0bb13" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Use an image URL for testing different data types\n", + "image_url = \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\"\n", + "\n", + "# Test different output data types\n", + "output_dtypes = [\"float\", \"int8\", \"uint8\", \"binary\", \"ubinary\"]\n", + "\n", + "print(\"Comparing different output data types:\\n\")\n", + "for dtype in output_dtypes:\n", + " body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"A photo of a banana on a white background.\",\n", + " },\n", + " {\"type\": \"image_url\", \"image_url\": image_url},\n", + " ]\n", + " }\n", + " ],\n", + " \"output_dimension\": 1024,\n", + " \"output_dtype\": dtype,\n", + " \"input_type\": \"document\",\n", + " }\n", + " response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + " )\n", + " result = response.json()\n", + " embedding = result[\"data\"][0][\"embedding\"]\n", + "\n", + " # Calculate actual storage size\n", + " if dtype == \"float\":\n", + " storage_bytes = len(embedding) * 4 # 4 bytes per float32\n", + " elif dtype in [\"int8\", \"uint8\"]:\n", + " storage_bytes = len(embedding) * 1 # 1 byte per int8/uint8\n", + " elif dtype in [\"binary\", \"ubinary\"]:\n", + " storage_bytes = len(embedding) * 1 # bit-packed, 1/8 of dimension\n", + "\n", + " print(f\"Output dtype: {dtype}\")\n", + " print(f\" Length: {len(embedding)}\")\n", + " print(f\" Value type: {type(embedding[0]).__name__}\")\n", + " print(f\" First 5 values: {embedding[:5]}\")\n", + " print(f\" Storage size: ~{storage_bytes} bytes\")\n", + "\n", + " # Calculate compression ratio vs float\n", + " if dtype != \"float\":\n", + " compression_ratio = (1024 * 4) / storage_bytes\n", + " print(f\" Compression: {compression_ratio:.1f}x smaller than float\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a53ec669e084" + }, + "source": [ + "### Combining output_dimension and output_dtype\n", + "\n", + "You can combine different dimensions and data types to optimize for your use case.\n", + "\n", + "Please refer to our guide for details on [offset binary](https://docs.voyageai.com/docs/flexible-dimensions-and-quantization#offset-binary) and [binary embeddings](https://docs.voyageai.com/docs/flexible-dimensions-and-quantization#quantization)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0da84ee9f088" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Use an image URL for the comparison\n", + "image_url = \"https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg\"\n", + "\n", + "# Example: Ultra-compact embeddings (256 dimensions + ubinary)\n", + "compact_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\n", + " \"content\": [\n", + " {\"type\": \"text\", \"text\": \"A photo of a banana on a white background.\"},\n", + " {\"type\": \"image_url\", \"image_url\": image_url},\n", + " ]\n", + " }\n", + " ],\n", + " \"output_dimension\": 256,\n", + " \"output_dtype\": \"ubinary\", # Most compact format\n", + " \"input_type\": \"document\",\n", + "}\n", + "compact_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(compact_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "compact_result = compact_response.json()\n", + "compact_embedding = compact_result[\"data\"][0][\"embedding\"]\n", + "\n", + "# Example: High-precision embeddings (2048 dimensions + float)\n", + "precise_body = {\n", + " \"model\": \"voyage-multimodal-3.5\",\n", + " \"inputs\": [\n", + " {\n", + " \"content\": [\n", + " {\"type\": \"text\", \"text\": \"A photo of a banana on a white background.\"},\n", + " {\"type\": \"image_url\", \"image_url\": image_url},\n", + " ]\n", + " }\n", + " ],\n", + " \"output_dimension\": 2048,\n", + " \"output_dtype\": \"float\", # Highest precision\n", + " \"input_type\": \"document\",\n", + "}\n", + "precise_response = endpoint.invoke(\n", + " request_path=\"/multimodalembeddings\",\n", + " body=json.dumps(precise_body).encode(\"utf-8\"),\n", + " headers={\"Content-Type\": \"application/json\"},\n", + ")\n", + "precise_result = precise_response.json()\n", + "precise_embedding = precise_result[\"data\"][0][\"embedding\"]\n", + "\n", + "# Compare storage requirements\n", + "compact_storage = len(compact_embedding) * 1 # binary is bit-packed\n", + "precise_storage = len(precise_embedding) * 4 # float32\n", + "\n", + "print(\"Storage comparison:\\n\")\n", + "print(\"Ultra-compact (256-dim ubinary):\")\n", + "print(\" Dimension: 256\")\n", + "print(f\" Storage: ~{compact_storage} bytes\")\n", + "print(f\" First 5 values: {compact_embedding[:5]}\\n\")\n", + "\n", + "print(\"High-precision (2048-dim float):\")\n", + "print(f\" Dimension: {len(precise_embedding)}\")\n", + "print(f\" Storage: ~{precise_storage} bytes\")\n", + "print(f\" First 5 values: {precise_embedding[:5]}\\n\")\n", + "\n", + "print(f\"Storage ratio: {precise_storage / compact_storage:.1f}x\")\n", + "print(\"\\nFor 1 million vectors:\")\n", + "print(f\" Ultra-compact: ~{compact_storage * 1_000_000 / (1024**2):.1f} MB\")\n", + "print(f\" High-precision: ~{precise_storage * 1_000_000 / (1024**2):.1f} MB\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "435b202aa7d7" + }, + "source": [ + "## Cleaning up\n", + "\n", + "To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the endpoint and undeploy the model." 
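 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5b7d19e3a2c6" + }, + "source": [ + "If you connected to a pre-existing endpoint that you want to keep, you can undeploy the model without deleting the endpoint instead of running the next cell. A minimal sketch using the `google.cloud.aiplatform` SDK:\n", + "\n", + "```python\n", + "endpoint.undeploy_all()  # removes deployed models but keeps the endpoint\n", + "```\n", + "\n", + "Otherwise, run the next cell to delete the endpoint and everything deployed to it."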
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e8a61891590c" + }, + "outputs": [], + "source": [ + "# Delete the endpoint (this will also undeploy all models)\n", + "print(f\"Deleting endpoint: {endpoint.display_name}\")\n", + "endpoint.delete(force=True)\n", + "print(\"Endpoint deleted successfully!\")" + ] + } + ], + "metadata": { + "colab": { + "name": "voyage-multimodal-3.5.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}