Add Jupyter notebook implementation of Multimodal RAG
Showing 1 changed file with 373 additions and 0 deletions.
...uilding-multimodal-rag-with-elasticsearch-gotham/notebook/01-mmrag-blog-quick-start.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"# Multimodal RAG with Elasticsearch: The Gotham City Case\n",
"\n"
],
"metadata": {
"id": "dGVterhZUeb7"
}
},
{
"cell_type": "markdown",
"source": [
"This notebook implements the Multimodal RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch as described in the blog. We follow the same structure as the article, with each section explained and implemented in code.\n",
"\n",
"## Environment Setup\n",
"\n",
"First, we need to clone the repository that contains the complete project code."
],
"metadata": {
"id": "JGuNiw7hUc6M"
}
},
{
"cell_type": "code",
"source": [
"# Clone the repository, checking out the feature/multimodal-rag-gotham branch\n",
"!git clone -b feature/multimodal-rag-gotham https://github.com/salgado/elasticsearch-labs.git"
],
"metadata": {
"id": "UM5x0n2iA7o2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Let's navigate to the project directory where the necessary files are located:\n"
],
"metadata": {
"id": "e6mW8JNyVdzi"
}
},
{
"cell_type": "code",
"source": [
"# %cd persists the working-directory change for the rest of the notebook\n",
"%cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham"
],
"metadata": {
"id": "PHrDQc0jOOb7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now let's configure the environment variables needed to connect to Elasticsearch and OpenAI. This is necessary for indexing and searching content, as well as generating the final report.\n"
],
"metadata": {
"id": "LAGB159_Uaxb"
}
},
{
"cell_type": "code",
"source": [
"import getpass\n",
"\n",
"ELASTICSEARCH_URL = input(\"Enter the Elasticsearch endpoint URL: \")\n",
"ELASTICSEARCH_API_KEY = getpass.getpass(\"Enter the Elasticsearch API key: \")\n",
"\n",
"OPENAI_API_KEY = getpass.getpass(\"Enter the OpenAI API key: \")"
],
"metadata": {
"id": "U8IuJRQhS7lz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import os\n",
"os.environ[\"ELASTICSEARCH_API_KEY\"] = ELASTICSEARCH_API_KEY\n",
"os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
"os.environ[\"ELASTICSEARCH_URL\"] = ELASTICSEARCH_URL\n"
],
"metadata": {
"id": "ZC4v_SHjMwLa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"## Installing Dependencies\n",
"\n",
"As mentioned in the blog, we need to install the required dependencies, including the custom ImageBind fork:\n"
],
"metadata": {
"id": "RNRExs7aVl45"
}
},
{
"cell_type": "code",
"source": [
"# Install base dependencies (quotes keep the shell from treating \">=\" as a redirect)\n",
"!pip install \"torch>=2.1.0\" \"torchvision>=0.16.0\" \"torchaudio>=2.1.0\"\n",
"!pip install opencv-python-headless pillow numpy\n",
"\n",
"# Install the specific ImageBind fork\n",
"!pip install git+https://github.com/hkchengrex/ImageBind.git"
],
"metadata": {
"id": "FhPcJYl03eNL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!pip -q install elasticsearch"
],
"metadata": {
"id": "LISqDRmE8PpG"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!pip install python-dotenv"
],
"metadata": {
"id": "GGIFHatG9BTP"
},
"execution_count": null,
"outputs": []
},
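{
"cell_type": "markdown",
"source": [
"Before moving on, we can optionally verify that the Elasticsearch credentials entered above actually work. This is a small sanity check added for convenience (it is not one of the blog's scripts), using the official Python client we just installed.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from elasticsearch import Elasticsearch\n",
"\n",
"# Build a client from the credentials collected earlier in this notebook\n",
"es = Elasticsearch(ELASTICSEARCH_URL, api_key=ELASTICSEARCH_API_KEY)\n",
"\n",
"# info() returns cluster metadata when the endpoint and API key are valid\n",
"print(es.info()[\"version\"][\"number\"])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},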
{
"cell_type": "markdown",
"source": [
"## Stage 1 - Collecting Crime Scene Clues\n",
"\n",
"As explained in the blog, the first step is to verify that we have the correct directory structure and that the evidence files are present. We use `files_check.py` for this.\n"
],
"metadata": {
"id": "jJt01mAeYaOT"
}
},
{
"cell_type": "code",
"source": [
"!python stages/01-stage/files_check.py"
],
"metadata": {
"id": "rZJexfwR4FaT"
},
"execution_count": null,
"outputs": []
},
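{
"cell_type": "markdown",
"source": [
"For illustration, the cell below sketches the kind of check the script performs. The `data/` folder and the modality subfolder names are assumptions made for this example; `files_check.py` defines the authoritative layout.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from pathlib import Path\n",
"\n",
"# Illustrative layout only; files_check.py defines the real expected structure\n",
"base = Path(\"data\")\n",
"for modality in [\"images\", \"audios\", \"texts\", \"depths\"]:\n",
"    folder = base / modality\n",
"    files = sorted(p.name for p in folder.glob(\"*\")) if folder.exists() else \"missing\"\n",
"    print(f\"{folder}: {files}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},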
{
"cell_type": "markdown",
"source": [
"## Stage 2 - Generating Embeddings with ImageBind\n",
"\n",
"Now we test the embedding generation for an image using ImageBind. As the blog explains, ImageBind allows us to generate embeddings for different modalities (image, audio, text) in a shared vector space.\n"
],
"metadata": {
"id": "0a1tNsiGYjEZ"
}
},
{
"cell_type": "code",
"source": [
"!python stages/02-stage/test_embedding_generation.py"
],
"metadata": {
"id": "A6C9IIuA6dlH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This script generates a 1024-dimensional embedding for a test image, confirming that the ImageBind model is working correctly.\n"
],
"metadata": {
"id": "Vw5xlFXgYls4"
}
},
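{
"cell_type": "markdown",
"source": [
"For reference, here is a minimal sketch of generating such an embedding directly. It assumes the installed fork exposes the standard ImageBind API (`imagebind_huge`, `load_and_transform_vision_data`) and uses a placeholder image path; the script above may differ in its details.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import torch\n",
"from imagebind import data\n",
"from imagebind.models import imagebind_model\n",
"from imagebind.models.imagebind_model import ModalityType\n",
"\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"\n",
"# Load the pretrained ImageBind model (weights are downloaded on first use)\n",
"model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)\n",
"\n",
"# Placeholder path; point this at any of the evidence images in the repository\n",
"image_paths = [\"data/images/crime_scene1.jpg\"]\n",
"inputs = {ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device)}\n",
"\n",
"with torch.no_grad():\n",
"    embeddings = model(inputs)\n",
"\n",
"print(embeddings[ModalityType.VISION].shape)  # expected: torch.Size([1, 1024])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},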
{
"cell_type": "markdown",
"source": [
"\n",
"## Stage 3 - Storage and Search in Elasticsearch\n",
"\n",
"### Content Indexing\n",
"\n",
"The next step is to index all multimodal evidence in Elasticsearch. This includes images, audio, text, and depth maps as described in the blog."
],
"metadata": {
"id": "Q2dsScL5ZF0X"
}
},
{
"cell_type": "code",
"source": [
"!python stages/03-stage/index_all_modalities.py"
],
"metadata": {
"id": "3nBsEf7u60bq"
},
"execution_count": null,
"outputs": []
},
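{
"cell_type": "markdown",
"source": [
"The script we just ran indexes all of the evidence. To make that step concrete, the cell below sketches how a single embedding could be stored in Elasticsearch: the index name, mapping, and document content are illustrative rather than the ones the script uses, and it reuses the `es` client and the image embedding from the cells above.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Illustrative index with a 1024-dim dense_vector field for ImageBind embeddings\n",
"demo_index = \"gotham-evidence-demo\"\n",
"\n",
"if not es.indices.exists(index=demo_index):\n",
"    es.indices.create(\n",
"        index=demo_index,\n",
"        mappings={\n",
"            \"properties\": {\n",
"                \"description\": {\"type\": \"text\"},\n",
"                \"modality\": {\"type\": \"keyword\"},\n",
"                \"embedding\": {\"type\": \"dense_vector\", \"dims\": 1024, \"index\": True, \"similarity\": \"cosine\"},\n",
"            }\n",
"        },\n",
"    )\n",
"\n",
"# Index one (hypothetical) piece of visual evidence together with its embedding\n",
"doc = {\n",
"    \"description\": \"Photo taken at the crime scene\",\n",
"    \"modality\": \"vision\",\n",
"    \"embedding\": embeddings[ModalityType.VISION][0].cpu().tolist(),\n",
"}\n",
"es.index(index=demo_index, document=doc, refresh=\"wait_for\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},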
{
"cell_type": "markdown",
"source": [
"\n",
"Each piece of evidence is now indexed in Elasticsearch with its embedding, allowing for similarity search.\n",
"\n",
"### Searching by Similarity Across Different Modalities\n",
"\n",
"Now we can test searching for evidence by similarity using different modalities as queries. The blog describes how an input from one modality can retrieve results from all modalities.\n",
"\n",
"#### Search by Audio\n"
],
"metadata": {
"id": "Tf-8U-CGZXxW"
}
},
{
"cell_type": "code",
"source": [
"!python stages/03-stage/search_by_audio.py"
],
"metadata": {
"id": "7f-MBkFALphP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"This command uses an audio file as a query and retrieves the most similar evidence. In the Gotham City case, this helps identify connections between the audio of a sinister laugh and the other evidence.\n",
"\n",
"#### Search by Text"
],
"metadata": {
"id": "nrGUO1JVZZnz"
}
},
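{
"cell_type": "markdown",
"source": [
"Each of these search scripts follows the same pattern: embed the query with ImageBind, then run a kNN search over the stored vectors. The cell below sketches that pattern for the text query, reusing the model, client, and illustrative demo index defined earlier (the script itself queries its own index and fields):\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Embed the text query in the same vector space as the indexed evidence\n",
"text_inputs = {ModalityType.TEXT: data.load_and_transform_text([\"Why so serious?\"], device)}\n",
"with torch.no_grad():\n",
"    query_vector = model(text_inputs)[ModalityType.TEXT][0].cpu().tolist()\n",
"\n",
"# kNN search over the dense_vector field\n",
"response = es.search(\n",
"    index=demo_index,\n",
"    knn={\n",
"        \"field\": \"embedding\",\n",
"        \"query_vector\": query_vector,\n",
"        \"k\": 3,\n",
"        \"num_candidates\": 50,\n",
"    },\n",
")\n",
"\n",
"for hit in response[\"hits\"][\"hits\"]:\n",
"    print(round(hit[\"_score\"], 3), hit[\"_source\"][\"modality\"], \"-\", hit[\"_source\"][\"description\"])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},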
{
"cell_type": "code",
"source": [
"!python stages/03-stage/search_by_text.py"
],
"metadata": {
"id": "mm_RwbfYQBGK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"Here we use a text query (\"Why so serious?\") to find related evidence.\n",
"\n",
"#### Search by Image\n"
],
"metadata": {
"id": "YXhvE2EbZgQt"
}
},
{
"cell_type": "code",
"source": [
"!python stages/03-stage/search_by_image.py"
],
"metadata": {
"id": "jrOBYZwtQQng"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This script uses an image from the crime scene to find similar visual evidence.\n",
"\n",
"#### Search by Depth Map\n"
],
"metadata": {
"id": "V2Ut2whVZm3s"
}
},
{
"cell_type": "code",
"source": [
"!python stages/03-stage/search_by_depth.py"
],
"metadata": {
"id": "Bbm1vWfXQiPZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"As explained in the blog, depth maps can provide information about the 3D structure of the scene or objects, complementing the other modalities.\n",
"\n",
"## Stage 4 - Evidence Analysis with LLM\n",
"\n",
"Finally, we bring together all the retrieved evidence and use an LLM (GPT-4) to generate a forensic report that identifies the suspect based on the connections between the different modalities.\n"
],
"metadata": {
"id": "DWSzg742ZoQw"
}
},
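{
"cell_type": "markdown",
"source": [
"Under the hood, this report generation boils down to a single chat-completion call over the retrieved evidence. The cell below is a rough sketch of that step with illustrative evidence summaries, prompt, and model name (the script defines its own); it assumes the `openai` package (v1+) is installed and reads `OPENAI_API_KEY` from the environment.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI()  # picks up OPENAI_API_KEY from the environment\n",
"\n",
"# Hypothetical summaries standing in for the evidence retrieved from Elasticsearch\n",
"evidence_summaries = [\n",
"    \"Audio: a sinister laugh recorded near the crime scene\",\n",
"    \"Text: a note reading 'Why so serious?'\",\n",
"    \"Image: a playing card left at the scene\",\n",
"]\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=\"gpt-4o\",  # illustrative; the blog uses GPT-4\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": \"You are a forensic analyst. Write a brief report naming the most likely suspect.\"},\n",
"        {\"role\": \"user\", \"content\": \"\\n\".join(evidence_summaries)},\n",
"    ],\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},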
{
"cell_type": "code",
"source": [
"!python stages/04-stage/rag_crime_analyze.py"
],
"metadata": {
"id": "A8pmOH31Q2Hc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"This is the final step of the Multimodal RAG pipeline, where the LLM analyzes the evidence retrieved from Elasticsearch and synthesizes it into a coherent report that identifies the Joker as the main suspect.\n",
"\n",
"## Conclusion\n",
"\n",
"We have now implemented the complete Multimodal RAG pipeline with Elasticsearch, following all the steps described in the blog. The pipeline demonstrates how different types of media can be analyzed in an integrated way, surfacing insights and connections between pieces of evidence that would be difficult to identify manually.\n"
],
"metadata": {
"id": "VaWriUfjZyUz"
}
}
]
}