NFDI4BIOIMAGE
diff --git a/‎Enable_Caching.ipynb
Lines changed: 383 additions & 0 deletions b/‎Enable_Caching.ipynb
Lines changed: 383 additions & 0 deletions
@@ -0,0 +1,383 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c30cb9b0-405b-48ce-a744-3f811f120a87",
+   "metadata": {},
+   "source": [
+    "## Implement Caching of Text and Visual Embeddings\n",
+    "\n",
+    "In this notebook, we establish a method to cache embeddings. By implementing a persistent cache, we don't need to perform costly calculations over and over again for the same pdfs and slides. We can save a lot of time by storing them, once they were calculated and just fetch the desired outcome if we need it again.\n",
+    "\n",
+    "- caching_local: Calculates embeddings (text and visual) if they are not calculated yet. Results are then stored in a local file using python's shelve module.\n",
+    "- caching_hf: Also calculates embeddings (text and visual) if they are not calculated yet. Results are then stored in a Caching file on Huggingface.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "93cae03f-4adf-47c6-af18-cdf18b674f3c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from caching import caching_hf, caching_local"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e1713e3c-68c9-44c2-9925-43ce79d75e80",
+   "metadata": {},
+   "source": [
+    "### 1. Caching the results on the local disc"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f16dde9b-6c37-427e-9a4b-c7c0da28c44a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Caching slide 1\n",
+      "Caching slide 2\n",
+      "Caching slide 3\n",
+      "Caching slide 4\n",
+      "Caching slide 5\n",
+      "Caching slide 6\n",
+      "Caching slide 7\n",
+      "Caching slide 8\n",
+      "Caching slide 9\n",
+      "It took 3.67 seconds to calculate the embeddings.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import time\n",
+    "pdf_path = \"WhatIsOMERO.pdf\"  # Path to your PDF\n",
+    "\n",
+    "start_time = time.time()\n",
+    "\n",
+    "caching_local(pdf_path)\n",
+    "\n",
+    "end_time= time.time()\n",
+    "duration= end_time - start_time\n",
+    "print(f'It took {duration:.2f} seconds to calculate the embeddings.')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f891dd1-7d08-4dc6-a824-7ecb80064d6e",
+   "metadata": {},
+   "source": [
+    "When performing the same task again, the Embeddings are already stored in the Cache and the calculation should be much faster:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c47466dd-71ae-4cdb-b83a-6ea488f26b07",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Fetching from cache: Slide 1\n",
+      "Fetching from cache: Slide 2\n",
+      "Fetching from cache: Slide 3\n",
+      "Fetching from cache: Slide 4\n",
+      "Fetching from cache: Slide 5\n",
+      "Fetching from cache: Slide 6\n",
+      "Fetching from cache: Slide 7\n",
+      "Fetching from cache: Slide 8\n",
+      "Fetching from cache: Slide 9\n",
+      "It took 0.01 seconds to fetch the embeddings from the cache.\n"
+     ]
+    }
+   ],
+   "source": [
+    "start_time = time.time()\n",
+    "\n",
+    "caching_local(pdf_path)\n",
+    "\n",
+    "end_time= time.time()\n",
+    "duration= end_time - start_time\n",
+    "print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "020bd6b5-b62c-4aa3-b597-172ba9128305",
+   "metadata": {},
+   "source": [
+    "### 2. Caching the results online via Huggingface\n",
+    "\n",
+    "You need to install the Huggingface Hub Package first and create an Account on [Huggingface](https://huggingface.co/). You also have to create a [Huggingface Token](https://huggingface.co/docs/hub/security-tokens) and set this as a environment variable. To get more information on how to do that, check out the [ReadMe](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/README.md).\n",
+    "In this example the Data is stored in my Repository on Huggingface."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "50c38072-c190-41e1-93a3-0618844848c3",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Repository 'lea-33/SlightInsight_Cache2' created.\n",
+      "Caching Slide 1\n",
+      "Caching Slide 2\n",
+      "Caching Slide 3\n",
+      "Caching Slide 4\n",
+      "Caching Slide 5\n",
+      "Caching Slide 6\n",
+      "Caching Slide 7\n",
+      "Caching Slide 8\n",
+      "Caching Slide 9\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "ca45505e21374216a03b11131b576a67",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "5fa9955410274ea7a5f4ae59ea2d0ca9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Finished caching WhatIsOMERO.pdf\n",
+      "It took 7.91 seconds to calculate the embeddings.\n"
+     ]
+    }
+   ],
+   "source": [
+    "repo_name = \"lea-33/SlightInsight_Cache2\"  # Change this to your Hugging Face repository name\n",
+    "\n",
+    "start_time = time.time()\n",
+    "\n",
+    "caching_hf(pdf_path, repo_name)\n",
+    "\n",
+    "end_time= time.time()\n",
+    "duration= end_time - start_time\n",
+    "print(f'It took {duration:.2f} seconds to calculate the embeddings.')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c3c5eae-1ed7-44e0-9c46-777bfc70292c",
+   "metadata": {},
+   "source": [
+    "Again, re-calculating the Embeddings should be faster, because they can now be fetched directly from the storage on Huggingface."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "d4f7a6f4-2dba-4912-a8fa-1fd86f7a329d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Repository 'lea-33/SlightInsight_Cache2' already exists.\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "45ed41f991444173bcfbc1237bdffede",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Generating train split:   0%|          | 0/9 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Fetching from cache: Slide 1\n",
+      "Fetching from cache: Slide 2\n",
+      "Fetching from cache: Slide 3\n",
+      "Fetching from cache: Slide 4\n",
+      "Fetching from cache: Slide 5\n",
+      "Fetching from cache: Slide 6\n",
+      "Fetching from cache: Slide 7\n",
+      "Fetching from cache: Slide 8\n",
+      "Fetching from cache: Slide 9\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "e5888786c8c4473db71983d77133e77b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "acf4a0ba48d54fb8b29462f800499a3d",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "No files have been modified since last commit. Skipping to prevent empty commit.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Finished caching WhatIsOMERO.pdf\n",
+      "It took 3.77 seconds to fetch the embeddings from the cache.\n"
+     ]
+    }
+   ],
+   "source": [
+    "start_time = time.time()\n",
+    "\n",
+    "caching_hf(pdf_path, repo_name)\n",
+    "\n",
+    "end_time= time.time()\n",
+    "duration= end_time - start_time\n",
+    "print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bb5574a-5fe6-49b0-9cb4-f0e307ef8955",
+   "metadata": {},
+   "source": [
+    "### 3. Load the Dataset from Cache and convert it to a pandas DataFrame for easy processing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "cd042750-3954-42bf-af31-96d94a9607f6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Dataset Preview:\n",
+      "                      key                                              value\n",
+      "0  WhatIsOMERO.pdf_slide1  {'text_embedding': [0.4003332853317261, -0.336...\n",
+      "1  WhatIsOMERO.pdf_slide2  {'text_embedding': [0.39082658290863037, -0.28...\n",
+      "2  WhatIsOMERO.pdf_slide3  {'text_embedding': [0.18631458282470703, -0.37...\n",
+      "3  WhatIsOMERO.pdf_slide4  {'text_embedding': [0.18063969910144806, -0.60...\n",
+      "4  WhatIsOMERO.pdf_slide5  {'text_embedding': [-0.44303596019744873, -0.5...\n",
+      "\n",
+      "First Text Embedding:\n",
+      "[0.4003332853317261, -0.33649125695228577, 0.3998110592365265, -0.4730990529060364, -0.5025672316551208, 0.12307340651750565, -0.24336643517017365, -0.3277848958969116, 0.29507237672805786, 0.5909251570701599] ...\n",
+      "\n",
+      "First Vision Embedding:\n",
+      "[-0.037381067872047424, 0.4586034417152405, 0.020449191331863403, 0.13002845644950867, 0.3475934863090515, -0.14490166306495667, -0.16358992457389832, 0.13041885197162628, -0.04649023711681366, 0.08413688838481903] ...\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datasets import load_dataset\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Load the dataset\n",
+    "def load_and_display_cache(repo_name):\n",
+    "    # Load the dataset from Hugging Face\n",
+    "    cache_dataset = load_dataset(repo_name, split=\"train\")\n",
+    "    \n",
+    "    # Convert to pandas DataFrame for better visualization\n",
+    "    df = pd.DataFrame(cache_dataset)\n",
+    "\n",
+    "    # Display a preview of the dataset\n",
+    "    print(\"Dataset Preview:\")\n",
+    "    print(df.head())\n",
+    "    \n",
+    "    # Example: Display restored image from the first record\n",
+    "    first_record = cache_dataset[0]\n",
+    "    \n",
+    "    print(\"\\nFirst Text Embedding:\")\n",
+    "    print(first_record[\"value\"][\"text_embedding\"][:10], \"...\") \n",
+    "    \n",
+    "    print(\"\\nFirst Vision Embedding:\")\n",
+    "    print(first_record[\"value\"][\"vision_embedding\"][:10], \"...\")\n",
+    "\n",
+    "\n",
+    "load_and_display_cache(\"lea-33/SlightInsight_Cache2\")\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}