
Commit 1401ff7

Merge pull request #36 from NFDI4BIOIMAGE/28-caching
Implement Caching
2 parents 5f4c39f + e1968df commit 1401ff7

File tree

3 files changed: +502 −0 lines changed


Enable_Caching.ipynb

+383 lines
## Implement Caching of Text and Visual Embeddings

In this notebook, we establish a method to cache embeddings. With a persistent cache, we do not need to repeat costly calculations for the same PDFs and slides: once the embeddings have been computed, they are stored, and the stored result is simply fetched whenever it is needed again.

- `caching_local`: Calculates the text and visual embeddings if they have not been calculated yet. The results are then stored in a local file using Python's `shelve` module (see the sketch below).
- `caching_hf`: Also calculates the text and visual embeddings if they have not been calculated yet. The results are then stored in a cache dataset on Hugging Face.
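To make the `shelve` approach concrete, here is a minimal sketch of such a persistent key/value cache. The names (`cached_embeddings`, `compute_embeddings`, `embedding_cache`) are hypothetical placeholders, not the actual implementation in `caching.py`:

```python
import shelve

def cached_embeddings(key, compute_embeddings, cache_file="embedding_cache"):
    """Return the embeddings for `key`, computing them only on the first call."""
    with shelve.open(cache_file) as cache:
        if key in cache:
            print(f"Fetching from cache: {key}")
            return cache[key]
        print(f"Caching {key}")
        result = compute_embeddings(key)  # the expensive part
        cache[key] = result  # persisted to disk for future runs
        return result
```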
```python
from caching import caching_hf, caching_local
```
### 1. Caching the results on the local disk
```python
import time

pdf_path = "WhatIsOMERO.pdf"  # Path to your PDF

start_time = time.time()

caching_local(pdf_path)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to calculate the embeddings.')
```

Output:
```
Caching slide 1
Caching slide 2
Caching slide 3
Caching slide 4
Caching slide 5
Caching slide 6
Caching slide 7
Caching slide 8
Caching slide 9
It took 3.67 seconds to calculate the embeddings.
```
When performing the same task again, the embeddings are already stored in the cache, and fetching them should be much faster:
```python
start_time = time.time()

caching_local(pdf_path)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')
```

Output:
```
Fetching from cache: Slide 1
Fetching from cache: Slide 2
Fetching from cache: Slide 3
Fetching from cache: Slide 4
Fetching from cache: Slide 5
Fetching from cache: Slide 6
Fetching from cache: Slide 7
Fetching from cache: Slide 8
Fetching from cache: Slide 9
It took 0.01 seconds to fetch the embeddings from the cache.
```
### 2. Caching the results online via Hugging Face

You first need to install the Hugging Face Hub package and create an account on [Hugging Face](https://huggingface.co/). You also have to create a [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and set it as an environment variable. For more information on how to do that, check out the [ReadMe](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/README.md). In this example, the data is stored in my repository on Hugging Face.
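Authenticating with that token typically looks like the following minimal sketch. It assumes the token is stored in the `HF_TOKEN` environment variable; check the ReadMe for the variable name the project actually expects:

```python
import os
from huggingface_hub import login

# Authenticate against the Hugging Face Hub with a token read from the
# environment, so the token never appears in the notebook itself.
login(token=os.environ["HF_TOKEN"])
```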
```python
repo_name = "lea-33/SlightInsight_Cache2"  # Change this to your Hugging Face repository name

start_time = time.time()

caching_hf(pdf_path, repo_name)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to calculate the embeddings.')
```

Output:
```
Repository 'lea-33/SlightInsight_Cache2' created.
Caching Slide 1
Caching Slide 2
Caching Slide 3
Caching Slide 4
Caching Slide 5
Caching Slide 6
Caching Slide 7
Caching Slide 8
Caching Slide 9
Uploading the dataset shards: 0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 0%|          | 0/1 [00:00<?, ?ba/s]
Finished caching WhatIsOMERO.pdf
It took 7.91 seconds to calculate the embeddings.
```
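Under the hood, a cache like this can be pushed to the Hub with the `datasets` library. The following is a hypothetical sketch that mirrors the key/value record layout shown in section 3; it is not the actual code in `caching.py`:

```python
from datasets import Dataset

def push_cache(records, repo_name):
    # `records` is assumed to be a list of dicts such as
    # {"key": "WhatIsOMERO.pdf_slide1",
    #  "value": {"text_embedding": [...], "vision_embedding": [...]}}
    ds = Dataset.from_list(records)
    ds.push_to_hub(repo_name)  # requires a valid Hugging Face token
```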
Again, running the same call should now be much faster, because the embeddings can be fetched directly from the storage on Hugging Face.
```python
start_time = time.time()

caching_hf(pdf_path, repo_name)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')
```

Output:
```
Repository 'lea-33/SlightInsight_Cache2' already exists.
Generating train split: 0%|          | 0/9 [00:00<?, ? examples/s]
Fetching from cache: Slide 1
Fetching from cache: Slide 2
Fetching from cache: Slide 3
Fetching from cache: Slide 4
Fetching from cache: Slide 5
Fetching from cache: Slide 6
Fetching from cache: Slide 7
Fetching from cache: Slide 8
Fetching from cache: Slide 9
Uploading the dataset shards: 0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 0%|          | 0/1 [00:00<?, ?ba/s]
No files have been modified since last commit. Skipping to prevent empty commit.
Finished caching WhatIsOMERO.pdf
It took 3.77 seconds to fetch the embeddings from the cache.
```
### 3. Load the dataset from the cache and convert it to a pandas DataFrame for easy processing
```python
from datasets import load_dataset
import pandas as pd

def load_and_display_cache(repo_name):
    # Load the dataset from Hugging Face
    cache_dataset = load_dataset(repo_name, split="train")

    # Convert to a pandas DataFrame for better visualization
    df = pd.DataFrame(cache_dataset)

    # Display a preview of the dataset
    print("Dataset Preview:")
    print(df.head())

    # Example: inspect the embeddings stored in the first record
    first_record = cache_dataset[0]

    print("\nFirst Text Embedding:")
    print(first_record["value"]["text_embedding"][:10], "...")

    print("\nFirst Vision Embedding:")
    print(first_record["value"]["vision_embedding"][:10], "...")


load_and_display_cache("lea-33/SlightInsight_Cache2")
```

Output:
```
Dataset Preview:
                      key                                              value
0  WhatIsOMERO.pdf_slide1  {'text_embedding': [0.4003332853317261, -0.336...
1  WhatIsOMERO.pdf_slide2  {'text_embedding': [0.39082658290863037, -0.28...
2  WhatIsOMERO.pdf_slide3  {'text_embedding': [0.18631458282470703, -0.37...
3  WhatIsOMERO.pdf_slide4  {'text_embedding': [0.18063969910144806, -0.60...
4  WhatIsOMERO.pdf_slide5  {'text_embedding': [-0.44303596019744873, -0.5...

First Text Embedding:
[0.4003332853317261, -0.33649125695228577, 0.3998110592365265, -0.4730990529060364, -0.5025672316551208, 0.12307340651750565, -0.24336643517017365, -0.3277848958969116, 0.29507237672805786, 0.5909251570701599] ...

First Vision Embedding:
[-0.037381067872047424, 0.4586034417152405, 0.020449191331863403, 0.13002845644950867, 0.3475934863090515, -0.14490166306495667, -0.16358992457389832, 0.13041885197162628, -0.04649023711681366, 0.08413688838481903] ...
```
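Once loaded, the embeddings can be processed further. As a small sketch (assuming the record layout shown above), the text embeddings can be stacked into a numpy matrix to compare slides by cosine similarity:

```python
import numpy as np
from datasets import load_dataset

# Stack the cached text embeddings into one matrix (one row per slide).
ds = load_dataset("lea-33/SlightInsight_Cache2", split="train")
emb = np.array([record["value"]["text_embedding"] for record in ds])

# L2-normalize the rows so a dot product equals cosine similarity.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
similarity = emb @ emb.T  # pairwise slide-to-slide similarity
print(similarity.shape)   # (9, 9) for the nine cached slides
```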
