
Commit 247d7ba

Update notebook
Signed-off-by: Andy Kwok <andy.kwok@improving.com>
1 parent e6854b9 commit 247d7ba

1 file changed

Lines changed: 102 additions & 16 deletions

File tree

notebooks/import_s3_table_embedding_demo.ipynb

@@ -35,7 +35,7 @@
 "source": [
 "import asyncio\n",
 "import os\n",
-"\n",
+"import pandas as pd\n",
 "import boto3\n",
 "import dotenv\n",
 "\n",
@@ -104,6 +104,39 @@
 "In this section, the dataset is modified to append an additional embedding column generated using Amazon Bedrock. The enriched CSV file is then uploaded to Amazon S3 for downstream processing as part of the data lake projection workflow.\n"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Download the fashion dataset from Kaggle (only styles.csv is needed).\n",
+"# https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small\n",
+"data_path = \"../example/resources/styles.csv\"\n",
+"data_w_embedding_path = \"../example/resources/styles_embedding.csv\"\n",
+"\n",
+"athena_client = boto3.client('athena')\n",
+"\n",
+"# Read data from data path\n",
+"headers, rows = read_csv(data_path, 10)\n",
+"\n",
+"# Print out the data file content\n",
+"df = pd.DataFrame(rows)\n",
+"df\n",
+"\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Data Enrichment – Embeddings\n",
+"\n",
+"Next, an `append_embedding` function is applied to each row to generate an embedding vector from a subset of product attributes (e.g. `masterCategory`, `subCategory`).\n",
+"\n",
+"The resulting embedding is appended as a new column in the output dataset, and will later be imported into Neptune Analytics for similarity search.\n"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
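The `append_embedding` helper referenced by the markdown above is not shown in this hunk. A minimal sketch of what such a row-wise enrichment might look like, assuming an injectable `embed_fn` (in the notebook this would presumably wrap an Amazon Bedrock embedding model) and the attribute subset `masterCategory`/`subCategory` named in the cell text:

```python
import json


def append_embedding(headers, rows, embed_fn,
                     attrs=("masterCategory", "subCategory")):
    """Append an 'embedding' column built from a subset of product attributes.

    embed_fn: callable mapping a text string to a list of floats
    (hypothetical stand-in for the notebook's Bedrock-backed embedder).
    """
    out_headers = list(headers) + ["embedding"]
    out_rows = []
    for row in rows:
        # Concatenate the selected attributes into one text to embed.
        text = " ".join(str(row.get(a, "")) for a in attrs)
        enriched = dict(row)
        # Store the vector as JSON so it survives the round-trip through CSV.
        enriched["embedding"] = json.dumps(embed_fn(text))
        out_rows.append(enriched)
    return out_headers, out_rows
```

Leaving the embedding call injectable keeps the transformation testable offline; the real notebook likely calls Bedrock directly instead.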
@@ -128,33 +161,46 @@
 " return fieldnames, rows\n",
 "\n",
 "\n",
-"# Download the fahsion.csv from Kaggle dataset (Only the style.csv).\n",
-"# https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-small\n",
-"data_path = \"../example/resources/styles.csv\"\n",
-"data_w_embedding_path = \"../example/resources/styles_embedding.csv\"\n",
-"\n",
-"athena_client = boto3.client('athena')\n",
-"\n",
-"# Read data from data path\n",
-"headers, rows = read_csv(data_path, 10)\n",
 "# Add the embedding\n",
 "headers, rows = append_embedding(headers, rows)\n",
+"\n",
 "# Write to new csv\n",
 "write_csv(data_w_embedding_path, headers, rows)\n",
 "\n",
+"# Print out the data file content\n",
+"df = pd.DataFrame(rows)\n",
+"df\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Upload dataset\n",
+"\n",
+"Once the embedding column has been added, the enriched dataset is uploaded to Amazon S3.\n",
+"\n",
+"This S3 object serves as the input data source for subsequent data lake projection and import into Neptune Analytics."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
 "# Push to s3\n",
 "empty_s3_bucket(s3_location_data_lake)\n",
 "push_to_s3(data_w_embedding_path, _clean_s3_path(s3_location_data_lake),\"styles_embedding.csv\")\n",
 "\n",
-"\n",
-"print(\"Completed data preparation.\")\n"
+"print(\"DataLake preparation completed.\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Data Projection\n",
+"## Data Projection\n",
 "\n",
 "Once the data source has been uploaded successfully, two Amazon Athena queries are executed:\n",
 "\n",
@@ -222,9 +268,7 @@
 "source": [
 "## Import Data into Neptune Analytics and Perform Similarity Search\n",
 "\n",
-"Once the compatible import file has been generated, the import process can be triggered to load the dataset into Amazon Neptune Analytics.\n",
-"\n",
-"After the import completes successfully, a `topK.byNode` query is executed to perform similarity search. This step verifies that the embedding vectors have been imported correctly and that the TopK algorithm can identify products with similar characteristics based on their embeddings.\n"
+"Once the compatible import file has been generated, the import process can be triggered to load the dataset into Amazon Neptune Analytics.\n"
 ]
 },
 {
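The cell that actually triggers the import is only partially visible in the next hunk. For orientation, a hedged sketch of the request shape for the Neptune Analytics `StartImportTask` operation (via the `neptune-graph` boto3 client); the graph ID, S3 path, and role ARN below are placeholders, and the notebook's real call may use a different wrapper:

```python
# Sketch only: all values below are placeholders, not the notebook's config.
graph_id = "g-exampleid"
s3_source = "s3://example-bucket/import/styles_embedding.csv"
role_arn = "arn:aws:iam::123456789012:role/NeptuneAnalyticsImportRole"

# Request shape for StartImportTask on the neptune-graph service.
import_params = {
    "graphIdentifier": graph_id,
    "source": s3_source,
    "format": "CSV",
    "roleArn": role_arn,
}

# Uncomment to run against a real account:
# import boto3
# client = boto3.client("neptune-graph")
# task = client.start_import_task(**import_params)
```

The role ARN must grant Neptune Analytics read access to the source bucket; the task is asynchronous, so its status would be polled until the load completes.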
@@ -241,6 +285,48 @@
 " )\n"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Inspect Embedding\n",
+"\n",
+"A simple query is used to inspect the imported embeddings by printing the first 5 floating-point values from each node's embedding vector.\n",
+"\n",
+"This provides a quick sanity check to verify that the embedding data has been ingested and stored correctly before running similarity queries."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"TOPK_QUERY = \"\"\"\n",
+"    MATCH (n)\n",
+"    CALL neptune.algo.vectors.get(n)\n",
+"    YIELD embedding RETURN n, embedding[0..5] AS embedding_first_five\n",
+"    LIMIT 3\n",
+"\"\"\"\n",
+"\n",
+"config = set_config_graph_id(graph_id)\n",
+"na_graph = NeptuneGraph.from_config(config)\n",
+"all_nodes = na_graph.execute_call(TOPK_QUERY)\n",
+"for n in all_nodes:\n",
+"    print(n[\"n\"][\"~id\"] + \": \" + str(n[\"embedding_first_five\"]))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Similarity Search\n",
+"\n",
+"You can now run `neptune.algo.vectors.topK.byNode` to perform similarity search using the imported embedding vectors.\n",
+"\n",
+"This query returns the top-K most similar nodes along with their similarity scores, confirming that the embeddings are correctly integrated and usable for semantic similarity search in Amazon Neptune Analytics."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
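The final code cell is truncated in this diff. Based on the Similarity Search markdown above, the query likely resembles the sketch below; the node ID and `topK` value are illustrative placeholders, and the exact argument shape of `topK.byNode` should be checked against the Neptune Analytics documentation:

```python
# Illustrative sketch: the node id and topK value are placeholders.
SIMILARITY_QUERY = """
    MATCH (n) WHERE id(n) = '12345'
    CALL neptune.algo.vectors.topK.byNode(n, {topK: 5})
    YIELD node, score
    RETURN node, score
"""

# Reusing the NeptuneGraph wrapper from the cells above:
# results = na_graph.execute_call(SIMILARITY_QUERY)
# for r in results:
#     print(r["node"]["~id"], r["score"])
```

As the markdown notes, each result carries a similarity score, so sorting or thresholding on `score` is how "most similar products" would be surfaced.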
