|
176 | 176 | "cell_type": "markdown", |
177 | 177 | "metadata": {}, |
178 | 178 | "source": [ |
179 | | - "### Upload dataset\n", |
| 179 | + "### Upload Dataset and Register in Athena\n", |
180 | 180 | "\n", |
181 | | - "Once the embedding column has been added, the enriched dataset is uploaded to Amazon S3.\n", |
| 181 | + "After the embedding column is added, the enriched dataset is uploaded to Amazon S3.\n", |
182 | 182 | "\n", |
183 | | - "This S3 object serves as the input data source for subsequent data lake projection and import into Neptune Analytics." |
| 183 | + "An external table is then created in Amazon Athena over the uploaded CSV, exposing both the original attributes and the embedding array for SQL-based access." |
184 | 184 | ] |
185 | 185 | }, |
186 | 186 | { |
|
192 | 192 | "# Upload the enriched dataset to S3\n", |
193 | 193 | "empty_s3_bucket(s3_location_data_lake)\n", |
194 | 194 | "push_to_s3(data_w_embedding_path, _clean_s3_path(s3_location_data_lake), \"styles_embedding.csv\")\n", |
195 | | - "\n", |
196 | | - "print(\"DataLake preparation completed.\")" |
197 | | - ] |
198 | | - }, |
199 | | - { |
200 | | - "cell_type": "markdown", |
201 | | - "metadata": {}, |
202 | | - "source": [ |
203 | | - "## Data Projection\n", |
204 | | - "\n", |
205 | | - "Once the data source has been uploaded successfully, two Amazon Athena queries are executed:\n", |
206 | | - "\n", |
207 | | - "1. Create an external table over the uploaded dataset\n", |
208 | | - "2. Run a SQL projection that selects a subset of columns, including the embedding vector\n", |
209 | | - "\n", |
210 | | - "This process produces a projected .csv file that is compatible with Amazon Neptune Analytics import requirements, supporting both node property data and embedding vectors in a single file.\n" |
211 | | - ] |
212 | | - }, |
213 | | - { |
214 | | - "cell_type": "code", |
215 | | - "execution_count": null, |
216 | | - "metadata": {}, |
217 | | - "outputs": [], |
218 | | - "source": [ |
219 | 195 | "\n", |
220 | 196 | "# Create external table\n", |
221 | 197 | "create_csv_table_stmt = f\"\"\"\n", |
|
242 | 218 | "\n", |
243 | 219 | "_execute_athena_query(athena_client, create_csv_table_stmt, s3_location_log, database=s3_tables_database)\n", |
244 | 220 | "\n", |
| 221 | + "print(\"DataLake preparation completed.\")" |
| 222 | + ] |
| 223 | + }, |
| 224 | + { |
| 225 | + "cell_type": "markdown", |
| 226 | + "metadata": {}, |
| 227 | + "source": [ |
| 228 | + "## Import Data into Neptune Analytics and Perform Similarity Search\n", |
| 229 | + "\n", |
| 230 | + "A projection query is then executed in Athena to select the required columns, rename them to Neptune-compatible headers, and flatten the embedding array into a vector format.\n", |
| 231 | + "\n", |
| 232 | + "The resulting CSV is compatible with Amazon Neptune Analytics import requirements and can be ingested directly to enable vector similarity search on the graph.\n" |
| 233 | + ] |
| 234 | + }, |
| 235 | + { |
| 236 | + "cell_type": "code", |
| 237 | + "execution_count": null, |
| 238 | + "metadata": {}, |
| 239 | + "outputs": [], |
| 240 | + "source": [ |
| 241 | + "# Clear import directory\n", |
245 | 242 | "empty_s3_bucket(s3_location_import)\n", |
246 | 243 | "\n", |
247 | 244 | "# Projection\n", |
|
259 | 256 | "_execute_athena_query(athena_client, create_csv_table_stmt, s3_location_import, database=s3_tables_database)\n", |
260 | 257 | "\n", |
261 | 258 | "# Remove the unnecessary .csv.metadata file generated by Athena.\n", |
262 | | - "empty_s3_bucket(s3_location_import, file_extension=\".csv.metadata\")" |
263 | | - ] |
264 | | - }, |
265 | | - { |
266 | | - "cell_type": "markdown", |
267 | | - "metadata": {}, |
268 | | - "source": [ |
269 | | - "## Import Data into Neptune Analytics and Perform Similarity Search\n", |
| 259 | + "empty_s3_bucket(s3_location_import, file_extension=\".csv.metadata\")\n", |
270 | 260 | "\n", |
271 | | - "Once the compatible import file has been generated, the import process can be triggered to load the dataset into Amazon Neptune Analytics.\n" |
272 | | - ] |
273 | | - }, |
274 | | - { |
275 | | - "cell_type": "code", |
276 | | - "execution_count": null, |
277 | | - "metadata": {}, |
278 | | - "outputs": [], |
279 | | - "source": [ |
280 | 261 | "task_id = await instance_management.import_csv_from_s3(\n", |
281 | 262 | " NeptuneGraph.from_config(set_config_graph_id(graph_id)),\n", |
282 | 263 | " s3_location_import,\n", |
|
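The new markdown cell says the projection maps Neptune-compatible headers and flattens the embedding array into a vector format. The snippet below is a minimal local sketch of what that output shape might look like, not the notebook's actual Athena query. It assumes the import file uses `~id` and `~label` system columns, typed property headers, and an `embedding:vector` column whose values are semicolon-delimited. The column `productDisplayName`, the label `Product`, and the helper names `flatten_embedding` and `to_import_csv` are hypothetical and chosen only for illustration.

```python
import csv
import io


def flatten_embedding(values):
    """Join floats into one semicolon-delimited string.

    Assumption: this is the form expected for an `embedding:vector` column.
    """
    return ";".join(repr(float(v)) for v in values)


def to_import_csv(rows):
    """Render rows as CSV text with (assumed) Neptune-style headers."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Hypothetical header mapping: system columns first, then a typed
    # property column, then the flattened embedding.
    writer.writerow(["~id", "~label", "productDisplayName:String", "embedding:vector"])
    for row in rows:
        writer.writerow(
            [row["id"], "Product", row["name"], flatten_embedding(row["embedding"])]
        )
    return buf.getvalue()


# Illustrative record only; not taken from the real dataset.
sample = [{"id": "15970", "name": "Navy Shirt", "embedding": [0.1, -0.25, 0.5]}]
csv_text = to_import_csv(sample)
print(csv_text)
```

Using a delimiter other than the comma inside the embedding cell keeps the vector in a single CSV field, so the header row and every data row have the same number of columns.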