Notebook: Databricks connector #128
Merged: andy-k-improving merged 15 commits into `ft-hf-databricks-connector-feature-branch` from `ft-hf-db-notebook` on Apr 8, 2026.
Commits (15, all by andy-k-improving):

- c59e4b3 Init commit
- 4db6878 Folder restructure
- 8e01c93 Update tests
- b1f33e3 SAM template
- eae83a1 Update doc
- 02436a3 Databricks notebook
- e35bcef Update notebook
- c38bb5e Update testcase
- 9826d7d Update notebooks/import_databricks_demo.ipynb
- dad45d9 Update notebooks/import_databricks_demo.ipynb
- aad7639 Update notebooks/import_databricks_demo.ipynb
- 378fe15 Update notebooks/import_databricks_demo.ipynb
- 4d7df12 Update notebooks/import_databricks_demo.ipynb
- d411d8a Update doc
- 01dfdb9 Make NEV optional
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,374 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Importing Databricks Data into Neptune Analytics via Athena Federation\n", | ||
| "This notebook demonstrates how to connect to PaySim data stored in Databricks and, using Athena Federated Query, create a graph view of the data in Neptune Analytics.\n", | ||
| "\n", | ||
| "\n", | ||
| "### Prerequisite\n", | ||
| "\n", | ||
| "To query a Databricks table from Amazon Athena, the Athena Databricks Connector must first be deployed in your AWS account.\n", | ||
| "Deployment and setup instructions are available in the following resources:\n", | ||
| "\n", | ||
| "**Installation guide**\n", | ||
| "\n", | ||
| "../connectors/athena-databricks-connector/README.md\n", | ||
| "\n", | ||
| "\n", | ||
| "### What this notebook covers:\n", | ||
| "1. Download the [Kaggle PaySim1 dataset](https://www.kaggle.com/datasets/ealaxi/paysim1), a synthetic financial dataset simulating mobile money transactions, and upload it to Databricks.\n", | ||
| "\n", | ||
| "2. Use Amazon Athena to federate queries against the Databricks table and generate vertex/edge projections.\n", | ||
| "3. Import the projections into Amazon Neptune Analytics.\n", | ||
| "4. Run community detection (Louvain) to identify transaction clusters.\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Setup\n", | ||
| "\n", | ||
| "Import the necessary libraries and set up logging." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import dotenv\n", | ||
| "import kagglehub\n", | ||
| "from pathlib import Path\n", | ||
| "from databricks.sdk import WorkspaceClient\n", | ||
| "from databricks import sql\n", | ||
| "\n", | ||
| "from nx_neptune import empty_s3_bucket, instance_management, NeptuneGraph, set_config_graph_id\n", | ||
| "from nx_neptune.instance_management import execute_athena_query, _clean_s3_path, get_athena_query_results\n", | ||
| "from nx_neptune.utils.utils import get_stdout_logger, validate_and_get_env\n", | ||
| "\n", | ||
| "\n", | ||
| "# Configure logging to see detailed information about the instance creation process\n", | ||
| "logger = get_stdout_logger(__name__, [\n", | ||
| " 'nx_neptune.instance_management',\n", | ||
| " 'nx_neptune.utils.task_future',\n", | ||
| " 'nx_neptune.interface',\n", | ||
| " __name__\n", | ||
| "])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Configuration\n", | ||
| "\n", | ||
| "Check for environment variables necessary for the notebook." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Required environment variables for Neptune Analytics and Athena \n", | ||
| "dotenv.load_dotenv() \n", | ||
| "env_vars = validate_and_get_env([ \n", | ||
| " 'NETWORKX_S3_DATA_LAKE_BUCKET_PATH', \n", | ||
| " 'NETWORKX_S3_NA_IMPORT_BUCKET_PATH', \n", | ||
| " 'NETWORKX_S3_LOG_BUCKET_PATH', \n", | ||
| " 'NETWORKX_S3_TABLES_DATABASE', \n", | ||
| " 'NETWORKX_S3_TABLES_TABLENAME', \n", | ||
| " 'NETWORKX_GRAPH_ID', \n", | ||
| "]) \n", | ||
| " \n", | ||
| "(s3_location_data_lake, s3_location_na_import, s3_location_log, \n", | ||
| "s3_tables_database, s3_tables_tablename, graph_id) = env_vars.values() \n", | ||
| " \n", | ||
| "# Optional — only needed to upload test data to Databricks (skip if table already exists) \n", | ||
| "db_env = validate_and_get_env([ \n", | ||
| " 'DATABRICKS_HOST', \n", | ||
| " 'DATABRICKS_TOKEN', \n", | ||
| " 'DATABRICKS_HTTP_PATH', \n", | ||
| "]) \n", | ||
| " \n", | ||
| "db_host, db_token, db_http_path = db_env.values() \n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Test Data Setup\n", | ||
| "\n", | ||
| "The transaction dataset used in this demo is sourced from [Kaggle PaySim1 dataset](https://www.kaggle.com/datasets/ealaxi/paysim1)\n", | ||
| ", a synthetic financial dataset simulating mobile money\n", | ||
| "transactions.\n", | ||
| "\n", | ||
| "The setup cell below automates the full data ingestion pipeline:\n", | ||
| "\n", | ||
| "1. Download — The dataset is fetched programmatically using the kagglehub package, which \n", | ||
| "handles caching to avoid redundant downloads on subsequent runs.\n", | ||
| "\n", | ||
| "2. Upload to Databricks Volume — The CSV file is uploaded to a Unity Catalog Volume via the \n", | ||
| "Databricks SDK, staging it in cloud storage accessible by the SQL Warehouse.\n", | ||
| "\n", | ||
| "3. Create Delta Table — A `CREATE TABLE AS SELECT` statement reads the CSV from the Volume and \n", | ||
| "materializes it as a managed Delta table. Schema is inferred automatically from the CSV \n", | ||
| "headers.\n", | ||
| "\n", | ||
| "If the table already exists, all three steps are skipped entirely." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "volume_path = \"/Volumes/workspace/default/test_paysim_vol\"\n", | ||
| "table_name = \"workspace.default.paysim_transactions\"\n", | ||
| "\n", | ||
| "with sql.connect(server_hostname=db_host, http_path=db_http_path, access_token=db_token) as conn:\n", | ||
| " cursor = conn.cursor()\n", | ||
| " cursor.execute(f\"SHOW TABLES IN workspace.default LIKE 'paysim_transactions'\")\n", | ||
| " if cursor.fetchone():\n", | ||
| " print(f\"{table_name} already exists, skipping\")\n", | ||
| " else:\n", | ||
| " print(f\"{table_name} does not exist; creating it\")\n", | ||
| " # Download from Kaggle\n", | ||
| " paysim_path = Path(kagglehub.dataset_download(\"ealaxi/paysim1\"))\n", | ||
| " csv_file = next(paysim_path.glob(\"*.csv\"))\n", | ||
| "\n", | ||
| " # Upload to Volume\n", | ||
| " w = WorkspaceClient(host=f\"https://{db_host}\", token=db_token)\n", | ||
| " with open(csv_file, \"rb\") as f:\n", | ||
| " w.files.upload(f\"{volume_path}/{csv_file.name}\", f, overwrite=True)\n", | ||
| "\n", | ||
| " # Create table\n", | ||
| " cursor.execute(f\"\"\"\n", | ||
| " CREATE TABLE {table_name} AS\n", | ||
| " SELECT * FROM read_files(\n", | ||
| " '{volume_path}/{csv_file.name}',\n", | ||
| " format => 'csv', header => true, inferSchema => true)\"\"\")\n", | ||
| "\n", | ||
| " print(f\"Created {table_name}\")\n", | ||
| " cursor.close()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Data Verification\n", | ||
| "\n", | ||
| "Quick sanity check to confirm the Databricks table is accessible via the Athena federated connector." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "query = 'SELECT * FROM \"lambda:databricks\".\"default\".\"paysim_transactions\" LIMIT 1'\n", | ||
| "\n", | ||
| "result = await execute_athena_query(query, s3_location_na_import, database=s3_tables_database)\n", | ||
| "query_id = result[0].task_id\n", | ||
| "\n", | ||
| "rows = get_athena_query_results(query_id)\n", | ||
| "assert len(rows) == 2, f\"Expected 2 rows (1 header + 1 data), got {len(rows)}\"\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Data Transformation and Graph Import\n", | ||
| "\n", | ||
| "In this step, Amazon Athena federates queries against the Databricks Unity Catalog table via \n", | ||
| "the Athena-Databricks connector. Two projections are generated:\n", | ||
| "\n", | ||
| "1. Vertex CSV — Extracts distinct customer IDs (both source and destination) from the \n", | ||
| "transaction dataset to create graph nodes.\n", | ||
| "2. Edge CSV — Maps each transaction as a directed edge between customers, carrying transaction \n", | ||
| "attributes (type, amount, balances, fraud flags) as edge properties.\n", | ||
| "\n", | ||
| "Both projections are written to S3 in Neptune Analytics' CSV import format, cleaned of Athena \n", | ||
| "metadata files, and then bulk-imported into the graph.\n", | ||
| "\n", | ||
| "After completion, the graph contains customer nodes connected by transaction edges — ready for\n", | ||
| "graph analytics (e.g., fraud ring detection, centrality analysis).\n", | ||
| "\n", | ||
| "> **Troubleshooting:** If the Athena federated connector is not configured properly (e.g., the Lambda \n", | ||
| "function does not exist or the connector name is incorrect), you will receive a \n", | ||
| "`GENERIC_USER_ERROR` with a `ResourceNotFoundException`. Ensure the connector Lambda function is \n", | ||
| "deployed and that the catalog name in your query (e.g., `lambda:databricks`) matches the registered \n", | ||
| "connector name.\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Clear import directory\n", | ||
| "empty_s3_bucket(s3_location_na_import)\n", | ||
| "\n", | ||
| "# Generate vertex and edge projections from Databricks table\n", | ||
| "databricks_table_ref = '\"lambda:databricks\".\"default\".\"paysim_transactions\"'\n", | ||
| "\n", | ||
| "SOURCE_AND_DESTINATION_CUSTOMERS = f\"\"\"\n", | ||
| "SELECT DISTINCT \"~id\", 'customer' AS \"~label\"\n", | ||
| "FROM (\n", | ||
| " SELECT NAMEORIG as \"~id\" FROM {databricks_table_ref} WHERE NAMEORIG IS NOT NULL\n", | ||
| " UNION ALL\n", | ||
| " SELECT NAMEDEST as \"~id\" FROM {databricks_table_ref} WHERE NAMEDEST IS NOT NULL\n", | ||
| ")\n", | ||
| "\"\"\"\n", | ||
| "\n", | ||
| "BANK_TRANSACTIONS = f\"\"\"\n", | ||
| "SELECT\n", | ||
| " NAMEORIG as \"~from\",\n", | ||
| " NAMEDEST as \"~to\",\n", | ||
| " TYPE AS \"~label\",\n", | ||
| " STEP AS \"step:Int\",\n", | ||
| " AMOUNT AS \"amount:Float\",\n", | ||
| " OLDBALANCEORG AS \"oldbalanceOrg:Float\",\n", | ||
| " NEWBALANCEORIG AS \"newbalanceOrig:Float\",\n", | ||
| " OLDBALANCEDEST AS \"oldbalanceDest:Float\",\n", | ||
| " NEWBALANCEDEST AS \"newbalanceDest:Float\",\n", | ||
| " ISFRAUD AS \"isFraud:Int\",\n", | ||
| " ISFLAGGEDFRAUD AS \"isFlaggedFraud:Int\"\n", | ||
| "FROM {databricks_table_ref} WHERE NAMEORIG IS NOT NULL AND NAMEDEST IS NOT NULL\n", | ||
| "\"\"\"\n", | ||
| "\n", | ||
| "await execute_athena_query(SOURCE_AND_DESTINATION_CUSTOMERS, s3_location_na_import, database=s3_tables_database, polling_interval=15)\n", | ||
| "await execute_athena_query(BANK_TRANSACTIONS, s3_location_na_import, database=s3_tables_database, polling_interval=15)\n", | ||
| "\n", | ||
| "# Remove unnecessary .csv.metadata files generated by Athena\n", | ||
| "empty_s3_bucket(s3_location_na_import, file_extension=\".csv.metadata\")\n", | ||
| "\n", | ||
| "task_id = await instance_management.import_csv_from_s3(\n", | ||
| " NeptuneGraph.from_config(set_config_graph_id(graph_id)),\n", | ||
| " s3_location_na_import)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "\n", | ||
| "## Graph Analytics\n", | ||
| "\n", | ||
| "With the transaction graph loaded, we run community detection to identify clusters of \n", | ||
| "customers with dense transaction patterns — a common technique for fraud ring detection.\n", | ||
| "\n", | ||
| "1. Verify Import — Confirm nodes (customers) and edges (transactions) were loaded correctly.\n", | ||
| "2. Community Detection — Run the Louvain algorithm on Neptune Analytics to partition the graph \n", | ||
| "into communities, writing the result as a community property on each node.\n", | ||
| "3. Analyze Communities — Query the top 10 communities by size to understand the graph's \n", | ||
| "structure and identify unusually large or tightly connected groups for further investigation.\n", | ||
| "\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "config = set_config_graph_id(graph_id)\n", | ||
| "na_graph = NeptuneGraph.from_config(config)\n", | ||
| "\n", | ||
| "# Verify nodes\n", | ||
| "all_nodes = na_graph.execute_call(\"MATCH (n) RETURN n LIMIT 3\")\n", | ||
| "print(\"Sample Nodes:\")\n", | ||
| "for n in all_nodes:\n", | ||
| " print(f\" {n['n']['~id']} ({n['n']['~labels'][0]})\")\n", | ||
| "\n", | ||
| "# Verify edges\n", | ||
| "all_edges = na_graph.execute_call(\"MATCH ()-[r]-() RETURN r LIMIT 5\")\n", | ||
| "print(\"\\nSample Edges:\")\n", | ||
| "for e in all_edges:\n", | ||
| " r = e[\"r\"]\n", | ||
| " print(f\" {r['~start']} --[{r['~type']}, amount: {r['~properties']['amount']}]--> {r['~end']}\")\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Run Louvain algorithm and mutate graph with community property\n", | ||
| "louvain_result = na_graph.execute_call(\n", | ||
| " 'CALL neptune.algo.louvain.mutate({iterationTolerance:1e-07, writeProperty:\"community\"}) '\n", | ||
| " 'YIELD success AS success RETURN success'\n", | ||
| ")\n", | ||
| "print(f\"Louvain result: {louvain_result}\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Find the top 10 communities by size\n", | ||
| "top_communities = na_graph.execute_call(\"\"\"\n", | ||
| "MATCH (n)\n", | ||
| "WHERE n.community IS NOT NULL\n", | ||
| "RETURN n.community AS community, count(*) AS community_size\n", | ||
| "ORDER BY community_size DESC\n", | ||
| "LIMIT 10\n", | ||
| "\"\"\")\n", | ||
| "\n", | ||
| "print(\"Top 10 Communities:\")\n", | ||
| "print(f\" {'Community ID':>14} {'Size':>6}\")\n", | ||
| "print(f\" {'─' * 14} {'─' * 6}\")\n", | ||
| "for c in top_communities:\n", | ||
| " print(f\" {c['community']:>14} {c['community_size']:>6}\")\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Conclusion\n", | ||
| "\n", | ||
| "This notebook demonstrated an end-to-end workflow for federated graph analytics: \n", | ||
| "Sourcing transaction data from Databricks via the Athena-Databricks connector, transforming it into a graph-compatible format with Athena, importing it into Neptune Analytics, and running community detection to surface transaction clusters.\n", | ||
| "\n", | ||
| "This pattern enables teams to leverage existing data in Databricks without data duplication — Athena federates the query at runtime, and Neptune Analytics provides the graph compute layer for analytics that relational engines aren't designed for." | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3 (ipykernel)", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.13.12" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 4 | ||
| } |