
Commit 6259b06 ("Update files"), 1 parent: 47918dc

26 files changed: +1092, −334 lines

README.md

Lines changed: 22 additions & 10 deletions

````diff
@@ -59,16 +59,26 @@ Create a **Dedicated** cluster with the Neo4j Spark Connector:
 
 ### 2. Import the Workshop
 
-1. Clone or download this repository
-2. In Databricks, go to **Workspace**
-3. Click **Import** and upload the `labs/` folder
+In Databricks, go to **Workspace** > right-click your user folder > **Import** > **URL** and paste:
+
+```
+<DBC_URL>
+```
+
+This imports all lab notebooks into your workspace. Data files (CSV, HTML, embeddings) are downloaded automatically from GitHub when you run the setup notebook.
+
+> **Alternative:** If you prefer to import manually, clone the repo and use the Databricks CLI:
+> ```bash
+> git clone https://github.com/neo4j-partners/graph-enrichment.git
+> databricks workspace import-dir graph-enrichment/labs /Users/<your-email>/graph-enrichment
+> ```
 
 ### 3. Run Required Setup
 
-Open and run **labs/0 - Required Setup**. It will:
+Open and run **0 - Required Setup**. It will:
 
 - Create a catalog, schema, and volume based on your username
-- Copy all data files (CSV, HTML, and pre-computed embeddings) to the volume
+- Download all data files (CSV, HTML, and pre-computed embeddings) from GitHub into your volume
 - Prompt you for Neo4j connection details and store them as Databricks secrets
 - Verify the Neo4j connection
 
@@ -99,15 +109,19 @@ graph-enrichment/
 ├── labs/
 │   ├── 0 - Required Setup.py          # Environment setup notebook
 │   ├── 1 - Neo4j Import.py            # Single-step Neo4j data import
+│   ├── 4 - Neo4j to Lakehouse.py      # Export graph to Delta tables
+│   ├── 5 - AI Agents.py               # Genie + Knowledge Assistant
+│   ├── 6 - Supervisor Agent.py        # Multi-agent coordinator
 │   └── Includes/
-│       ├── config.yaml                # Workshop configuration
+│       ├── config.py                  # Workshop configuration (imported via %run)
 │       ├── _lib/
-│       │   ├── setup_orchestrator.py  # Setup logic
+│       │   ├── setup_orchestrator.py  # Setup + GitHub data download
 │       │   └── neo4j_import.py        # Import logic
 │       └── data/
 │           ├── csv/                   # Source CSV files (7 files)
 │           ├── html/                  # Source HTML documents (14 files)
 │           └── embeddings/            # Pre-computed embedding vectors
+├── build_dbc.py                       # Script to package labs/ as a .dbc archive
 ├── lab_7_augmentation_agent/          # Lab 7: Graph Augmentation
 ├── full_demo/                         # Reference implementation, validation scripts, and admin tools
 ├── docs/                              # Reference documentation
@@ -124,11 +138,10 @@ graph-enrichment/
 | Runtime | 13.3 LTS ML or higher (Spark 3.x) |
 | Maven Library | `org.neo4j:neo4j-connector-apache-spark_2.12:5.3.1_for_spark_3` |
 
-The **ML Runtime** is recommended because it includes `pyyaml`, `neo4j`, and `beautifulsoup4`. If using a standard (non-ML) runtime, install these Python packages as cluster libraries:
+The **ML Runtime** is recommended because it includes `neo4j` and `beautifulsoup4`. If using a standard (non-ML) runtime, install these Python packages as cluster libraries:
 
 | Package | Used By |
 |---------|---------|
-| `pyyaml` | Setup notebook (reads config.yaml) |
 | `neo4j` | Import notebook (Neo4j Python driver for document graph) |
 | `beautifulsoup4` | Embedding generation (`generate_embeddings.py`, not student-facing) |
 | `databricks-langchain` | Embedding generation (`generate_embeddings.py`, not student-facing) |
@@ -172,7 +185,6 @@ Any issues discovered through the use of this project should be filed as GitHub
 | pydantic | Data validation | MIT | https://github.com/pydantic/pydantic |
 | mlflow | ML experiment tracking | Apache 2.0 | https://github.com/mlflow/mlflow |
 | beautifulsoup4 | HTML parsing | MIT | https://www.crummy.com/software/BeautifulSoup/ |
-| pyyaml | YAML parsing | MIT | https://github.com/yaml/pyyaml |
 | sentence-transformers | Embedding models | Apache 2.0 | https://github.com/UKPLab/sentence-transformers |
 
 &copy; 2026 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth above.
````

build_dbc.py

Lines changed: 88 additions & 0 deletions

New file (@@ -0,0 +1,88 @@):

```python
#!/usr/bin/env python3
"""
Build a .dbc archive from the labs/ directory.

A .dbc file is a ZIP archive containing Databricks notebooks. Each notebook
is stored as a JSON entry with its source code, language, and relative path.

Usage:
    python build_dbc.py                     # outputs labs.dbc
    python build_dbc.py -o my_workshop.dbc  # custom output name
"""

import argparse
import base64
import json
import os
import zipfile

LABS_DIR = os.path.join(os.path.dirname(__file__), "labs")

# Map file extensions to Databricks language identifiers
LANG_MAP = {
    ".py": "PYTHON",
    ".sql": "SQL",
    ".scala": "SCALA",
    ".r": "R",
}


def build_dbc(labs_dir: str, output_path: str):
    """Package all notebook files into a .dbc archive."""
    notebooks = []

    for root, _dirs, files in os.walk(labs_dir):
        for filename in sorted(files):
            ext = os.path.splitext(filename)[1].lower()
            if ext not in LANG_MAP:
                continue

            filepath = os.path.join(root, filename)
            rel_path = os.path.relpath(filepath, labs_dir)

            # Remove the file extension for the notebook path
            notebook_path = os.path.splitext(rel_path)[0]

            with open(filepath, "r") as f:
                source = f.read()

            notebooks.append({
                "path": notebook_path,
                "language": LANG_MAP[ext],
                "source": source,
            })

    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for nb in notebooks:
            # Each entry is a JSON file at the notebook's path
            entry_path = nb["path"]
            entry = json.dumps({
                "version": "NotebookV1",
                "origId": 0,
                "name": os.path.basename(nb["path"]),
                "language": nb["language"],
                "commands": [
                    {
                        "version": "CommandV1",
                        "origId": 0,
                        "guid": "",
                        "subtype": "command",
                        "commandType": "auto",
                        "position": 1.0,
                        "command": nb["source"],
                    }
                ],
            })
            zf.writestr(entry_path, entry)

    print(f"Built {output_path}")
    print(f"  {len(notebooks)} notebooks:")
    for nb in notebooks:
        print(f"    {nb['path']} ({nb['language']})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Build a .dbc archive from labs/")
    parser.add_argument("-o", "--output", default="labs.dbc", help="Output .dbc filename")
    args = parser.parse_args()
    build_dbc(LABS_DIR, args.output)
```
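Since a .dbc is just a ZIP of JSON entries, the structure the script emits can be round-tripped with the standard library alone. A minimal sketch (the entry contents and the `example.dbc` filename are illustrative, not taken from the repo):

```python
import json
import os
import tempfile
import zipfile

# One entry, mirroring the JSON shape that build_dbc.py writes per notebook.
entry = {
    "version": "NotebookV1",
    "origId": 0,
    "name": "0 - Required Setup",
    "language": "PYTHON",
    "commands": [{
        "version": "CommandV1",
        "origId": 0,
        "guid": "",
        "subtype": "command",
        "commandType": "auto",
        "position": 1.0,
        "command": "print('hello')",
    }],
}

dbc_path = os.path.join(tempfile.mkdtemp(), "example.dbc")
with zipfile.ZipFile(dbc_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("0 - Required Setup", json.dumps(entry))

# Read it back the way an importer would unpack it.
with zipfile.ZipFile(dbc_path) as zf:
    names = zf.namelist()
    notebook = json.loads(zf.read(names[0]))

print(notebook["language"], len(notebook["commands"]))
```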

full_demo/README.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -107,7 +107,7 @@ run_lab7.py # 7. DSPy augmentation agent (requires Supervisor Agent from
 ## How It Works
 
 - `python -m cli upload` pushes Python files to the Databricks workspace via the Databricks SDK
-- `python -m cli submit` checks that the cluster is RUNNING (errors if not), injects Neo4j credentials from `.env` as command-line arguments, and submits a one-shot job via the SDK Jobs API
-- Each script uses `argparse` to receive credentials and prints PASS/FAIL for each verification check
+- `python -m cli submit` checks that the cluster is RUNNING (auto-starts if terminated), passes all non-core `.env` keys as `KEY=VALUE` parameters, and submits a one-shot job via the SDK Jobs API
+- Each script parses `KEY=VALUE` parameters from `sys.argv` into `os.environ` at startup, then reads configuration via `os.environ` / `os.getenv()`
 - Scripts exit with code 0 on success, code 1 on any failure
 - `python -m cli clean` removes the remote workspace directory and deletes job runs matching the `graph_validation:` prefix
```
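The `KEY=VALUE` startup convention is small enough to sketch end to end. This standalone version mirrors the loop the validation scripts inline (the parameter values here are placeholders, not real credentials):

```python
import os
import sys

def parse_params(argv):
    """Copy KEY=VALUE job parameters into os.environ, as the validation
    scripts do at startup. Flag-style arguments starting with '-' are
    skipped, and setdefault means pre-existing environment variables win."""
    for arg in argv:
        if "=" in arg and not arg.startswith("-"):
            key, _, value = arg.partition("=")
            os.environ.setdefault(key, value)

# Simulate what a submitted job would see on sys.argv[1:].
parse_params([
    "NEO4J_URI=neo4j+s://example.databases.neo4j.io",  # placeholder URI
    "--verbose",                                       # flag: ignored
    "NEO4J_USERNAME=neo4j",
])
print(os.environ["NEO4J_URI"])
```

In the real scripts the same loop runs at module import time over `sys.argv[1:]`, before any configuration is read.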

full_demo/agent_modules/check_neo4j.py

Lines changed: 12 additions & 13 deletions

```diff
@@ -7,29 +7,28 @@
     python -m cli upload check_neo4j.py && python -m cli submit check_neo4j.py
 """
 
-import argparse
+import os
 import sys
 import time
 
+# Parse KEY=VALUE parameters from cli.submit into environment variables.
+for _arg in sys.argv[1:]:
+    if "=" in _arg and not _arg.startswith("-"):
+        _key, _, _value = _arg.partition("=")
+        os.environ.setdefault(_key, _value)
+
 
 def main():
-    parser = argparse.ArgumentParser(description="Neo4j Connectivity Check")
-    parser.add_argument("--neo4j-uri", required=True, help="Neo4j Aura URI")
-    parser.add_argument("--neo4j-username", default="neo4j", help="Neo4j username")
-    parser.add_argument("--neo4j-password", required=True, help="Neo4j password")
-    parser.add_argument(
-        "--volume-path",
-        default="",
-        help="(unused, accepted for cli.submit compatibility)",
-    )
-    args = parser.parse_args()
+    neo4j_uri = os.environ["NEO4J_URI"]
+    neo4j_username = os.getenv("NEO4J_USERNAME", "neo4j")
+    neo4j_password = os.environ["NEO4J_PASSWORD"]
 
     from neo4j import GraphDatabase
 
     print("=" * 60)
     print("Neo4j Connectivity Check")
     print("=" * 60)
-    print(f"Neo4j URI: {args.neo4j_uri}")
+    print(f"Neo4j URI: {neo4j_uri}")
     print()
 
     results = []  # (name, passed, detail)
@@ -44,7 +43,7 @@ def record(name, passed, detail=""):
     try:
         t0 = time.time()
         driver = GraphDatabase.driver(
-            args.neo4j_uri, auth=(args.neo4j_username, args.neo4j_password)
+            neo4j_uri, auth=(neo4j_username, neo4j_password)
         )
         driver.verify_connectivity()
         elapsed = time.time() - t0
```

full_demo/agent_modules/generate_embeddings.py

Lines changed: 18 additions & 22 deletions

```diff
@@ -12,24 +12,25 @@
     3. Neo4j credentials configured (for document type classification)
 
 Usage:
-    Run as a Databricks job via submit.sh, or directly on a cluster:
-
-    python generate_embeddings.py \
-        --volume-path /Volumes/catalog/schema/volume \
-        --output-path /Volumes/catalog/schema/volume/embeddings/document_chunks_embedded.json
-
-    Download the output JSON and commit it to Includes/data/embeddings/.
+    python -m cli upload generate_embeddings.py && python -m cli submit generate_embeddings.py
 """
 
-import argparse
 import json
+import os
 import re
+import sys
 import time
 import uuid
 from datetime import datetime, timezone
 from enum import Enum
 from typing import Optional
 
+# Parse KEY=VALUE parameters from cli.submit into environment variables.
+for _arg in sys.argv[1:]:
+    if "=" in _arg and not _arg.startswith("-"):
+        _key, _, _value = _arg.partition("=")
+        os.environ.setdefault(_key, _value)
+
 from bs4 import BeautifulSoup
 
 
@@ -206,28 +207,23 @@ def generate_embeddings_databricks(texts: list[str], endpoint: str = "databricks
 # =============================================================================
 
 def main():
-    parser = argparse.ArgumentParser(description="Generate pre-computed embeddings for workshop HTML files")
-    parser.add_argument("--volume-path", required=True, help="Unity Catalog Volume path containing HTML files")
-    parser.add_argument("--output-path", default=None, help="Output path for JSON file (defaults to volume-path/embeddings/document_chunks_embedded.json)")
-    parser.add_argument("--endpoint", default="databricks-gte-large-en", help="Databricks embedding model endpoint")
-    args = parser.parse_args()
-
-    output_path = args.output_path or f"{args.volume_path}/embeddings/document_chunks_embedded.json"
+    volume_path = os.environ["DATABRICKS_VOLUME_PATH"]
+    endpoint = os.getenv("EMBEDDING_ENDPOINT", "databricks-gte-large-en")
+    output_path = os.getenv("EMBEDDING_OUTPUT_PATH") or f"{volume_path}/embeddings/document_chunks_embedded.json"
 
     print("=" * 70)
     print("EMBEDDING GENERATION - Pre-computing document embeddings")
     print("=" * 70)
-    print(f"Volume path: {args.volume_path}")
+    print(f"Volume path: {volume_path}")
     print(f"Output path: {output_path}")
-    print(f"Endpoint: {args.endpoint}")
+    print(f"Endpoint: {endpoint}")
     print("")
 
     # Step 1: List HTML files
     print("[1/4] Listing HTML files...")
-    html_path = f"{args.volume_path}/html"
+    html_path = f"{volume_path}/html"
 
     # Volumes are regular filesystem paths on Databricks clusters
-    import os
     html_files = sorted(
         f for f in os.listdir(html_path) if f.endswith(".html")
     )
@@ -282,7 +278,7 @@ def main():
     start_time = time.time()
 
     texts = [chunk["text"] for chunk in all_chunks]
-    embeddings = generate_embeddings_databricks(texts, endpoint=args.endpoint)
+    embeddings = generate_embeddings_databricks(texts, endpoint=endpoint)
 
     for i, chunk in enumerate(all_chunks):
         chunk["embedding"] = embeddings[i]
@@ -298,7 +294,7 @@ def main():
     output = {
         "metadata": {
             "generated_at": datetime.now(timezone.utc).isoformat(),
-            "embedding_model": args.endpoint,
+            "embedding_model": endpoint,
             "embedding_dimensions": dimensions,
             "chunk_size": 4000,
             "chunk_overlap": 200,
@@ -324,7 +320,7 @@ def main():
     print("=" * 70)
     print(f"  Documents: {len(documents)}")
    print(f"  Chunks: {len(all_chunks)}")
-    print(f"  Model: {args.endpoint}")
+    print(f"  Model: {endpoint}")
     print(f"  Dims: {dimensions}")
     print("")
     print("Next steps:")
```
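The output metadata above records `chunk_size: 4000` and `chunk_overlap: 200`, but the chunking function itself is outside this hunk. A hypothetical sketch of fixed-size chunking with overlap, consistent with those parameters but not the repo's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Hypothetical sketch: split text into chunk_size-character windows,
    each overlapping the previous one by `overlap` characters. Defaults
    match the chunk_size/chunk_overlap recorded in the output metadata."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window reached the end of the text
        start += chunk_size - overlap
    return chunks

# Small sizes for illustration: 10-char windows with a 3-char overlap.
print(chunk_text("abcdefghijklmnopqrst", chunk_size=10, overlap=3))
```

Each chunk then gets its embedding attached (`chunk["embedding"] = embeddings[i]`) before being written to the output JSON.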

full_demo/agent_modules/run_augmentation_agent.py

Lines changed: 10 additions & 0 deletions

```diff
@@ -7,6 +7,16 @@
     job JSON (cli.submit handles this automatically).
 """
 
+import os
+import sys
+
+# Parse KEY=VALUE parameters from cli.submit into environment variables.
+# databricks_job_runner is not available on the cluster, so we inline the logic.
+for _arg in sys.argv[1:]:
+    if "=" in _arg and not _arg.startswith("-"):
+        _key, _, _value = _arg.partition("=")
+        os.environ.setdefault(_key, _value)
+
 from augmentation_agent.__main__ import main
 
 if __name__ == "__main__":
```
