Commit a697978

feat: new visualizations and dataframe manipulations and local script setup
1 parent a1fe19b commit a697978

4 files changed

Lines changed: 815 additions & 239 deletions

src/neo4j-quickstart/README.md

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,16 +2,25 @@
22

33
For the Open Pulse project, you will manipulate graph data with Neo4j. The example notebook shows how to download and visualize the graph.
44

5-
## Set-up of dependencies
5+
## Working on Renku with Jupyter Notebook
6+
7+
This set-up supports Jupyter Notebook, if that is your preferred way of working.
8+
9+
See `2025-hackathon.md` at the root of the repository for instructions.
10+
11+
## Working locally
612

713
We use uv ([installation instructions](https://docs.astral.sh/uv/getting-started/installation/)).
814

915
1. Create virtual environment: `uv venv`
1016
2. Activate it: `source .venv/bin/activate`
1117
3. Install the dependencies pinned in the `uv.lock` file: `uv sync`
18+
4. Run the code with `python quickstart.py`
19+
20+
Note: this set-up does not cover integrating uv with Jupyter Notebook, so you will need to work with Python scripts only.
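
`quickstart.py` reads its connection settings from the environment variables `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` and `NEO4J_DATABASE` (optionally loaded from a `.env` file by `load_dotenv()`). Because `os.environ.get` silently returns `None` for a missing variable, it can help to check them before step 4. The following is a minimal sketch only: `missing_neo4j_vars` is a hypothetical helper that is not part of this repository.

```python
import os

# Variable names as read by get_downloader() in quickstart.py
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "NEO4J_DATABASE")


def missing_neo4j_vars(environ=os.environ):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `environ` can be any mapping,
    which makes the check easy to try without touching the real environment.
    """
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with a fake environment: password is empty, database is missing
example_env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USER": "neo4j", "NEO4J_PASSWORD": ""}
example_missing = missing_neo4j_vars(example_env)
print("Missing variables:", example_missing)
```

Calling `missing_neo4j_vars()` without arguments checks the real environment; call it after `load_dotenv()` if you rely on a `.env` file.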
1221

1322
## Build docker
1423

15-
Locally: `docker build -f tools/images/Dockerfile -t test .`
24+
Locally: `docker build -f tools/images/Dockerfile -t neo4j-quickstart .`
1625

1726
Otherwise, use the GitHub CI integrated in the GitHub workflows of this repository.

src/neo4j-quickstart/quickstart.ipynb

Lines changed: 87 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -59,24 +59,16 @@
5959
"source": [
6060
"import neo4j\n",
6161
"from utils.neo4jdownloader import Neo4JDownloader\n",
62-
"#from dotenv import load_dotenv\n",
6362
"from pathlib import Path\n",
6463
"import os\n",
6564
"\n",
66-
"#load_dotenv() # Load environment variables from .env file\n",
67-
"\n",
6865
"def get_downloader():\n",
6966
" secrets_dir = Path(\"/secrets\")\n",
7067
" NEO4J_URI = (secrets_dir / \"neo4j_uri\").read_text()\n",
7168
" NEO4J_USERNAME = (secrets_dir / \"neo4j_user\").read_text()\n",
7269
" NEO4J_PASSWORD = (secrets_dir / \"neo4j_password\").read_text()\n",
7370
" NEO4J_DATABASE = (secrets_dir / \"neo4j_database\").read_text()\n",
7471
"\n",
75-
" # NEO4J_URI = os.environ.get(\"NEO4J_URI\")\n",
76-
" # NEO4J_USERNAME = os.environ.get(\"NEO4J_USER\")\n",
77-
" # NEO4J_PASSWORD = os.environ.get(\"NEO4J_PASSWORD\")\n",
78-
" # NEO4J_DATABASE = neo4j_database\n",
79-
"\n",
8072
" return Neo4JDownloader(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE)\n",
8173
"\n",
8274
"def extract_data(nodes, relationships):\n",
@@ -117,6 +109,14 @@
117109
"df.head()"
118110
]
119111
},
112+
{
113+
"cell_type": "markdown",
114+
"id": "2e0e247e",
115+
"metadata": {},
116+
"source": [
117+
"Let's build a graph from the first 200 rows of the dataframe:"
118+
]
119+
},
120120
{
121121
"cell_type": "code",
122122
"execution_count": null,
@@ -128,6 +128,57 @@
128128
"graph = df_to_pydantic_models(df.head(200), relationships)"
129129
]
130130
},
131+
{
132+
"cell_type": "markdown",
133+
"id": "43d755ac",
134+
"metadata": {},
135+
"source": [
136+
"Let's filter the dataframe, as we would any pandas dataframe, to keep every row that mentions EPFL or SDSC. We will then use these subsets in the visualizations that follow."
137+
]
138+
},
139+
{
140+
"cell_type": "code",
141+
"execution_count": null,
142+
"id": "ede627ff",
143+
"metadata": {},
144+
"outputs": [],
145+
"source": [
146+
"import re\n",
147+
"\n",
148+
"epfl_pattern = r\"EPFL\"\n",
149+
"epfl_df = df[\n",
150+
" df['source'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False) |\n",
151+
" df['target'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False)\n",
152+
"]\n",
153+
"epfl_df.head()"
154+
]
155+
},
156+
{
157+
"cell_type": "code",
158+
"execution_count": null,
159+
"id": "f7c8199a",
160+
"metadata": {},
161+
"outputs": [],
162+
"source": [
163+
"sdsc_pattern = r\"(?:SwissDataScienceCenter|SDSC)\"\n",
164+
"sdsc_df = df[\n",
165+
" df[\"source\"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False) |\n",
166+
" df[\"target\"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False)\n",
167+
"]\n",
168+
"sdsc_df.head()"
169+
]
170+
},
171+
{
172+
"cell_type": "code",
173+
"execution_count": null,
174+
"id": "2d2ef7c3",
175+
"metadata": {},
176+
"outputs": [],
177+
"source": [
178+
"sdsc_graph = df_to_pydantic_models(sdsc_df, relationships)\n",
179+
"epfl_graph = df_to_pydantic_models(epfl_df, relationships)"
180+
]
181+
},
131182
{
132183
"cell_type": "markdown",
133184
"id": "4f5f2f38",
@@ -164,8 +215,15 @@
164215
"source": [
165216
"from utils.visualization import visualize_graph\n",
166217
"from pathlib import Path\n",
167-
"output_path = Path(\"plots/graphs/graph_visualization.png\")\n",
168-
"visualize_graph(graph, output_path)"
218+
"\n",
219+
"output_path = Path(\"plots/graphs/graph_200_visualization.png\")\n",
220+
"visualize_graph(graph, output_path)\n",
221+
"\n",
222+
"output_path = Path(\"plots/graphs/sdsc_graph.png\")\n",
223+
"visualize_graph(sdsc_graph, output_path)\n",
224+
"\n",
225+
"output_path = Path(\"plots/graphs/epfl_graph.png\")\n",
226+
"visualize_graph(epfl_graph, output_path)"
169227
]
170228
},
171229
{
@@ -188,7 +246,25 @@
188246
"from utils.visualization import visualize_clusters\n",
189247
"from pathlib import Path\n",
190248
"output_dir = Path(\"plots/clusters/\")\n",
191-
"visualize_clusters(graph, output_dir)"
249+
"\n",
250+
"cluster_prefix_name = \"200_first_nodes\"\n",
251+
"visualize_clusters(graph, output_dir, cluster_prefix_name)\n",
252+
"\n",
253+
"cluster_prefix_name = \"sdsc\"\n",
254+
"visualize_clusters(sdsc_graph, output_dir, cluster_prefix_name)\n",
255+
"\n",
256+
"cluster_prefix_name = \"epfl\"\n",
257+
"visualize_clusters(epfl_graph, output_dir, cluster_prefix_name)\n"
258+
]
259+
},
260+
{
261+
"cell_type": "markdown",
262+
"id": "3e2d2c1e",
263+
"metadata": {},
264+
"source": [
265+
"### Follow-up on this example visualization\n",
266+
"\n",
267+
"For EPFL, plain string matching misses many of the EPFL-affiliated repositories. How could other tools and approaches complement it to produce a better EPFL graph? Your turn to play around, good luck!"
192268
]
193269
}
194270
],
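
The EPFL and SDSC cells in the notebook repeat the same `str.contains` expression on the `source` and `target` columns. As a sketch only, the filter could be factored into one function. `filter_edges` is a hypothetical name, and the toy dataframe (including its `relationship` column) is made up for illustration; only `source` and `target` are taken from the notebook. The non-capturing group `(?:...)` avoids the match-group warning that pandas emits for patterns with capture groups.

```python
import re

import pandas as pd


def filter_edges(df: pd.DataFrame, pattern: str) -> pd.DataFrame:
    """Keep rows whose source or target matches `pattern`, ignoring case.

    Hypothetical helper; mirrors the filter used in the notebook cells.
    """
    source_hit = df["source"].astype(str).str.contains(pattern, flags=re.IGNORECASE, na=False)
    target_hit = df["target"].astype(str).str.contains(pattern, flags=re.IGNORECASE, na=False)
    return df[source_hit | target_hit]


# Toy data for illustration only (not taken from the Neo4j database)
toy_df = pd.DataFrame(
    {
        "source": ["alice", "epfl-dev", "bob"],
        "target": ["EPFL", "some-repo", "SwissDataScienceCenter"],
        "relationship": ["member_of", "owner_of", "member_of"],
    }
)

epfl_rows = filter_edges(toy_df, r"EPFL")
sdsc_rows = filter_edges(toy_df, r"(?:SwissDataScienceCenter|SDSC)")
print(epfl_rows)
print(sdsc_rows)
```

With this helper, a new organisation only needs a new pattern rather than another copy of the boolean expression.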

src/neo4j-quickstart/quickstart.py

Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
import neo4j
import re
from pathlib import Path
import os
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file

from utils.neo4jdownloader import Neo4JDownloader
from utils.builder_dataframe import neo4j_to_dataframe
from utils.builder_models import df_to_pydantic_models
from utils.visualization import visualize_graph
from utils.visualization import visualize_clusters

# ---------------------------
# EXTRACT DATA FROM NEO4J
# ---------------------------

# Define your nodes

nodes = ["user", "repo", "org"]

# Define your relationships (edges)

relationships = {
    "member_of": {"type1": {"source": "user", "target": "org"}},
    "owner_of": {
        "type1": {"source": "user", "target": "repo"},
        "type2": {"source": "org", "target": "repo"},
    },
    "contributor_of": {
        "type1": {"source": "user", "target": "repo"},
        "type2": {"source": "org", "target": "repo"},
    },
    "parent_of": {
        "type1": {"source": "repo", "target": "repo"},
    },
}

def get_downloader():
    # Connection settings come from the environment (see load_dotenv() above)
    NEO4J_URI = os.environ.get("NEO4J_URI")
    NEO4J_USERNAME = os.environ.get("NEO4J_USER")
    NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD")
    NEO4J_DATABASE = os.environ.get("NEO4J_DATABASE")

    print(NEO4J_URI)

    return Neo4JDownloader(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE)

def extract_data(nodes, relationships):
    downloader = get_downloader()

    try:
        nodes_ids, nodes_features = downloader.retrieve_nodes(nodes)
        edges_indices, edges_attributes = downloader.retrieve_edges(relationships)

        return nodes_ids, nodes_features, edges_indices, edges_attributes
    finally:
        # Always close the downloader, even if retrieval fails
        downloader.close()


nodes_ids, nodes_features, edges_indices, edges_attributes = extract_data(nodes, relationships)
# Example of looking at the output
# print(nodes_ids["org"])
# print(nodes_features["org"])
# print(edges_indices)

# -------------------------------------------
# MAKE NEO4J DATA INTO A PANDAS DATAFRAME
# -------------------------------------------

df = neo4j_to_dataframe(nodes_ids, nodes_features, edges_indices, relationships)
print("Dataframe constructed, shape is:", df.shape)

# -------------------------------------------
# EXPLORE / FILTER PANDAS DATAFRAME
# -------------------------------------------

# Define your pattern and filter the dataframe

epfl_pattern = r"EPFL"
epfl_df = df[
    df['source'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False) |
    df['target'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False)
]
print(epfl_df.head())
print(epfl_df.shape)

# Non-capturing group: avoids the pandas warning about match groups in str.contains
sdsc_pattern = r"(?:SwissDataScienceCenter|SDSC)"
sdsc_df = df[
    df["source"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False) |
    df["target"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False)
]
print(sdsc_df.head())
print(sdsc_df.shape)

# -----------------------------------------------------------------------
# FEED YOUR DATAFRAME TO THE PYDANTIC MODELS AND VISUALIZE THE GRAPH
# -----------------------------------------------------------------------

# From Dataframes to Graphs (via Pydantic)
# First 200 rows, as in the notebook (matches the "200" output names below)
graph = df_to_pydantic_models(df.head(200), relationships)
sdsc_graph = df_to_pydantic_models(sdsc_df, relationships)
epfl_graph = df_to_pydantic_models(epfl_df, relationships)

# Full Graphs

output_path = Path("plots/graphs/graph_200_visualization.png")
visualize_graph(graph, output_path)

output_path = Path("plots/graphs/sdsc_graph.png")
visualize_graph(sdsc_graph, output_path)

output_path = Path("plots/graphs/epfl_graph.png")
visualize_graph(epfl_graph, output_path)

# Clusters

output_dir = Path("plots/clusters/")

cluster_prefix_name = "200_first_nodes"
visualize_clusters(graph, output_dir, cluster_prefix_name)

cluster_prefix_name = "sdsc"
visualize_clusters(sdsc_graph, output_dir, cluster_prefix_name)

cluster_prefix_name = "epfl"
visualize_clusters(epfl_graph, output_dir, cluster_prefix_name)

# -----------------------------------------------------------------------
# DEMO FOLLOW UP

# For EPFL, plain string matching misses many of the EPFL-affiliated repositories.
# How could other tools and approaches complement it to produce a better EPFL graph?
# Your turn to play around, good luck!

# -----------------------------------------------------------------------
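
One possible starting point for the follow-up question, offered as a sketch and not as the intended solution: start from the rows found by string matching, collect every node they mention, and then keep all rows that touch any of those nodes. `expand_one_hop` is a hypothetical helper and the toy dataframe is invented for illustration; the only assumption taken from the script is that the dataframe has `source` and `target` columns.

```python
import re

import pandas as pd


def expand_one_hop(df: pd.DataFrame, pattern: str) -> pd.DataFrame:
    """Return rows touching any node that appears in a string-matched row.

    Hypothetical helper: step 1 is the same regex filter as in the script,
    step 2 widens the selection by one hop through the matched nodes.
    """
    # Step 1: seed rows found by plain string matching
    seed_mask = (
        df["source"].astype(str).str.contains(pattern, flags=re.IGNORECASE, na=False)
        | df["target"].astype(str).str.contains(pattern, flags=re.IGNORECASE, na=False)
    )
    # Every node (matching or not) that appears in a seed row
    seed_nodes = set(df.loc[seed_mask, "source"]) | set(df.loc[seed_mask, "target"])
    # Step 2: keep all rows connected to one of those nodes
    hop_mask = df["source"].isin(seed_nodes) | df["target"].isin(seed_nodes)
    return df[hop_mask]


# Toy data for illustration only: alice is linked to EPFL and owns a repo
# whose name does not contain "EPFL"
toy_edges = pd.DataFrame(
    {
        "source": ["alice", "alice", "bob"],
        "target": ["EPFL", "thesis-code", "other-repo"],
    }
)

expanded = expand_one_hop(toy_edges, r"EPFL")
print(expanded)
```

The expansion trades precision for recall: a user linked to EPFL may also own unrelated repositories, so the result is best treated as a candidate set to refine further.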
