Commit 4b2d98d

cleanup
1 parent 98824d3 commit 4b2d98d

7 files changed: +185 -15 lines changed

.env.sample

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@
 # by ./setup_secrets.sh

 NEO4J_URI=neo4j+s://xxxxxxxx.databases.neo4j.io
-NEO4J_USER=neo4j
+NEO4J_USERNAME=neo4j
 NEO4J_PASSWORD=replace-with-aura-password

00_setup_and_data.py

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@
 CATALOG = "graph_feature_engineering_demo"
 SCHEMA = "neo4j_webinar"

-NEO4J_URI = dbutils.secrets.get("neo4j-graph-engineering", "uri") # neo4j+s://xxx.databases.neo4j.io
-NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "user") # neo4j
+NEO4J_URI = dbutils.secrets.get("neo4j-graph-engineering", "uri") # neo4j+s://xxx.databases.neo4j.io
+NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "username") # neo4j
 NEO4J_PASSWORD = dbutils.secrets.get("neo4j-graph-engineering", "password") # from Aura credentials file

# COMMAND ----------
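
Review note: the secret key is renamed from `user` to `username`, so a scope populated before this commit would presumably need `setup_secrets.sh` rerun before these notebooks can read the new key. A minimal sketch of a tolerant lookup follows. It assumes a hypothetical helper `get_secret_with_fallback` and a stand-in `FakeSecrets` class so that it runs outside Databricks; neither is part of this commit, and the real `dbutils.secrets` object may raise a different exception type.

```python
class FakeSecrets:
    """Hypothetical stand-in for dbutils.secrets so the sketch runs anywhere."""

    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        try:
            return self._store[(scope, key)]
        except KeyError:
            raise ValueError(f"Secret does not exist: {scope}/{key}")


def get_secret_with_fallback(secrets, scope, keys):
    """Return the first secret found among `keys`, trying them in order."""
    last_error = None
    for key in keys:
        try:
            return secrets.get(scope, key)
        except Exception as err:  # exception type in Databricks is an assumption
            last_error = err
    raise LookupError(f"none of {keys} found in scope {scope!r}") from last_error


SCOPE = "neo4j-graph-engineering"

# One scope written before the rename, one written after it.
old_scope = FakeSecrets({(SCOPE, "user"): "neo4j"})
new_scope = FakeSecrets({(SCOPE, "username"): "neo4j"})

user_from_old = get_secret_with_fallback(old_scope, SCOPE, ["username", "user"])
user_from_new = get_secret_with_fallback(new_scope, SCOPE, ["username", "user"])
```

In a notebook the first argument would be `dbutils.secrets`. Whether a fallback is worth keeping depends on how many workspaces still hold the old key.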

01_neo4j_ingest.py

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@
 SCHEMA = "neo4j_webinar"

 NEO4J_URI = dbutils.secrets.get("neo4j-graph-engineering", "uri")
-NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "user")
+NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "username")
 NEO4J_PASSWORD = dbutils.secrets.get("neo4j-graph-engineering", "password")

 # Common Spark Connector options

03_pull_and_model.py

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
 SCHEMA = "neo4j_webinar"

 NEO4J_URI = dbutils.secrets.get("neo4j-graph-engineering", "uri")
-NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "user")
+NEO4J_USER = dbutils.secrets.get("neo4j-graph-engineering", "username")
 NEO4J_PASSWORD = dbutils.secrets.get("neo4j-graph-engineering", "password")

 NEO4J_OPTS = {

README.md

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ The notebooks read credentials via:

 ```python
 dbutils.secrets.get("neo4j-graph-engineering", "uri")
-dbutils.secrets.get("neo4j-graph-engineering", "user")
+dbutils.secrets.get("neo4j-graph-engineering", "username")
 dbutils.secrets.get("neo4j-graph-engineering", "password")
 ```

aura_gds_guide.md

Lines changed: 123 additions & 2 deletions
@@ -18,10 +18,14 @@ When finished, return to Databricks and run `03_pull_and_model`.

 ---

-## Step 1: Verify the Graph Loaded Correctly
+## Step 1: Verify and Explore the Graph
+
+Before projecting anything, make sure the ingest landed and get a feel for the
+shape of the data.
+
+### 1a. Node and relationship counts

 ```cypher
-// Check node and relationship counts
 MATCH (a:Account) WITH count(a) AS accounts
 MATCH (m:Merchant) WITH accounts, count(m) AS merchants
 MATCH ()-[t:TRANSACTED_WITH]->() WITH accounts, merchants, count(t) AS txns
@@ -31,6 +35,50 @@ RETURN accounts, merchants, txns, p2p

 **Expected:** ~5,000 accounts, ~500 merchants, ~50,000 transactions, ~8,000 transfers.

+### 1b. Fraud vs legitimate account breakdown
+
+```cypher
+MATCH (a:Account)
+RETURN a.is_fraud AS is_fraud,
+       count(a) AS account_count,
+       round(avg(a.balance), 2) AS avg_balance,
+       min(a.holder_age) AS min_age,
+       max(a.holder_age) AS max_age
+ORDER BY is_fraud DESC
+```
+
+**What to look for:** ~200 fraud accounts (4%) vs ~4,800 legitimate accounts.
+Balances and holder-age ranges should overlap heavily — that's the whole point
+of the dataset. The graph is where the separation lives.
+
+### 1c. Merchant risk-tier distribution
+
+```cypher
+MATCH (m:Merchant)
+RETURN m.risk_tier AS risk_tier,
+       m.category AS category,
+       count(m) AS merchant_count
+ORDER BY risk_tier, merchant_count DESC
+```
+
+**What to look for:** `crypto` and `gaming` categories skew heavily toward
+`risk_tier = high` — these are the merchants fraud accounts preferentially
+transact with.
+
+### 1d. Sample the subgraph around a fraud account
+
+```cypher
+MATCH (a:Account {is_fraud: true})
+WITH a LIMIT 1
+OPTIONAL MATCH (a)-[t:TRANSACTED_WITH]->(m:Merchant)
+OPTIONAL MATCH (a)-[p:TRANSFERRED_TO]->(b:Account)
+RETURN a, t, m, p, b
+```
+
+**What to look for:** the fraud account should connect to several `risk_tier = high`
+merchants and have at least one outgoing `TRANSFERRED_TO` edge to another
+account. Good visual primer before running the algorithms.
+
 ---

 ## Step 2: Project the Account Transfer Graph
@@ -248,6 +296,79 @@ across all three features.

 ---

+## Step 11: Fraud Detection Queries in Pure Cypher
+
+Before handing the features back to Databricks, it is worth seeing the payoff
+in Cypher alone. These two queries combine the GDS-written properties with
+the raw graph to surface fraud patterns directly.
+
+### 11a. Identify Ring Members
+
+A fraud ring is a Louvain community where multiple accounts both send *and*
+receive money within the same community. Accounts that only send or only
+receive are peripheral; accounts on both sides of a transfer are core ring
+participants. The query collects senders and receivers per community, then
+intersects them — any account in both lists is a confirmed bidirectional
+participant. Communities with three or more such accounts are coordinated
+rings, not coincidence.
+
+```cypher
+MATCH (s:Account)-[:TRANSFERRED_TO]->(r:Account)
+WHERE s.community_id IS NOT NULL
+  AND s.community_id = r.community_id
+WITH s.community_id AS community,
+     collect(DISTINCT s.account_id) AS senders,
+     collect(DISTINCT r.account_id) AS receivers
+WITH community,
+     [x IN senders WHERE x IN receivers] AS ring_members
+WHERE size(ring_members) >= 3
+RETURN community,
+       ring_members,
+       size(ring_members) AS ring_size
+ORDER BY ring_size DESC
+```
+
+**What to look for:** small communities (tight clusters) with `ring_size >= 3`.
+Cross-reference the `ring_members` account IDs against the `is_fraud` ground
+truth and you should see a high precision — the Louvain + bidirectional
+intersection combo finds rings without needing labels.
+
+### 11b. Off-Hours Transaction Detection
+
+Fraud accounts in this dataset skew slightly toward off-hours activity.
+Flagging accounts with three or more transactions between midnight and 5am,
+then joining the already-written `risk_score` and `community_id`, gives a
+single ranked list that combines structural (graph) and behavioural (time-of-day)
+signal.
+
+```cypher
+MATCH (a:Account)-[t:TRANSACTED_WITH]->(m:Merchant)
+WHERE t.txn_hour >= 0 AND t.txn_hour < 6
+WITH a,
+     count(t) AS off_hours_count,
+     round(avg(t.amount), 2) AS avg_amount,
+     round(sum(t.amount), 2) AS total_amount,
+     collect(DISTINCT m.merchant_id) AS merchants_used
+WHERE off_hours_count >= 3
+RETURN a.account_id AS account_id,
+       a.is_fraud AS is_fraud,
+       a.risk_score AS risk_score,
+       a.community_id AS community_id,
+       off_hours_count,
+       avg_amount,
+       total_amount,
+       size(merchants_used) AS distinct_merchants
+ORDER BY off_hours_count DESC
+LIMIT 25
+```
+
+**What to look for:** accounts with high `off_hours_count` that *also* have
+a high `risk_score` and share a `community_id` with other flagged accounts.
+Those are the strongest fraud candidates — three independent signals pointing
+at the same account.
+
+---
+
 ## Done in Aura

The graph now has three GDS-computed properties on every Account node:
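
Review note on the 11a query in this file: its logic can be checked offline on a handful of edges. The following is a sketch in plain Python rather than Cypher. It assumes hypothetical sample data (`community_of` and `transfers` are made up for illustration) and a hypothetical function name `ring_members`. It mirrors the query's steps: keep transfers whose endpoints share a non-null community, collect senders and receivers per community, intersect them, and keep communities with at least three bidirectional accounts.

```python
from collections import defaultdict


def ring_members(transfers, community_of, min_size=3):
    """Return {community: sorted bidirectional accounts} for rings of min_size or more."""
    senders = defaultdict(set)
    receivers = defaultdict(set)
    for sender, receiver in transfers:
        community = community_of.get(sender)
        # Skip accounts without a community and transfers that cross communities.
        if community is None or community != community_of.get(receiver):
            continue
        senders[community].add(sender)
        receivers[community].add(receiver)

    rings = {}
    for community, sent in senders.items():
        both = sorted(sent & receivers[community])  # accounts that send and receive
        if len(both) >= min_size:
            rings[community] = both
    return rings


# Hypothetical sample: community 1 holds a three-account cycle plus a sender-only
# account D; community 2 holds only a two-account back-and-forth.
community_of = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2}
transfers = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # cycle inside community 1
    ("D", "A"),                          # D only sends, so it is peripheral
    ("E", "F"), ("F", "E"),              # bidirectional, but only two accounts
    ("C", "E"),                          # cross-community, ignored
]

rings = ring_members(transfers, community_of)
```

On this sample only community 1 qualifies, and D is excluded because it never receives a transfer inside the community.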

setup_secrets.sh

Lines changed: 56 additions & 7 deletions
@@ -3,11 +3,39 @@
 # scope "neo4j-graph-engineering". Requires the Databricks CLI to be
 # installed and authenticated (databricks auth login or DATABRICKS_HOST/
 # DATABRICKS_TOKEN env vars).
+#
+# Usage:
+# ./setup_secrets.sh [--profile NAME] [ENV_FILE]
+#
+# The Databricks profile is resolved once and exported as
+# DATABRICKS_CONFIG_PROFILE so every subsequent CLI call in this script
+# reuses it without re-prompting. Resolution order:
+# 1. --profile / -p flag
+# 2. DATABRICKS_CONFIG_PROFILE environment variable
+# 3. Interactive prompt (lists available profiles)

 set -euo pipefail

 SCOPE="neo4j-graph-engineering"
-ENV_FILE="${1:-.env}"
+ENV_FILE=".env"
+PROFILE="${DATABRICKS_CONFIG_PROFILE:-}"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    -p|--profile)
+      PROFILE="$2"
+      shift 2
+      ;;
+    -h|--help)
+      echo "Usage: $0 [--profile NAME] [ENV_FILE]"
+      exit 0
+      ;;
+    *)
+      ENV_FILE="$1"
+      shift
+      ;;
+  esac
+done

 if [ ! -f "$ENV_FILE" ]; then
   echo "Error: $ENV_FILE not found."
@@ -20,22 +48,43 @@ if ! command -v databricks >/dev/null 2>&1; then
   exit 1
 fi

+# Resolve the Databricks profile once — every CLI call below inherits it via
+# the exported DATABRICKS_CONFIG_PROFILE, so the user is never re-prompted.
+if [ -z "$PROFILE" ]; then
+  echo "Available Databricks profiles:"
+  databricks auth profiles 2>/dev/null || echo " (could not list profiles — check your ~/.databrickscfg)"
+  echo
+  read -r -p "Profile name [DEFAULT]: " PROFILE
+  PROFILE="${PROFILE:-DEFAULT}"
+fi
+
+export DATABRICKS_CONFIG_PROFILE="$PROFILE"
+echo "Using Databricks profile: $DATABRICKS_CONFIG_PROFILE"
+echo
+
 # Load .env
 set -a
 # shellcheck disable=SC1090
 source "$ENV_FILE"
 set +a

 : "${NEO4J_URI:?NEO4J_URI is not set in $ENV_FILE}"
-: "${NEO4J_USER:?NEO4J_USER is not set in $ENV_FILE}"
+: "${NEO4J_USERNAME:?NEO4J_USERNAME is not set in $ENV_FILE}"
 : "${NEO4J_PASSWORD:?NEO4J_PASSWORD is not set in $ENV_FILE}"

-# Create the scope if it does not already exist
-if databricks secrets list-scopes --output json | grep -q "\"name\":\"$SCOPE\""; then
+# Create the scope — if it already exists, that is fine.
+set +e
+create_out=$(databricks secrets create-scope "$SCOPE" 2>&1)
+create_rc=$?
+set -e
+
+if [ "$create_rc" -eq 0 ]; then
+  echo "Created secret scope: $SCOPE"
+elif [[ "$create_out" == *"already exists"* ]]; then
   echo "Secret scope already exists: $SCOPE"
 else
-  echo "Creating secret scope: $SCOPE"
-  databricks secrets create-scope "$SCOPE"
+  echo "Error creating scope: $create_out" >&2
+  exit 1
 fi

 put_secret() {
@@ -47,7 +96,7 @@ put_secret() {

 echo "Writing secrets into $SCOPE:"
 put_secret "uri" "$NEO4J_URI"
-put_secret "user" "$NEO4J_USER"
+put_secret "username" "$NEO4J_USERNAME"
 put_secret "password" "$NEO4J_PASSWORD"

echo
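
Review note: the profile resolution order documented in the script header (flag, then `DATABRICKS_CONFIG_PROFILE`, then an interactive prompt defaulting to `DEFAULT`) can be stated as a small pure function, which makes the precedence easy to test. This is a sketch in Python, not a translation of the script. `resolve_profile` and its `prompt` callback are hypothetical names, and `flag_value=None` stands for "no `--profile` flag given".

```python
def resolve_profile(flag_value, environ, prompt):
    """Pick a Databricks profile: flag first, then env var, then prompt."""
    # 1. An explicit --profile / -p value wins.
    if flag_value:
        return flag_value
    # 2. Otherwise fall back to the environment variable, if set and non-empty.
    env_value = environ.get("DATABRICKS_CONFIG_PROFILE", "")
    if env_value:
        return env_value
    # 3. Otherwise ask the user; an empty answer means DEFAULT.
    answer = prompt("Profile name [DEFAULT]: ").strip()
    return answer or "DEFAULT"


# Usage with canned answers in place of a real terminal prompt.
from_flag = resolve_profile("dev", {"DATABRICKS_CONFIG_PROFILE": "prod"}, lambda _: "ignored")
from_env = resolve_profile(None, {"DATABRICKS_CONFIG_PROFILE": "prod"}, lambda _: "ignored")
from_prompt = resolve_profile(None, {}, lambda _: "staging")
from_default = resolve_profile(None, {}, lambda _: "")
```

Passing the prompt as a callback keeps the function free of terminal I/O, so the precedence can be exercised without a Databricks CLI installed.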
