@@ -18,10 +18,14 @@ When finished, return to Databricks and run `03_pull_and_model`.
1818
1919---
2020
21- ## Step 1: Verify the Graph Loaded Correctly
21+ ## Step 1: Verify and Explore the Graph
22+
23+ Before projecting anything, make sure the ingest landed and get a feel for the
24+ shape of the data.
25+
26+ ### 1a. Node and relationship counts
2227
2328``` cypher
24- // Check node and relationship counts
2529MATCH (a:Account) WITH count(a) AS accounts
2630MATCH (m:Merchant) WITH accounts, count(m) AS merchants
2731MATCH ()-[t:TRANSACTED_WITH]->() WITH accounts, merchants, count(t) AS txns
@@ -31,6 +35,50 @@ RETURN accounts, merchants, txns, p2p
3135
3236**Expected:** ~5,000 accounts, ~500 merchants, ~50,000 transactions, ~8,000 transfers.
3337
38+ ### 1b. Fraud vs legitimate account breakdown
39+
40+ ``` cypher
41+ MATCH (a:Account)
42+ RETURN a.is_fraud AS is_fraud,
43+ count(a) AS account_count,
44+ round(avg(a.balance), 2) AS avg_balance,
45+ min(a.holder_age) AS min_age,
46+ max(a.holder_age) AS max_age
47+ ORDER BY is_fraud DESC
48+ ```
49+
50+ **What to look for:** ~200 fraud accounts (4%) vs ~4,800 legitimate accounts.
51+ Balances and holder-age ranges should overlap heavily; that is the whole point
52+ of the dataset. The separation shows up in the graph structure, not in these node properties.
53+
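As a quick structural counterpart to the property breakdown above, the following sketch compares outgoing transfer counts per class. It uses only labels and properties that already appear in this notebook (`Account`, `is_fraud`, `TRANSFERRED_TO`); the column aliases are illustrative, and the dataset is not guaranteed to show a clean gap on this single measure.

```cypher
// Outgoing P2P transfer counts, grouped by fraud label.
// OPTIONAL MATCH keeps accounts with no transfers (count(p) is 0 for them).
MATCH (a:Account)
OPTIONAL MATCH (a)-[p:TRANSFERRED_TO]->(:Account)
WITH a, count(p) AS out_transfers
RETURN a.is_fraud                    AS is_fraud,
       count(a)                      AS account_count,
       round(avg(out_transfers), 2)  AS avg_out_transfers,
       max(out_transfers)            AS max_out_transfers
ORDER BY is_fraud DESC
```

If the two rows look similar here as well, that is not a problem: the later GDS steps look at community and ranking structure rather than raw degree.
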
54+ ### 1c. Merchant risk-tier distribution
55+
56+ ``` cypher
57+ MATCH (m:Merchant)
58+ RETURN m.risk_tier AS risk_tier,
59+ m.category AS category,
60+ count(m) AS merchant_count
61+ ORDER BY risk_tier, merchant_count DESC
62+ ```
63+
64+ **What to look for:** `crypto` and `gaming` categories skew heavily toward
65+ `risk_tier = high`; these are the merchants fraud accounts preferentially
66+ transact with.
67+
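The claim that fraud accounts prefer high-risk merchants can be checked directly. This is a minimal sketch that counts transactions per fraud label and risk tier, using only names already present in the queries of this section.

```cypher
// Transaction volume per (is_fraud, risk_tier) pair.
// Compare the share of each tier within the fraud rows against the legitimate rows.
MATCH (a:Account)-[t:TRANSACTED_WITH]->(m:Merchant)
RETURN a.is_fraud   AS is_fraud,
       m.risk_tier  AS risk_tier,
       count(t)     AS txn_count
ORDER BY is_fraud DESC, txn_count DESC
```

Because legitimate accounts outnumber fraud accounts, compare proportions within each label rather than raw counts.
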
68+ ### 1d. Sample the subgraph around a fraud account
69+
70+ ``` cypher
71+ MATCH (a:Account {is_fraud: true})
72+ WITH a LIMIT 1
73+ OPTIONAL MATCH (a)-[t:TRANSACTED_WITH]->(m:Merchant)
74+ OPTIONAL MATCH (a)-[p:TRANSFERRED_TO]->(b:Account)
75+ RETURN a, t, m, p, b
76+ ```
77+
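78+ **What to look for:** the fraud account should connect to several `risk_tier = high`
79+ merchants and have at least one outgoing `TRANSFERRED_TO` edge to another
80+ account. Because the two `OPTIONAL MATCH` clauses are independent, the result holds one row per transaction/transfer pair; the graph view deduplicates nodes and edges, so the picture is unaffected. This is a good visual primer before running the algorithms.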
81+
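For a tabular view of the same neighbourhood without the row multiplication, a sketch using pattern comprehensions is shown below. It assumes `risk_tier` is stored as the string `'high'`, which this notebook does not state explicitly; adjust the literal if the ingest used a different encoding.

```cypher
// One row for the sampled fraud account, with neighbourhood sizes as columns.
// ASSUMPTION: risk_tier values are strings such as 'high'.
MATCH (a:Account {is_fraud: true})
WITH a LIMIT 1
RETURN a.account_id AS account_id,
       size([(a)-[:TRANSACTED_WITH]->(:Merchant) | 1]) AS txn_count,
       size([(a)-[:TRANSACTED_WITH]->(m:Merchant)
             WHERE m.risk_tier = 'high' | 1])          AS high_risk_txn_count,
       size([(a)-[:TRANSFERRED_TO]->(:Account) | 1])   AS outgoing_transfers
```
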
3482---
3583
3684## Step 2: Project the Account Transfer Graph
@@ -248,6 +296,79 @@ across all three features.
248296
249297---
250298
299+ ## Step 11: Fraud Detection Queries in Pure Cypher
300+
301+ Before handing the features back to Databricks, it is worth seeing the payoff
302+ in Cypher alone. These two queries combine the GDS-written properties with
303+ the raw graph to surface fraud patterns directly.
304+
305+ ### 11a. Identify Ring Members
306+
307+ Here, a fraud ring is a Louvain community where multiple accounts both send *and*
308+ receive money within the same community. Accounts that only send or only
309+ receive are peripheral; accounts on both sides of a transfer are core ring
310+ participants. The query collects senders and receivers per community, then
311+ intersects them: any account in both lists is a bidirectional participant.
312+ Communities with three or more such accounts are treated as likely coordinated
313+ rings rather than coincidence.
314+
315+ ``` cypher
316+ MATCH (s:Account)-[:TRANSFERRED_TO]->(r:Account)
317+ WHERE s.community_id IS NOT NULL
318+ AND s.community_id = r.community_id
319+ WITH s.community_id AS community,
320+ collect(DISTINCT s.account_id) AS senders,
321+ collect(DISTINCT r.account_id) AS receivers
322+ WITH community,
323+ [x IN senders WHERE x IN receivers] AS ring_members
324+ WHERE size(ring_members) >= 3
325+ RETURN community,
326+ ring_members,
327+ size(ring_members) AS ring_size
328+ ORDER BY ring_size DESC
329+ ```
330+
331+ **What to look for:** small communities (tight clusters) with `ring_size >= 3`.
332+ Cross-reference the `ring_members` account IDs against the `is_fraud` ground
333+ truth and you should see high precision: combining Louvain communities with the
334+ bidirectional intersection finds rings without needing labels.
335+
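The cross-reference against `is_fraud` can itself be done in Cypher. The sketch below reuses the ring-member logic from 11a but collects nodes instead of IDs, then computes precision over all flagged accounts. The `flagged`, `true_positives`, and `precision` aliases are illustrative, and the `CASE` guard returns `null` when no rings are found.

```cypher
// Precision of the ring-member rule against the is_fraud ground truth.
MATCH (s:Account)-[:TRANSFERRED_TO]->(r:Account)
WHERE s.community_id IS NOT NULL
  AND s.community_id = r.community_id
WITH s.community_id AS community,
     collect(DISTINCT s) AS senders,
     collect(DISTINCT r) AS receivers
WITH community,
     [x IN senders WHERE x IN receivers] AS ring_members
WHERE size(ring_members) >= 3
UNWIND ring_members AS member
WITH count(member) AS flagged,
     sum(CASE WHEN member.is_fraud THEN 1 ELSE 0 END) AS true_positives
RETURN flagged,
       true_positives,
       CASE WHEN flagged = 0 THEN null
            ELSE round(toFloat(true_positives) / flagged, 3)
       END AS precision
```
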
336+ ### 11b. Off-Hours Transaction Detection
337+
338+ Fraud accounts in this dataset skew slightly toward off-hours activity.
339+ Flagging accounts with three or more transactions between midnight and 6am
340+ (`txn_hour` 0 through 5), then returning the already-written `risk_score` and
341+ `community_id` alongside, gives a single list, ranked by off-hours count, that
342+ combines structural (graph) and behavioural (time-of-day) signal.
343+
344+ ``` cypher
345+ MATCH (a:Account)-[t:TRANSACTED_WITH]->(m:Merchant)
346+ WHERE t.txn_hour >= 0 AND t.txn_hour < 6
347+ WITH a,
348+ count(t) AS off_hours_count,
349+ round(avg(t.amount), 2) AS avg_amount,
350+ round(sum(t.amount), 2) AS total_amount,
351+ collect(DISTINCT m.merchant_id) AS merchants_used
352+ WHERE off_hours_count >= 3
353+ RETURN a.account_id AS account_id,
354+ a.is_fraud AS is_fraud,
355+ a.risk_score AS risk_score,
356+ a.community_id AS community_id,
357+ off_hours_count,
358+ avg_amount,
359+ total_amount,
360+ size(merchants_used) AS distinct_merchants
361+ ORDER BY off_hours_count DESC
362+ LIMIT 25
363+ ```
364+
365+ **What to look for:** accounts with high `off_hours_count` that *also* have
366+ a high `risk_score` and share a `community_id` with other flagged accounts.
367+ Those are the strongest fraud candidates: three signals pointing
368+ at the same account.
369+
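To see which flagged accounts share a `community_id` without scanning the list by eye, one option is to group the off-hours accounts by community. This is a sketch under the same thresholds as 11b; the cutoff of two flagged accounts per community is an illustrative choice, not something the dataset prescribes.

```cypher
// Communities containing two or more off-hours-flagged accounts.
MATCH (a:Account)-[t:TRANSACTED_WITH]->(:Merchant)
WHERE t.txn_hour >= 0 AND t.txn_hour < 6
WITH a, count(t) AS off_hours_count
WHERE off_hours_count >= 3
  AND a.community_id IS NOT NULL
WITH a.community_id               AS community_id,
     count(a)                     AS flagged_accounts,
     round(avg(a.risk_score), 4)  AS avg_risk_score,
     collect(a.account_id)        AS account_ids
WHERE flagged_accounts >= 2
RETURN community_id, flagged_accounts, avg_risk_score, account_ids
ORDER BY flagged_accounts DESC, avg_risk_score DESC
```
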
370+ ---
371+
251372## Done in Aura
252373
253374The graph now has three GDS-computed properties on every Account node: