- FOLLOW THE REQUIREMENTS EXACTLY! Do not add new features or functionality beyond the specific requirements requested and documented.
- ALWAYS FIX THE CORE ISSUE!
- COMPLETE CHANGE: All occurrences must be changed in a single, atomic update
- CLEAN IMPLEMENTATION: Simple, direct replacements only
- NO MIGRATION PHASES: Do not create temporary compatibility periods
- NO ROLLBACK PLANS: Never create rollback plans
- NO PARTIAL UPDATES: Change everything or change nothing
- NO COMPATIBILITY LAYERS OR BACKWARDS COMPATIBILITY: Do not maintain old and new paths simultaneously
- NO BACKUPS OF OLD CODE: Do not comment out old code "just in case"
- NO CODE DUPLICATION: Do not duplicate functions to handle both patterns
- NO WRAPPER FUNCTIONS: Direct replacements only, no abstraction layers
- DO NOT CALL FUNCTIONS ENHANCED OR IMPROVED: Update the actual methods directly. For example, if a class PropertyIndex needs improvement, do not create a separate ImprovedPropertyIndex; update PropertyIndex itself.
- USE MODULES AND CLEAN CODE!
- Never name things after phases or steps: No test_phase_2.py etc.
- ALWAYS USE PYDANTIC for Typed Classes
Rewrite the Table Access Audit graph module to use the Neo4j Spark Connector instead of the Neo4j Python driver. All graph reads and writes must go through Spark DataFrames. The Databricks cluster is already configured.
## Graph Data Model
- Six node types: User, Group, ServicePrincipal, Catalog, Schema, Table
- Five relationship types: MEMBER_OF, HAS_PRIVILEGE, OWNS, CONTAINS_SCHEMA, CONTAINS_TABLE
- All node properties and relationship properties unchanged
- All uniqueness constraints unchanged
## Databricks Client
- The existing Databricks SDK client for extracting Unity Catalog data remains unchanged
- Data extraction logic stays the same
## Query Capabilities
- User accessible tables lookup
- Table access list
- Access path discovery
- Group impact analysis
- Graph statistics
## Connection Management
- Remove Neo4j Python driver dependency
- Replace with Spark session and Neo4j Spark Connector options
- Connection configuration becomes Spark Connector option dictionaries
## Write Operations
- Replace individual Cypher MERGE statements with DataFrame-based writes
- Use the Spark Connector's Overwrite mode, combined with key options, to preserve MERGE semantics
- Batch multiple records into single DataFrame operations
## Read Operations
- Replace execute_query calls with Spark DataFrame reads
- Use custom Cypher via the query option for complex traversals
- Results returned as DataFrames instead of dictionaries
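A minimal sketch of what these reads might look like. This is a sketch, not the final implementation: `read_nodes` and `read_with_cypher` are hypothetical helper names, and `read_options` is assumed to be a dictionary carrying the connector's `url`, authentication, and `database` options.

```python
def read_nodes(spark, read_options, label):
    """Read all nodes with the given label as a Spark DataFrame."""
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)      # url, authentication, database
        .option("labels", label)      # simple label-based read
        .load()
    )


def read_with_cypher(spark, read_options, cypher):
    """Run custom Cypher for complex traversals; rows come back as a DataFrame."""
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)
        .option("query", cypher)
        .load()
    )
```

Callers receive DataFrames and can apply further Spark transformations, or `.collect()` when materialized results are needed.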
## Sync Orchestration
- Orchestrate DataFrame transformations and writes
- Use Spark actions for write execution
Create a configuration class that holds Neo4j Spark Connector options including URI, authentication credentials, and database name. Use Pydantic for the configuration model.
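A sketch of such a configuration model, assuming basic authentication. Field and method names here are illustrative; the option keys follow the Neo4j Spark Connector's documented naming.

```python
from pydantic import BaseModel


class Neo4jSparkConfig(BaseModel):
    """Neo4j Spark Connector options (field names are assumptions)."""
    uri: str
    username: str
    password: str
    database: str = "neo4j"

    def _options(self) -> dict:
        return {
            "url": self.uri,
            "authentication.type": "basic",
            "authentication.basic.username": self.username,
            "authentication.basic.password": self.password,
            "database": self.database,
        }

    def read_options(self) -> dict:
        """Options for spark.read.format("org.neo4j.spark.DataSource")."""
        return self._options()

    def write_options(self) -> dict:
        """Options for df.write.format("org.neo4j.spark.DataSource")."""
        return self._options()
```

Read and write options are identical here; keeping two methods gives divergent settings (for example, write batch sizing) an obvious home later.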
Write all six node types using the Spark Connector. Each node type must use the correct labels and node.keys options to ensure MERGE behavior matches current functionality.
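A possible shape for the node writes, using the key properties from the existing uniqueness constraints. The helper name is illustrative; `write_options` is assumed to come from the configuration class.

```python
# Key property per label, matching the existing uniqueness constraints.
NODE_KEYS = {
    "User": "username",
    "Group": "name",
    "ServicePrincipal": "application_id",
    "Catalog": "name",
    "Schema": "full_name",
    "Table": "full_name",
}


def write_nodes(df, write_options, label):
    """MERGE-style node write: Overwrite mode plus node.keys makes the
    connector match on the key property and update existing nodes."""
    (
        df.write.format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .options(**write_options)
        .option("labels", f":{label}")
        .option("node.keys", NODE_KEYS[label])
        .save()
    )
```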
Write all five relationship types using the Spark Connector. Each relationship type must specify source node keys, target node keys, and relationship properties where applicable.
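A sketch for one relationship type (MEMBER_OF). The DataFrame column names (`member_username`, `group_name`) are assumptions about the extracted data; the option keys are the connector's documented relationship options.

```python
def write_member_of(df, write_options):
    """Write MEMBER_OF edges. The 'keys' save strategy plus 'Match' save mode
    connects existing User and Group nodes by their key properties."""
    (
        df.write.format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .options(**write_options)
        .option("relationship", "MEMBER_OF")
        .option("relationship.save.strategy", "keys")
        .option("relationship.source.labels", ":User")
        .option("relationship.source.save.mode", "Match")
        .option("relationship.source.node.keys", "member_username:username")
        .option("relationship.target.labels", ":Group")
        .option("relationship.target.save.mode", "Match")
        .option("relationship.target.node.keys", "group_name:name")
        # For HAS_PRIVILEGE, relationship properties would be mapped with e.g.:
        # .option("relationship.properties", "privilege:privilege")
        .save()
    )
```

The `node.keys` entries use the connector's `column:property` mapping syntax.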
Create uniqueness constraints for all six node types. Handle constraint creation through the Spark Connector or direct Cypher execution.
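One way to run the constraint DDL through Spark is the connector's `script` option, which executes Cypher before a (here trivial) read query. Helper names are illustrative.

```python
# (label, key property) pairs for the six uniqueness constraints.
CONSTRAINTS = [
    ("User", "username"),
    ("Group", "name"),
    ("ServicePrincipal", "application_id"),
    ("Catalog", "name"),
    ("Schema", "full_name"),
    ("Table", "full_name"),
]


def constraint_cypher(label, prop):
    """Build the CREATE CONSTRAINT statement for one label/property pair."""
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


def create_constraints(spark, read_options):
    """Execute each constraint statement via the connector's script option."""
    for label, prop in CONSTRAINTS:
        (
            spark.read.format("org.neo4j.spark.DataSource")
            .options(**read_options)
            .option("script", constraint_cypher(label, prop))
            .option("query", "RETURN 1 AS ok")  # minimal query to trigger execution
            .load()
            .collect()
        )
```

An alternative worth evaluating is the connector's `schema.optimization.type` option on writes, which can create node constraints automatically.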
Support clearing all nodes and relationships for full sync operations.
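Clearing the graph can use the connector's `script` option as well; a sketch, with an illustrative helper name:

```python
CLEAR_GRAPH_CYPHER = "MATCH (n) DETACH DELETE n"


def clear_graph(spark, read_options):
    """Remove all nodes and relationships before a full sync."""
    (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)
        .option("script", CLEAR_GRAPH_CYPHER)
        .option("query", "RETURN 1 AS ok")
        .load()
        .collect()
    )
```

On large graphs a single DETACH DELETE may exhaust memory; batching the deletion (for example with `CALL { ... } IN TRANSACTIONS`) is worth considering.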
Implement all existing query methods using Spark Connector reads. Complex graph traversals use the query option with custom Cypher.
Coordinate the full sync process: clear graph, create constraints, write nodes, write relationships.
Return sync statistics (nodes created, relationships created) and query results using Pydantic models.
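The result models might look like the following; field names are assumptions to be aligned with the current return shapes.

```python
from typing import Dict

from pydantic import BaseModel


class SyncStats(BaseModel):
    """Returned by the sync orchestrator after a full sync."""
    nodes_created: int = 0
    relationships_created: int = 0


class GraphStats(BaseModel):
    """Returned by the graph statistics query."""
    node_counts: Dict[str, int]
    relationship_counts: Dict[str, int]
```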
## File Changes
| File | Purpose |
|---|---|
| connection.py | Replace with Spark Connector configuration |
| schema.py | Update constraint creation for Spark execution |
| writer.py | Replace with DataFrame-based writes |
| queries.py | Replace with DataFrame-based reads |
| sync.py | Update orchestration for Spark operations |
Objective: Replace Neo4j Python driver connection with Spark Connector configuration.
Description: Remove the Neo4jConnection class that manages the Python driver lifecycle. Create a new Pydantic configuration class that holds all Neo4j Spark Connector options. The configuration exposes option dictionaries that can be passed directly to Spark read and write operations.
Todo List
- Remove neo4j Python driver imports and dependencies
- Create Pydantic configuration model for Spark Connector options
- Implement method to generate read options dictionary
- Implement method to generate write options dictionary
- Update settings to use new configuration model
- Code review and testing
Objective: Update constraint and index creation to work through Spark.
Description: The current schema initialization uses the Python driver to execute CREATE CONSTRAINT statements. Update this to execute constraint creation Cypher through the Spark Connector or a Spark-compatible method. Keep the same constraints for User.username, Group.name, ServicePrincipal.application_id, Catalog.name, Schema.full_name, and Table.full_name.
Todo List
- Update constraint creation to execute through Spark
- Implement graph clear operation for full sync
- Verify all six uniqueness constraints are created correctly
- Code review and testing
Objective: Replace all node and relationship writing with DataFrame-based operations.
Description: The current writer uses individual MERGE Cypher statements via execute_query. Replace this with DataFrame writes using the Spark Connector. Create DataFrames from the extracted Databricks data and write them using the neo4j format with appropriate options for labels, node.keys, relationship type, and source/target node keys.
Todo List
- Implement User node writing with DataFrame
- Implement Group node writing with DataFrame
- Implement ServicePrincipal node writing with DataFrame
- Implement Catalog node writing with DataFrame
- Implement Schema node writing with DataFrame
- Implement Table node writing with DataFrame
- Implement MEMBER_OF relationship writing with DataFrame
- Implement CONTAINS_SCHEMA relationship writing with DataFrame
- Implement CONTAINS_TABLE relationship writing with DataFrame
- Implement HAS_PRIVILEGE relationship writing with DataFrame
- Implement OWNS relationship writing with DataFrame
- Code review and testing
Objective: Replace all query methods with Spark Connector reads.
Description: The current queries module uses execute_query with Cypher strings. Replace this with Spark DataFrame reads using the neo4j format. Simple node lookups use the labels option. Complex traversals like access path discovery and nested group membership use the query option with custom Cypher including Quantified Path Patterns.
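As an illustration, the access-path traversal might use a Quantified Path Pattern like the one below. This is a sketch: the property names and hop bound are assumptions, and since the connector's `query` option does not take driver-side parameters, placeholders are interpolated into the string instead.

```python
# Hypothetical Cypher: follow zero or more MEMBER_OF hops from the user,
# then one HAS_PRIVILEGE edge to the target table. Braces are doubled
# because the string is filled in with str.format().
ACCESS_PATH_CYPHER = """
MATCH p = (u:User {{username: '{username}'}})
          (()-[:MEMBER_OF]->()){{0,5}}
          ()-[:HAS_PRIVILEGE]->(t:Table {{full_name: '{table_name}'}})
RETURN [n IN nodes(p) | labels(n)[0]] AS node_labels,
       [r IN relationships(p) | type(r)] AS rel_types
"""
```

Interpolated values must be escaped or validated, since they are spliced directly into Cypher.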
Todo List
- Implement graph statistics query
- Implement user accessible tables query
- Implement table access list query
- Implement access path discovery query
- Implement group impact analysis query
- Implement custom query execution method
- Code review and testing
Objective: Update sync orchestration to use Spark-based operations.
Description: The current GraphSync class coordinates the full sync workflow by calling writer and schema methods. Update this to orchestrate DataFrame operations. The sync workflow remains: initialize schema, clear graph if full sync, extract data from Databricks, transform to DataFrames, write nodes, write relationships, return statistics.
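The orchestration could then reduce to a sketch like this, where `config`, `clear_graph`, `create_constraints`, and the writer helpers are assumed names standing in for the connection, schema, and writer modules:

```python
def full_sync(spark, config, extracted_nodes, extracted_relationships):
    """Full sync: reset graph, create constraints, write nodes, then
    relationships, and return counts. Helper names are assumptions."""
    clear_graph(spark, config.read_options())
    create_constraints(spark, config.read_options())

    nodes_created = 0
    for label, rows in extracted_nodes.items():        # e.g. {"User": [...], ...}
        df = spark.createDataFrame(rows)
        write_nodes(df, config.write_options(), label)
        nodes_created += df.count()

    relationships_created = 0
    for rel_type, rows in extracted_relationships.items():
        df = spark.createDataFrame(rows)
        write_relationships(df, config.write_options(), rel_type)
        relationships_created += df.count()

    # In practice the counts would be wrapped in the project's
    # Pydantic sync-statistics model rather than a plain dict.
    return {"nodes_created": nodes_created,
            "relationships_created": relationships_created}
```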
Todo List
- Update sync to use new configuration
- Update sync to use DataFrame-based writes
- Update sync statistics collection
- Update sync result model
- Verify full sync produces correct graph
- Code review and testing
## Acceptance Criteria
- Full sync creates a graph structure identical to the current implementation's
- All query methods return equivalent results
- No Neo4j Python driver imports remain in the graph module
- All typed classes use Pydantic models