- FOLLOW THE REQUIREMENTS EXACTLY! Do not add new features or functionality beyond the specific requirements requested and documented.
- ALWAYS FIX THE CORE ISSUE!
- COMPLETE CHANGE: All occurrences must be changed in a single, atomic update
- CLEAN IMPLEMENTATION: Simple, direct replacements only
- NO MIGRATION PHASES: Do not create temporary compatibility periods
- NO ROLLBACK PLANS: Never create rollback plans
- NO PARTIAL UPDATES: Change everything or change nothing
- NO COMPATIBILITY LAYERS OR BACKWARDS COMPATIBILITY: Do not maintain old and new paths simultaneously
- NO BACKUPS OF OLD CODE: Do not comment out old code "just in case"
- NO CODE DUPLICATION: Do not duplicate functions to handle both patterns
- NO WRAPPER FUNCTIONS: Direct replacements only, no abstraction layers
- DO NOT CALL FUNCTIONS ENHANCED OR IMPROVED: Update the actual methods directly. For example, if a class PropertyIndex needs improvement, do not create a separate ImprovedPropertyIndex; update PropertyIndex itself.
- USE MODULES AND CLEAN CODE!
- Never name things after phases or steps: No test_phase_2.py etc.
- ALWAYS USE PYDANTIC for Typed Classes
Rewrite the Table Access Audit graph module to use the Neo4j Spark Connector instead of the Neo4j Python driver. All graph reads and writes must go through Spark DataFrames. The Databricks cluster is already configured.
## Graph Data Model
- Six node types: User, Group, ServicePrincipal, Catalog, Schema, Table
- Five relationship types: MEMBER_OF, HAS_PRIVILEGE, OWNS, CONTAINS_SCHEMA, CONTAINS_TABLE
- All node properties and relationship properties unchanged
- All uniqueness constraints unchanged
## Databricks Client
- The existing Databricks SDK client for extracting Unity Catalog data remains unchanged
- Data extraction logic stays the same
## Query Capabilities
- User accessible tables lookup
- Table access list
- Access path discovery
- Group impact analysis
- Graph statistics
## Connection Management
- Remove Neo4j Python driver dependency
- Replace with Spark session and Neo4j Spark Connector options
- Connection configuration becomes Spark Connector option dictionaries
## Write Operations
- Replace individual Cypher MERGE statements with DataFrame-based writes
- Use the Spark Connector's Overwrite mode, combined with key options, to preserve MERGE semantics
- Batch multiple records into single DataFrame operations
## Read Operations
- Replace execute_query calls with Spark DataFrame reads
- Use custom Cypher via the query option for complex traversals
- Results returned as DataFrames instead of dictionaries
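A minimal sketch of what these reads might look like. This is a sketch, not the final implementation: `read_nodes` and `read_with_cypher` are hypothetical helper names, and `read_options` is assumed to be a dictionary carrying the connector's `url`, authentication, and `database` options.

```python
def read_nodes(spark, read_options, label):
    """Read all nodes with the given label as a Spark DataFrame."""
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)      # url, authentication, database
        .option("labels", label)      # simple label-based read
        .load()
    )


def read_with_cypher(spark, read_options, cypher):
    """Run custom Cypher for complex traversals; rows come back as a DataFrame."""
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)
        .option("query", cypher)
        .load()
    )
```

Callers receive DataFrames and can apply further Spark transformations, or `.collect()` when materialized results are needed.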
## Sync Orchestration
- Orchestrate DataFrame transformations and writes
- Use Spark actions for write execution
Create a configuration class that holds Neo4j Spark Connector options including URI, authentication credentials, and database name. Use Pydantic for the configuration model.
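A sketch of such a configuration model, assuming basic authentication. Field and method names here are illustrative; the option keys follow the Neo4j Spark Connector's documented naming.

```python
from pydantic import BaseModel


class Neo4jSparkConfig(BaseModel):
    """Neo4j Spark Connector options (field names are assumptions)."""
    uri: str
    username: str
    password: str
    database: str = "neo4j"

    def _options(self) -> dict:
        return {
            "url": self.uri,
            "authentication.type": "basic",
            "authentication.basic.username": self.username,
            "authentication.basic.password": self.password,
            "database": self.database,
        }

    def read_options(self) -> dict:
        """Options for spark.read.format("org.neo4j.spark.DataSource")."""
        return self._options()

    def write_options(self) -> dict:
        """Options for df.write.format("org.neo4j.spark.DataSource")."""
        return self._options()
```

Read and write options are identical here; keeping two methods gives divergent settings (for example, write batch sizing) an obvious home later.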
Write all six node types using the Spark Connector. Each node type must use the correct labels and node.keys options to ensure MERGE behavior matches current functionality.
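A possible shape for the node writes, using the key properties from the existing uniqueness constraints. The helper name is illustrative; `write_options` is assumed to come from the configuration class.

```python
# Key property per label, matching the existing uniqueness constraints.
NODE_KEYS = {
    "User": "username",
    "Group": "name",
    "ServicePrincipal": "application_id",
    "Catalog": "name",
    "Schema": "full_name",
    "Table": "full_name",
}


def write_nodes(df, write_options, label):
    """MERGE-style node write: Overwrite mode plus node.keys makes the
    connector match on the key property and update existing nodes."""
    (
        df.write.format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .options(**write_options)
        .option("labels", f":{label}")
        .option("node.keys", NODE_KEYS[label])
        .save()
    )
```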
Write all five relationship types using the Spark Connector. Each relationship type must specify source node keys, target node keys, and relationship properties where applicable.
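A sketch for one relationship type (MEMBER_OF). The DataFrame column names (`member_username`, `group_name`) are assumptions about the extracted data; the option keys are the connector's documented relationship options.

```python
def write_member_of(df, write_options):
    """Write MEMBER_OF edges. The 'keys' save strategy plus 'Match' save mode
    connects existing User and Group nodes by their key properties."""
    (
        df.write.format("org.neo4j.spark.DataSource")
        .mode("Overwrite")
        .options(**write_options)
        .option("relationship", "MEMBER_OF")
        .option("relationship.save.strategy", "keys")
        .option("relationship.source.labels", ":User")
        .option("relationship.source.save.mode", "Match")
        .option("relationship.source.node.keys", "member_username:username")
        .option("relationship.target.labels", ":Group")
        .option("relationship.target.save.mode", "Match")
        .option("relationship.target.node.keys", "group_name:name")
        # For HAS_PRIVILEGE, relationship properties would be mapped with e.g.:
        # .option("relationship.properties", "privilege:privilege")
        .save()
    )
```

The `node.keys` entries use the connector's `column:property` mapping syntax.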
Create uniqueness constraints for all six node types. Handle constraint creation through the Spark Connector or direct Cypher execution.
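One way to run the constraint DDL through Spark is the connector's `script` option, which executes Cypher before a (here trivial) read query. Helper names are illustrative.

```python
# (label, key property) pairs for the six uniqueness constraints.
CONSTRAINTS = [
    ("User", "username"),
    ("Group", "name"),
    ("ServicePrincipal", "application_id"),
    ("Catalog", "name"),
    ("Schema", "full_name"),
    ("Table", "full_name"),
]


def constraint_cypher(label, prop):
    """Build the CREATE CONSTRAINT statement for one label/property pair."""
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


def create_constraints(spark, read_options):
    """Execute each constraint statement via the connector's script option."""
    for label, prop in CONSTRAINTS:
        (
            spark.read.format("org.neo4j.spark.DataSource")
            .options(**read_options)
            .option("script", constraint_cypher(label, prop))
            .option("query", "RETURN 1 AS ok")  # minimal query to trigger execution
            .load()
            .collect()
        )
```

An alternative worth evaluating is the connector's `schema.optimization.type` option on writes, which can create node constraints automatically.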
Support clearing all nodes and relationships for full sync operations.
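Clearing the graph can use the connector's `script` option as well; a sketch, with an illustrative helper name:

```python
CLEAR_GRAPH_CYPHER = "MATCH (n) DETACH DELETE n"


def clear_graph(spark, read_options):
    """Remove all nodes and relationships before a full sync."""
    (
        spark.read.format("org.neo4j.spark.DataSource")
        .options(**read_options)
        .option("script", CLEAR_GRAPH_CYPHER)
        .option("query", "RETURN 1 AS ok")
        .load()
        .collect()
    )
```

On large graphs a single DETACH DELETE may exhaust memory; batching the deletion (for example with `CALL { ... } IN TRANSACTIONS`) is worth considering.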
Implement all existing query methods using Spark Connector reads. Complex graph traversals use the query option with custom Cypher.
Coordinate the full sync process: clear graph, create constraints, write nodes, write relationships.
Return sync statistics (nodes created, relationships created) and query results using Pydantic models.
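The result models might look like the following; field names are assumptions to be aligned with the current return shapes.

```python
from typing import Dict

from pydantic import BaseModel


class SyncStats(BaseModel):
    """Returned by the sync orchestrator after a full sync."""
    nodes_created: int = 0
    relationships_created: int = 0


class GraphStats(BaseModel):
    """Returned by the graph statistics query."""
    node_counts: Dict[str, int]
    relationship_counts: Dict[str, int]
```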
## File Changes
| File | Purpose |
|---|---|
| connection.py | Replace with Spark Connector configuration |
| schema.py | Update constraint creation for Spark execution |
| writer.py | Replace with DataFrame-based writes |
| queries.py | Replace with DataFrame-based reads |
| sync.py | Update orchestration for Spark operations |
Objective: Replace Neo4j Python driver connection with Spark Connector configuration.
Description: Remove the Neo4jConnection class that manages the Python driver lifecycle. Create a new Pydantic configuration class that holds all Neo4j Spark Connector options. The configuration exposes option dictionaries that can be passed directly to Spark read and write operations.
Todo List
- Remove neo4j Python driver imports and dependencies
- Create Pydantic configuration model for Spark Connector options
- Implement method to generate read options dictionary
- Implement method to generate write options dictionary
- Update settings to use new configuration model
- Code review and testing
Objective: Update constraint and index creation to work through Spark.
Description: The current schema initialization uses the Python driver to execute CREATE CONSTRAINT statements. Update this to execute constraint creation Cypher through the Spark Connector or a Spark-compatible method. Keep the same constraints for User.username, Group.name, ServicePrincipal.application_id, Catalog.name, Schema.full_name, and Table.full_name.
Todo List
- Update constraint creation to execute through Spark
- Implement graph clear operation for full sync
- Verify all six uniqueness constraints are created correctly
- Code review and testing
Objective: Replace all node and relationship writing with DataFrame-based operations.
Description: The current writer uses individual MERGE Cypher statements via execute_query. Replace this with DataFrame writes using the Spark Connector. Create DataFrames from the extracted Databricks data and write them using the neo4j format with appropriate options for labels, node.keys, relationship type, and source/target node keys.
Todo List
- Implement User node writing with DataFrame
- Implement Group node writing with DataFrame
- Implement ServicePrincipal node writing with DataFrame
- Implement Catalog node writing with DataFrame
- Implement Schema node writing with DataFrame
- Implement Table node writing with DataFrame
- Implement MEMBER_OF relationship writing with DataFrame
- Implement CONTAINS_SCHEMA relationship writing with DataFrame
- Implement CONTAINS_TABLE relationship writing with DataFrame
- Implement HAS_PRIVILEGE relationship writing with DataFrame
- Implement OWNS relationship writing with DataFrame
- Code review and testing
Objective: Replace all query methods with Spark Connector reads.
Description: The current queries module uses execute_query with Cypher strings. Replace this with Spark DataFrame reads using the neo4j format. Simple node lookups use the labels option. Complex traversals like access path discovery and nested group membership use the query option with custom Cypher including Quantified Path Patterns.
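As an illustration, the access-path traversal might use a Quantified Path Pattern like the one below. This is a sketch: the property names and hop bound are assumptions, and since the connector's `query` option does not take driver-side parameters, placeholders are interpolated into the string instead.

```python
# Hypothetical Cypher: follow zero or more MEMBER_OF hops from the user,
# then one HAS_PRIVILEGE edge to the target table. Braces are doubled
# because the string is filled in with str.format().
ACCESS_PATH_CYPHER = """
MATCH p = (u:User {{username: '{username}'}})
          (()-[:MEMBER_OF]->()){{0,5}}
          ()-[:HAS_PRIVILEGE]->(t:Table {{full_name: '{table_name}'}})
RETURN [n IN nodes(p) | labels(n)[0]] AS node_labels,
       [r IN relationships(p) | type(r)] AS rel_types
"""
```

Interpolated values must be escaped or validated, since they are spliced directly into Cypher.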
Todo List
- Implement graph statistics query
- Implement user accessible tables query
- Implement table access list query
- Implement access path discovery query
- Implement group impact analysis query
- Implement custom query execution method
- Code review and testing
Objective: Update sync orchestration to use Spark-based operations.
Description: The current GraphSync class coordinates the full sync workflow by calling writer and schema methods. Update this to orchestrate DataFrame operations. The sync workflow remains: initialize schema, clear graph if full sync, extract data from Databricks, transform to DataFrames, write nodes, write relationships, return statistics.
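The orchestration could then reduce to a sketch like this, where `config`, `clear_graph`, `create_constraints`, and the writer helpers are assumed names standing in for the connection, schema, and writer modules:

```python
def full_sync(spark, config, extracted_nodes, extracted_relationships):
    """Full sync: reset graph, create constraints, write nodes, then
    relationships, and return counts. Helper names are assumptions."""
    clear_graph(spark, config.read_options())
    create_constraints(spark, config.read_options())

    nodes_created = 0
    for label, rows in extracted_nodes.items():        # e.g. {"User": [...], ...}
        df = spark.createDataFrame(rows)
        write_nodes(df, config.write_options(), label)
        nodes_created += df.count()

    relationships_created = 0
    for rel_type, rows in extracted_relationships.items():
        df = spark.createDataFrame(rows)
        write_relationships(df, config.write_options(), rel_type)
        relationships_created += df.count()

    # In practice the counts would be wrapped in the project's
    # Pydantic sync-statistics model rather than a plain dict.
    return {"nodes_created": nodes_created,
            "relationships_created": relationships_created}
```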
Todo List
- Update sync to use new configuration
- Update sync to use DataFrame-based writes
- Update sync statistics collection
- Update sync result model
- Verify full sync produces correct graph
- Code review and testing
## Acceptance Criteria
- Full sync creates a graph structure identical to the current implementation's
- All query methods return equivalent results
- No Neo4j Python driver imports remain in the graph module
- All typed classes use Pydantic models