This proposal extends the existing Table Access Audit Tool to store Databricks Unity Catalog permissions data in a Neo4j knowledge graph. By modeling users, groups, catalogs, schemas, tables, and permissions as nodes and relationships, we enable powerful graph traversal for access path discovery, inheritance visualization, and impact analysis.
This is a demo-focused implementation prioritizing simplicity and rapid iteration over enterprise features.
The current implementation collects permissions into Python data classes and outputs flat reports. This approach struggles with:
- Inheritance Complexity: Permissions cascade from catalogs to schemas to tables. Flat data requires repeated joins to trace inheritance.
- Group Membership Depth: Users belong to groups, which can nest. Resolving effective access requires recursive traversal.
- Access Path Discovery: Finding all ways a user can reach a table is computationally expensive in flat structures.
Neo4j excels where flat data struggles:
| Challenge | Flat Data | Graph |
|---|---|---|
| "How does User X access Table Y?" | Multiple joins | Single path query |
| "Who is affected if we remove Group G?" | Recursive queries | Native traversal |
| "Show permission inheritance paths" | Complex procedures | Built-in visualization |
| Node Label | Description | Key Properties |
|---|---|---|
| User | Individual user account | id, username, displayName, email |
| Group | Security group | id, name |
| ServicePrincipal | Machine identity | id, applicationId, displayName |
| Node Label | Description | Key Properties |
|---|---|---|
| Catalog | Database container | name, owner |
| Schema | Schema within a catalog | name, fullName, owner |
| Table | Table or view | name, fullName, tableType, owner |
| Relationship | From → To | Description |
|---|---|---|
| CONTAINS_SCHEMA | Catalog → Schema | Schema belongs to catalog |
| CONTAINS_TABLE | Schema → Table | Table belongs to schema |
| Relationship | From → To | Description |
|---|---|---|
| MEMBER_OF | User → Group | User is member of group |
| MEMBER_OF | Group → Group | Nested group membership |
| Relationship | From → To | Description | Properties |
|---|---|---|---|
| HAS_PRIVILEGE | Principal → Asset | Permission grant | privilege |
| OWNS | Principal → Asset | Ownership | - |
(alice:User) -[:MEMBER_OF]-> (data-engineers:Group)
|
v
-[:HAS_PRIVILEGE {privilege: "USE_CATALOG"}]->
|
v
(analytics:Catalog)
|
-[:CONTAINS_SCHEMA]->
|
v
(sales:Schema)
|
-[:CONTAINS_TABLE]->
|
v
(orders:Table)
The graph enables queries that would be complex with flat data:
"Through what paths can a user access a specific table?"
Finds all paths connecting a user to a table, revealing direct grants, group-based access, and inheritance chains.
"Which users would lose access if we remove a group's privilege?"
Identifies all users who depend on a specific group for access to resources.
"What tables can User X access?"
Lists all tables reachable by a user through any combination of direct grants and group memberships.
┌─────────────────────────────────────────────────────────────┐
│ Table Access Audit Tool │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Data Collection Layer │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Identity │ │ Catalog │ │ Permission │ │ │
│ │ │ Collector │ │ Scanner │ │ Resolver │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Users │ │ Catalogs │ │ Grants │ │ │
│ │ │ Groups │ │ Schemas │ │ Ownership │ │ │
│ │ │ │ │ Tables │ │ │ │ │
│ │ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │ │
│ │ └───────────────┼───────────────┘ │ │
│ └────────────────────────┼────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Neo4j Sync Layer │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Connection │ │ Graph │ │ Query │ │ │
│ │ │ Manager │ │ Writer │ │ Executor │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CLI Layer │ │
│ │ │ │
│ │ sync query user-access table-access │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
- Collection: Databricks SDK client collects identity, catalog, and permission data
- Sync: Graph writer loads data into Neo4j using MERGE operations
- Query: Query executor runs Cypher queries for analysis
For this demo, we target:
| Option | Use Case |
|---|---|
| Neo4j Desktop | Local development |
| Neo4j AuraDB Free | Cloud-hosted evaluation (200K nodes) |
| Docker | CI/CD and team testing |
The demo implements full sync only - a complete refresh of all data from Databricks to Neo4j on each run:
- Clear existing graph - Delete all nodes and relationships
- Collect from Databricks - Gather users, groups, catalogs, schemas, tables, and grants
- Write to Neo4j - Create all nodes and relationships
This approach is simple and ensures the graph always reflects the current Databricks state. Databricks is the source of truth.
- Verify connectivity to both Databricks and Neo4j
- Clear the graph (delete all existing data)
- Sync identities - Create User, Group, ServicePrincipal nodes
- Sync group memberships - Create MEMBER_OF relationships
- Sync catalog structure - Create Catalog, Schema, Table nodes and CONTAINS relationships
- Sync permissions - Create HAS_PRIVILEGE and OWNS relationships
- Log summary - Report counts of entities synced
Objective: Establish Neo4j connectivity and basic schema
Status: Complete
Implementation Details:
- Added
neo4j,pydantic,pydantic-settings, andpython-dotenvto project dependencies - Created
settings.pymodule with Pydantic settings for Neo4j configuration (loads from.env) - Created
graph/package with connection manager and schema modules - Implemented
Neo4jConnectionclass with context manager support, connection pooling, and query execution - Implemented
GraphSchemaclass with constraint definitions for all node types - Added CLI commands:
graph-test,graph-init,graph-status
Files Created:
src/table_access_audit/settings.py- Pydantic settings for Neo4j and Databrickssrc/table_access_audit/graph/__init__.py- Graph package exportssrc/table_access_audit/graph/connection.py- Neo4j connection managersrc/table_access_audit/graph/schema.py- Graph schema and constraints
CLI Commands Added:
# Test Neo4j connection
uv run table-access-audit graph-test
# Initialize graph schema (create constraints)
uv run table-access-audit graph-init
# Show graph status and statistics
uv run table-access-audit graph-status- Add neo4j Python driver to project dependencies
- Create Neo4j connection manager with environment variable configuration
- Implement schema initialization to create uniqueness constraints
- Write basic connection test
- Code review and testing
Objective: Implement full sync from Databricks to Neo4j
Status: Complete
Implementation Details:
- Created
GraphWriterclass with MERGE operations for all node and relationship types - Implemented
GraphSyncorchestrator that coordinates the full sync process - Full sync clears graph, then syncs: identities → memberships → catalog structure → permissions
- Progress logging during each sync phase
- Support for catalog filtering and system schema exclusion
Files Created:
src/table_access_audit/graph/writer.py- GraphWriter with MERGE operations for nodes and relationshipssrc/table_access_audit/graph/sync.py- GraphSync orchestrator and SyncResult dataclass
CLI Commands Added:
# Full sync from Databricks to Neo4j
uv run table-access-audit sync
# Sync only a specific catalog
uv run table-access-audit sync --catalog <catalog_name>
# Include system catalog and information_schema
uv run table-access-audit sync --include-systemSync Process:
- Initialize Neo4j schema (create constraints if needed)
- Clear existing graph data
- Sync identities: Users, Groups, Service Principals
- Sync group memberships: MEMBER_OF relationships
- Sync catalog structure: Catalogs → Schemas → Tables with CONTAINS relationships
- Sync permissions: HAS_PRIVILEGE and OWNS relationships
- Report summary with counts
- Create graph writer module with MERGE operations for nodes
- Implement identity sync (users, groups, service principals)
- Implement group membership sync (MEMBER_OF relationships)
- Implement catalog structure sync (catalogs, schemas, tables)
- Implement permission sync (HAS_PRIVILEGE, OWNS relationships)
- Create full sync orchestrator that clears graph and runs all syncs
- Add progress logging during sync
- Code review and testing
Objective: Create essential Cypher queries for access analysis
Status: Complete
Implementation Details:
- Created
QueryExecutorclass with parameterized Cypher query support - Pydantic models for query results:
TableAccess,PrincipalAccess,AccessPath,ImpactedUser,QueryResult - Transitive group membership traversal using
MEMBER_OF*1..5pattern - Combined direct and group-based access in unified results
- Graph statistics query for monitoring
Files Created:
src/table_access_audit/graph/queries.py- QueryExecutor and Pydantic result models
Query Methods:
| Method | Description |
|---|---|
get_user_accessible_tables(username, catalog_filter) |
All tables a user can access (direct + via groups) |
get_table_access_list(table_full_name) |
All principals with access to a table |
get_access_paths(username, table_full_name) |
All paths from user to table |
get_group_impact(group_name) |
Users impacted by removing group's access |
get_graph_statistics() |
Node and relationship counts |
find_users_with_privilege(privilege, securable_type) |
Find users with specific privilege |
execute_custom_query(query, parameters) |
Run arbitrary Cypher queries |
Updated local_demo.py with query options:
uv run src/local_demo.py # Full sync + demo queries
uv run src/local_demo.py --query-only # Skip sync, run demo queries
uv run src/local_demo.py --stats # Show graph statistics
uv run src/local_demo.py --user-access USER # Tables accessible by user
uv run src/local_demo.py --table-access TABLE # Who can access a table
uv run src/local_demo.py --paths USER TABLE # Access paths from user to table
uv run src/local_demo.py --group-impact GROUP # Users impacted by group- Create query executor with parameterized query support
- Implement "user access" query - all tables a user can access
- Implement "table access" query - all principals with access to a table
- Implement "access paths" query - paths from user to table
- Implement "group impact" query - users affected by group permission change
- Code review and testing
Objective: Add graph commands to existing CLI
- Add
synccommand to run full sync (completed in Phase 2) - Add
user-accesscommand to show tables accessible by a user - Add
table-accesscommand to show who can access a table - Add
pathscommand to show access paths from user to table - Update CLI help documentation
- Code review and testing
Objective: Prepare for demonstration
- Create example queries document for Neo4j Browser exploration
- Write setup instructions for Neo4j Desktop and AuraDB Free
- Create demo script showing key capabilities (
src/local_demo.py) - Test end-to-end flow with real Databricks workspace
- Code review and testing
| Package | Purpose |
|---|---|
| neo4j | Official Neo4j Python driver |
| databricks-sdk | Databricks API client (existing) |
| pydantic | Data validation and serialization |
| pydantic-settings | Settings management with environment variable loading |
| python-dotenv | Load environment variables from .env files |
| Component | Requirement |
|---|---|
| Neo4j | Version 5.x (Desktop, AuraDB Free, or Docker) |
| Python | 3.11+ |
| Variable | Description | Example |
|---|---|---|
NEO4J_URI |
Neo4j connection URI | neo4j://localhost:7687 |
NEO4J_USERNAME |
Neo4j username | neo4j |
NEO4J_PASSWORD |
Neo4j password | password |
DATABRICKS_HOST |
Databricks workspace URL | (existing) |
DATABRICKS_TOKEN |
Databricks PAT | (existing) |
The demo is successful when:
- Full sync completes and populates the Neo4j graph
- Graph can be explored visually in Neo4j Browser
- CLI queries return correct access information
- Access paths from user to table are discoverable
After the demo, potential enhancements include:
- Incremental sync for efficiency
- More query types (orphaned permissions, compliance reports)
- Export to CSV/JSON
- Temporal snapshots for audit history
- Integration with Neo4j Bloom for visual exploration
- Neo4j Python Driver: https://neo4j.com/docs/python-manual/current/
- Neo4j AuraDB Free: https://neo4j.com/cloud/aura-free/
- Cypher Manual: https://neo4j.com/docs/cypher-manual/current/
- Unity Catalog Privileges: https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges
- Databricks SDK: https://docs.databricks.com/aws/en/dev-tools/sdk-python