Proposal: Neo4j Knowledge Graph for Databricks Unity Catalog Permissions

Executive Summary

This proposal extends the existing Table Access Audit Tool to store Databricks Unity Catalog permissions data in a Neo4j knowledge graph. By modeling users, groups, catalogs, schemas, tables, and permissions as nodes and relationships, we enable powerful graph traversal for access path discovery, inheritance visualization, and impact analysis.

This is a demo-focused implementation prioritizing simplicity and rapid iteration over enterprise features.

Why a Knowledge Graph?

The Limitations of Flat Data

The current implementation collects permissions into Python data classes and outputs flat reports. This approach struggles with:

Inheritance Complexity: Permissions cascade from catalogs to schemas to tables. Flat data requires repeated joins to trace inheritance.
Group Membership Depth: Users belong to groups, which can nest. Resolving effective access requires recursive traversal.
Access Path Discovery: Finding all ways a user can reach a table is computationally expensive in flat structures.

The Power of Relationships

Neo4j excels where flat data struggles:

Challenge	Flat Data	Graph
"How does User X access Table Y?"	Multiple joins	Single path query
"Who is affected if we remove Group G?"	Recursive queries	Native traversal
"Show permission inheritance paths"	Complex procedures	Built-in visualization

Data Model

Node Types

Identity Nodes

Node Label	Description	Key Properties
User	Individual user account	id, username, displayName, email
Group	Security group	id, name
ServicePrincipal	Machine identity	id, applicationId, displayName

Data Asset Nodes

Node Label	Description	Key Properties
Catalog	Database container	name, owner
Schema	Schema within a catalog	name, fullName, owner
Table	Table or view	name, fullName, tableType, owner

Relationship Types

Structural Relationships

Relationship	From → To	Description
CONTAINS_SCHEMA	Catalog → Schema	Schema belongs to catalog
CONTAINS_TABLE	Schema → Table	Table belongs to schema

Membership Relationships

Relationship	From → To	Description
MEMBER_OF	User → Group	User is member of group
MEMBER_OF	Group → Group	Nested group membership

Permission Relationships

Relationship	From → To	Description	Properties
HAS_PRIVILEGE	Principal → Asset	Permission grant	privilege
OWNS	Principal → Asset	Ownership	-

Example Graph Structure

(alice:User) -[:MEMBER_OF]-> (data-engineers:Group)
                                      |
                                      v
                            -[:HAS_PRIVILEGE {privilege: "USE_CATALOG"}]->
                                      |
                                      v
                          (analytics:Catalog)
                                      |
                            -[:CONTAINS_SCHEMA]->
                                      |
                                      v
                              (sales:Schema)
                                      |
                            -[:CONTAINS_TABLE]->
                                      |
                                      v
                            (orders:Table)

Query Capabilities

The graph enables queries that would be complex with flat data:

Access Path Discovery

"Through what paths can a user access a specific table?"

Finds all paths connecting a user to a table, revealing direct grants, group-based access, and inheritance chains.

Impact Analysis

"Which users would lose access if we remove a group's privilege?"

Identifies all users who depend on a specific group for access to resources.

Privilege Overview

"What tables can User X access?"

Lists all tables reachable by a user through any combination of direct grants and group memberships.

Architecture

Component Overview

┌─────────────────────────────────────────────────────────────┐
│                   Table Access Audit Tool                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Data Collection Layer                   │    │
│  │                                                      │    │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │    │
│  │  │  Identity  │  │  Catalog   │  │ Permission │     │    │
│  │  │  Collector │  │  Scanner   │  │  Resolver  │     │    │
│  │  │            │  │            │  │            │     │    │
│  │  │  Users     │  │  Catalogs  │  │  Grants    │     │    │
│  │  │  Groups    │  │  Schemas   │  │  Ownership │     │    │
│  │  │            │  │  Tables    │  │            │     │    │
│  │  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘     │    │
│  │        └───────────────┼───────────────┘            │    │
│  └────────────────────────┼────────────────────────────┘    │
│                           ▼                                  │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               Neo4j Sync Layer                       │    │
│  │                                                      │    │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │    │
│  │  │ Connection │  │   Graph    │  │   Query    │     │    │
│  │  │  Manager   │  │   Writer   │  │  Executor  │     │    │
│  │  └────────────┘  └────────────┘  └────────────┘     │    │
│  │                                                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                    CLI Layer                         │    │
│  │                                                      │    │
│  │    sync     query     user-access     table-access   │    │
│  │                                                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Data Flow

Collection: Databricks SDK client collects identity, catalog, and permission data
Sync: Graph writer loads data into Neo4j using MERGE operations
Query: Query executor runs Cypher queries for analysis

Neo4j Deployment

For this demo, we target:

Option	Use Case
Neo4j Desktop	Local development
Neo4j AuraDB Free	Cloud-hosted evaluation (200K nodes)
Docker	CI/CD and team testing

Synchronization Strategy

Full Sync

The demo implements full sync only - a complete refresh of all data from Databricks to Neo4j on each run:

Clear existing graph - Delete all nodes and relationships
Collect from Databricks - Gather users, groups, catalogs, schemas, tables, and grants
Write to Neo4j - Create all nodes and relationships

This approach is simple and ensures the graph always reflects the current Databricks state. Databricks is the source of truth.

Sync Process

Verify connectivity to both Databricks and Neo4j
Clear the graph (delete all existing data)
Sync identities - Create User, Group, ServicePrincipal nodes
Sync group memberships - Create MEMBER_OF relationships
Sync catalog structure - Create Catalog, Schema, Table nodes and CONTAINS relationships
Sync permissions - Create HAS_PRIVILEGE and OWNS relationships
Log summary - Report counts of entities synced

Implementation Plan

Phase 1: Neo4j Foundation ✅ COMPLETED

Objective: Establish Neo4j connectivity and basic schema

Status: Complete

Implementation Details:

Added neo4j, pydantic, pydantic-settings, and python-dotenv to project dependencies
Created settings.py module with Pydantic settings for Neo4j configuration (loads from .env)
Created graph/ package with connection manager and schema modules
Implemented Neo4jConnection class with context manager support, connection pooling, and query execution
Implemented GraphSchema class with constraint definitions for all node types
Added CLI commands: graph-test, graph-init, graph-status

Files Created:

src/table_access_audit/settings.py - Pydantic settings for Neo4j and Databricks
src/table_access_audit/graph/__init__.py - Graph package exports
src/table_access_audit/graph/connection.py - Neo4j connection manager
src/table_access_audit/graph/schema.py - Graph schema and constraints

CLI Commands Added:

# Test Neo4j connection
uv run table-access-audit graph-test

# Initialize graph schema (create constraints)
uv run table-access-audit graph-init

# Show graph status and statistics
uv run table-access-audit graph-status

Todo List

Add neo4j Python driver to project dependencies
Create Neo4j connection manager with environment variable configuration
Implement schema initialization to create uniqueness constraints
Write basic connection test
Code review and testing

Phase 2: Graph Sync Implementation ✅ COMPLETED

Objective: Implement full sync from Databricks to Neo4j

Status: Complete

Implementation Details:

Created GraphWriter class with MERGE operations for all node and relationship types
Implemented GraphSync orchestrator that coordinates the full sync process
Full sync clears graph, then syncs: identities → memberships → catalog structure → permissions
Progress logging during each sync phase
Support for catalog filtering and system schema exclusion

Files Created:

src/table_access_audit/graph/writer.py - GraphWriter with MERGE operations for nodes and relationships
src/table_access_audit/graph/sync.py - GraphSync orchestrator and SyncResult dataclass

CLI Commands Added:

# Full sync from Databricks to Neo4j
uv run table-access-audit sync

# Sync only a specific catalog
uv run table-access-audit sync --catalog <catalog_name>

# Include system catalog and information_schema
uv run table-access-audit sync --include-system

Sync Process:

Initialize Neo4j schema (create constraints if needed)
Clear existing graph data
Sync identities: Users, Groups, Service Principals
Sync group memberships: MEMBER_OF relationships
Sync catalog structure: Catalogs → Schemas → Tables with CONTAINS relationships
Sync permissions: HAS_PRIVILEGE and OWNS relationships
Report summary with counts

Todo List

Create graph writer module with MERGE operations for nodes
Implement identity sync (users, groups, service principals)
Implement group membership sync (MEMBER_OF relationships)
Implement catalog structure sync (catalogs, schemas, tables)
Implement permission sync (HAS_PRIVILEGE, OWNS relationships)
Create full sync orchestrator that clears graph and runs all syncs
Add progress logging during sync
Code review and testing

Phase 3: Query Implementation ✅ COMPLETED

Objective: Create essential Cypher queries for access analysis

Status: Complete

Implementation Details:

Created QueryExecutor class with parameterized Cypher query support
Pydantic models for query results: TableAccess, PrincipalAccess, AccessPath, ImpactedUser, QueryResult
Transitive group membership traversal using MEMBER_OF*1..5 pattern
Combined direct and group-based access in unified results
Graph statistics query for monitoring

Files Created:

src/table_access_audit/graph/queries.py - QueryExecutor and Pydantic result models

Query Methods:

Method	Description
`get_user_accessible_tables(username, catalog_filter)`	All tables a user can access (direct + via groups)
`get_table_access_list(table_full_name)`	All principals with access to a table
`get_access_paths(username, table_full_name)`	All paths from user to table
`get_group_impact(group_name)`	Users impacted by removing group's access
`get_graph_statistics()`	Node and relationship counts
`find_users_with_privilege(privilege, securable_type)`	Find users with specific privilege
`execute_custom_query(query, parameters)`	Run arbitrary Cypher queries

Updated local_demo.py with query options:

uv run src/local_demo.py                          # Full sync + demo queries
uv run src/local_demo.py --query-only             # Skip sync, run demo queries
uv run src/local_demo.py --stats                  # Show graph statistics
uv run src/local_demo.py --user-access USER       # Tables accessible by user
uv run src/local_demo.py --table-access TABLE     # Who can access a table
uv run src/local_demo.py --paths USER TABLE       # Access paths from user to table
uv run src/local_demo.py --group-impact GROUP     # Users impacted by group

Todo List

Create query executor with parameterized query support
Implement "user access" query - all tables a user can access
Implement "table access" query - all principals with access to a table
Implement "access paths" query - paths from user to table
Implement "group impact" query - users affected by group permission change
Code review and testing

Phase 4: CLI Integration

Objective: Add graph commands to existing CLI

Todo List

Add sync command to run full sync (completed in Phase 2)
Add user-access command to show tables accessible by a user
Add table-access command to show who can access a table
Add paths command to show access paths from user to table
Update CLI help documentation
Code review and testing

Phase 5: Demo Polish

Objective: Prepare for demonstration

Todo List

Create example queries document for Neo4j Browser exploration
Write setup instructions for Neo4j Desktop and AuraDB Free
Create demo script showing key capabilities (src/local_demo.py)
Test end-to-end flow with real Databricks workspace
Code review and testing

Dependencies

Required Python Packages

Package	Purpose
neo4j	Official Neo4j Python driver
databricks-sdk	Databricks API client (existing)
pydantic	Data validation and serialization
pydantic-settings	Settings management with environment variable loading
python-dotenv	Load environment variables from .env files

Infrastructure

Component	Requirement
Neo4j	Version 5.x (Desktop, AuraDB Free, or Docker)
Python	3.11+

Configuration

Environment Variables

Variable	Description	Example
`NEO4J_URI`	Neo4j connection URI	`neo4j://localhost:7687`
`NEO4J_USERNAME`	Neo4j username	`neo4j`
`NEO4J_PASSWORD`	Neo4j password	`password`
`DATABRICKS_HOST`	Databricks workspace URL	(existing)
`DATABRICKS_TOKEN`	Databricks PAT	(existing)

Success Criteria

The demo is successful when:

Full sync completes and populates the Neo4j graph
Graph can be explored visually in Neo4j Browser
CLI queries return correct access information
Access paths from user to table are discoverable

Future Enhancements

After the demo, potential enhancements include:

Incremental sync for efficiency
More query types (orphaned permissions, compliance reports)
Export to CSV/JSON
Temporal snapshots for audit history
Integration with Neo4j Bloom for visual exploration

References

Neo4j

Neo4j Python Driver: https://neo4j.com/docs/python-manual/current/
Neo4j AuraDB Free: https://neo4j.com/cloud/aura-free/
Cypher Manual: https://neo4j.com/docs/cypher-manual/current/

Databricks

Unity Catalog Privileges: https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges
Databricks SDK: https://docs.databricks.com/aws/en/dev-tools/sdk-python

FilesExpand file tree

PERMS_GRAPH.md

Latest commit

History

PERMS_GRAPH.md

File metadata and controls

Proposal: Neo4j Knowledge Graph for Databricks Unity Catalog Permissions

Executive Summary

Why a Knowledge Graph?

The Limitations of Flat Data

The Power of Relationships

Data Model

Node Types

Identity Nodes

Data Asset Nodes

Relationship Types

Structural Relationships

Membership Relationships

Permission Relationships

Example Graph Structure

Query Capabilities

Access Path Discovery

Impact Analysis

Privilege Overview

Architecture

Component Overview

Data Flow

Neo4j Deployment

Synchronization Strategy

Full Sync

Sync Process

Implementation Plan

Phase 1: Neo4j Foundation ✅ COMPLETED

Todo List

Phase 2: Graph Sync Implementation ✅ COMPLETED

Todo List

Phase 3: Query Implementation ✅ COMPLETED

Todo List

Phase 4: CLI Integration

Todo List

Phase 5: Demo Polish

Todo List

Dependencies

Required Python Packages

Infrastructure

Configuration

Environment Variables

Success Criteria

Future Enhancements

References

Neo4j

Databricks