Proposal: Neo4j Knowledge Graph for Databricks Unity Catalog Permissions

Executive Summary

This proposal extends the existing Table Access Audit Tool to store Databricks Unity Catalog permissions data in a Neo4j knowledge graph. Modeling users, groups, service principals, catalogs, schemas, tables, and permissions as nodes and relationships turns access path discovery, inheritance visualization, and impact analysis into graph traversals.

This is a demo-focused implementation prioritizing simplicity and rapid iteration over enterprise features.


Why a Knowledge Graph?

The Limitations of Flat Data

The current implementation collects permissions into Python data classes and outputs flat reports. This approach struggles with:

  • Inheritance Complexity: Permissions cascade from catalogs to schemas to tables. Flat data requires repeated joins to trace inheritance.
  • Group Membership Depth: Users belong to groups, which can nest. Resolving effective access requires recursive traversal.
  • Access Path Discovery: Finding all ways a user can reach a table is computationally expensive in flat structures.

The Power of Relationships

Neo4j excels where flat data struggles:

| Challenge | Flat Data | Graph |
|---|---|---|
| "How does User X access Table Y?" | Multiple joins | Single path query |
| "Who is affected if we remove Group G?" | Recursive queries | Native traversal |
| "Show permission inheritance paths" | Complex procedures | Built-in visualization |

Data Model

Node Types

Identity Nodes

| Node Label | Description | Key Properties |
|---|---|---|
| User | Individual user account | id, username, displayName, email |
| Group | Security group | id, name |
| ServicePrincipal | Machine identity | id, applicationId, displayName |

Data Asset Nodes

| Node Label | Description | Key Properties |
|---|---|---|
| Catalog | Database container | name, owner |
| Schema | Schema within a catalog | name, fullName, owner |
| Table | Table or view | name, fullName, tableType, owner |

Relationship Types

Structural Relationships

| Relationship | From → To | Description |
|---|---|---|
| CONTAINS_SCHEMA | Catalog → Schema | Schema belongs to catalog |
| CONTAINS_TABLE | Schema → Table | Table belongs to schema |

Membership Relationships

| Relationship | From → To | Description |
|---|---|---|
| MEMBER_OF | User → Group | User is member of group |
| MEMBER_OF | Group → Group | Nested group membership |

Permission Relationships

| Relationship | From → To | Description | Properties |
|---|---|---|---|
| HAS_PRIVILEGE | Principal → Asset | Permission grant | privilege |
| OWNS | Principal → Asset | Ownership | - |

Example Graph Structure

(alice:User) -[:MEMBER_OF]-> (data-engineers:Group)
                                      |
                                      v
                            -[:HAS_PRIVILEGE {privilege: "USE_CATALOG"}]->
                                      |
                                      v
                          (analytics:Catalog)
                                      |
                            -[:CONTAINS_SCHEMA]->
                                      |
                                      v
                              (sales:Schema)
                                      |
                            -[:CONTAINS_TABLE]->
                                      |
                                      v
                            (orders:Table)
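
To make the example concrete, the same structure can be mimicked in plain Python to see what a path query would return. This is only an illustrative sketch: the edge list and the `find_paths` helper are hypothetical and not part of the tool, and Neo4j performs this kind of traversal natively.

```python
from collections import defaultdict

# Edges of the example graph as (source, relationship, target) triples.
EDGES = [
    ("alice", "MEMBER_OF", "data-engineers"),
    ("data-engineers", "HAS_PRIVILEGE", "analytics"),
    ("analytics", "CONTAINS_SCHEMA", "sales"),
    ("sales", "CONTAINS_TABLE", "orders"),
]


def find_paths(edges, start, goal):
    """Return every simple path from start to goal as a list of edge triples."""
    adjacency = defaultdict(list)
    for source, rel, target in edges:
        adjacency[source].append((rel, target))

    paths = []

    def walk(node, hops, visited):
        if node == goal:
            paths.append(list(hops))
            return
        for rel, target in adjacency[node]:
            if target not in visited:  # guards against cycles in nested groups
                hops.append((node, rel, target))
                walk(target, hops, visited | {target})
                hops.pop()

    walk(start, [], {start})
    return paths


paths = find_paths(EDGES, "alice", "orders")
for path in paths:
    print(path[0][0] + "".join(f" -[{rel}]-> {target}" for _, rel, target in path))
# alice -[MEMBER_OF]-> data-engineers -[HAS_PRIVILEGE]-> analytics -[CONTAINS_SCHEMA]-> sales -[CONTAINS_TABLE]-> orders
```

Adding a direct grant such as `("alice", "HAS_PRIVILEGE", "orders")` to the edge list yields a second, shorter path, which is the kind of redundancy an access path query is meant to surface.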

Query Capabilities

The graph enables queries that would be complex with flat data:

Access Path Discovery

"Through what paths can a user access a specific table?"

Finds all paths connecting a user to a table, revealing direct grants, group-based access, and inheritance chains.
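
One way such a query might look is sketched below. The Cypher text is an assumption built from the data model in this proposal, not the tool's actual query, and it deliberately simplifies Unity Catalog semantics by treating any HAS_PRIVILEGE grant on the table or one of its parents as a route to the table. The `run` argument stands for any callable shaped like the official driver's `Session.run(query, parameters)`; `find_access_paths` and `canned_run` are hypothetical names.

```python
# Hypothetical query: a user, zero to five MEMBER_OF hops, one grant,
# then zero to two containment hops down to the target table.
ACCESS_PATHS_QUERY = """
MATCH path = (u:User {username: $username})
             -[:MEMBER_OF*0..5]->(principal)
             -[:HAS_PRIVILEGE]->(asset)
             -[:CONTAINS_SCHEMA|CONTAINS_TABLE*0..2]->(t:Table {fullName: $table_full_name})
RETURN [n IN nodes(path) | coalesce(n.username, n.fullName, n.name)] AS hops,
       [r IN relationships(path) | type(r)] AS via
"""


def find_access_paths(run, username, table_full_name):
    """Execute the query through `run` and return plain dictionaries."""
    records = run(
        ACCESS_PATHS_QUERY,
        {"username": username, "table_full_name": table_full_name},
    )
    return [{"hops": record["hops"], "via": record["via"]} for record in records]


def canned_run(query, parameters):
    """Stand-in for session.run so the sketch runs without a database."""
    return [
        {
            "hops": ["alice", "data-engineers", "analytics"],
            "via": ["MEMBER_OF", "HAS_PRIVILEGE"],
        }
    ]


for row in find_access_paths(canned_run, "alice", "analytics.sales.orders"):
    print(row["hops"], row["via"])
```

With a real connection, `session.run` would be passed in place of `canned_run`. Passing values as parameters rather than formatting them into the query text keeps user-supplied names out of the Cypher string.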

Impact Analysis

"Which users would lose access if we remove a group's privilege?"

Identifies all users who depend on a specific group for access to resources.
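
A subtle point is that a user only loses access if the group is the sole route to the resource. The sketch below shows that filtering step in plain Python, assuming the graph query has already returned one row per route. The `(username, asset, via)` row shape and the `users_losing_access` helper are hypothetical, with `via` holding the name of the group that owns the grant, or `None` for a direct grant.

```python
from collections import defaultdict


def users_losing_access(routes, group_name):
    """Return (username, asset) pairs whose only route goes through group_name."""
    vias_by_pair = defaultdict(set)
    for username, asset, via in routes:
        vias_by_pair[(username, asset)].add(via)
    # A pair is impacted only when no alternative route remains.
    return sorted(
        pair for pair, vias in vias_by_pair.items() if vias == {group_name}
    )


routes = [
    ("alice", "analytics.sales.orders", "data-engineers"),
    ("bob", "analytics.sales.orders", "data-engineers"),
    ("bob", "analytics.sales.orders", None),  # bob also has a direct grant
    ("carol", "analytics.sales.orders", "analysts"),
]

print(users_losing_access(routes, "data-engineers"))
# [('alice', 'analytics.sales.orders')]
```

Here bob keeps access through his direct grant and carol is unaffected, so only alice is reported.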

Privilege Overview

"What tables can User X access?"

Lists all tables reachable by a user through any combination of direct grants and group memberships.
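
A possible shape for this query is sketched below as a small builder function. The Cypher text, the `accessible_tables_query` name, and the depth limits are assumptions for illustration. The membership depth is formatted into the query text because Cypher does not accept a parameter as a variable-length bound, so it is validated as an integer first; the username still travels as a parameter.

```python
def accessible_tables_query(username, max_group_depth=5):
    """Build Cypher text and parameters listing tables a user can reach."""
    depth = int(max_group_depth)
    if not 0 <= depth <= 10:
        raise ValueError("max_group_depth must be between 0 and 10")
    # *0..N includes the user itself, so direct grants are covered too.
    query = f"""
    MATCH (u:User {{username: $username}})-[:MEMBER_OF*0..{depth}]->(principal)
    MATCH (principal)-[grant:HAS_PRIVILEGE]->(asset)
    MATCH (asset)-[:CONTAINS_SCHEMA|CONTAINS_TABLE*0..2]->(t:Table)
    RETURN DISTINCT t.fullName AS table_name,
           grant.privilege AS privilege,
           coalesce(principal.username, principal.name) AS granted_to
    ORDER BY table_name
    """
    return query, {"username": username}


query, parameters = accessible_tables_query("alice", max_group_depth=3)
print(query)
print(parameters)
```

As with the other sketches, this treats a grant on a catalog or schema as reaching every table beneath it, which is a simplification of how Unity Catalog evaluates privileges.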


Architecture

Component Overview

┌─────────────────────────────────────────────────────────────┐
│                   Table Access Audit Tool                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Data Collection Layer                   │    │
│  │                                                      │    │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │    │
│  │  │  Identity  │  │  Catalog   │  │ Permission │     │    │
│  │  │  Collector │  │  Scanner   │  │  Resolver  │     │    │
│  │  │            │  │            │  │            │     │    │
│  │  │  Users     │  │  Catalogs  │  │  Grants    │     │    │
│  │  │  Groups    │  │  Schemas   │  │  Ownership │     │    │
│  │  │            │  │  Tables    │  │            │     │    │
│  │  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘     │    │
│  │        └───────────────┼───────────────┘            │    │
│  └────────────────────────┼────────────────────────────┘    │
│                           ▼                                  │
│  ┌─────────────────────────────────────────────────────┐    │
│  │               Neo4j Sync Layer                       │    │
│  │                                                      │    │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │    │
│  │  │ Connection │  │   Graph    │  │   Query    │     │    │
│  │  │  Manager   │  │   Writer   │  │  Executor  │     │    │
│  │  └────────────┘  └────────────┘  └────────────┘     │    │
│  │                                                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                    CLI Layer                         │    │
│  │                                                      │    │
│  │    sync     query     user-access     table-access   │    │
│  │                                                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Data Flow

  1. Collection: Databricks SDK client collects identity, catalog, and permission data
  2. Sync: Graph writer loads data into Neo4j using MERGE operations
  3. Query: Query executor runs Cypher queries for analysis

Neo4j Deployment

For this demo, we target:

| Option | Use Case |
|---|---|
| Neo4j Desktop | Local development |
| Neo4j AuraDB Free | Cloud-hosted evaluation (200K nodes) |
| Docker | CI/CD and team testing |

Synchronization Strategy

Full Sync

The demo implements full sync only. Each run completely refreshes the Neo4j graph from Databricks:

  1. Clear existing graph - Delete all nodes and relationships
  2. Collect from Databricks - Gather users, groups, catalogs, schemas, tables, and grants
  3. Write to Neo4j - Create all nodes and relationships

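This approach is simple and, because Databricks is the source of truth, leaves the graph matching the Databricks state as of the most recent sync. Changes made in Databricks between runs do not appear in the graph until the next sync.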

Sync Process

  1. Verify connectivity to both Databricks and Neo4j
  2. Clear the graph (delete all existing data)
  3. Sync identities - Create User, Group, ServicePrincipal nodes
  4. Sync group memberships - Create MEMBER_OF relationships
  5. Sync catalog structure - Create Catalog, Schema, Table nodes and CONTAINS_SCHEMA / CONTAINS_TABLE relationships
  6. Sync permissions - Create HAS_PRIVILEGE and OWNS relationships
  7. Log summary - Report counts of entities synced
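
The ordering of these steps matters: the graph is cleared before anything is written, and relationships are created only after the nodes they connect. A minimal orchestration sketch under those assumptions follows. `run_full_sync`, `SyncSummary`, and the callables passed in are hypothetical stand-ins, not the tool's actual classes.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("full_sync_sketch")

# Cypher that removes every node together with its relationships.
CLEAR_GRAPH_QUERY = "MATCH (n) DETACH DELETE n"


@dataclass
class SyncSummary:
    counts: dict = field(default_factory=dict)


def run_full_sync(verify, clear_graph, phases):
    """Run the sync steps in order; each phase returns the count it wrote."""
    verify()        # step 1: fail fast if either system is unreachable
    clear_graph()   # step 2: e.g. execute CLEAR_GRAPH_QUERY
    summary = SyncSummary()
    for name, phase in phases:  # steps 3-6, in dependency order
        summary.counts[name] = phase()
        logger.info("synced %s: %d", name, summary.counts[name])
    logger.info("sync complete: %s", summary.counts)  # step 7
    return summary


events = []


def fake_phase(name, count):
    """Build a phase that records its name instead of touching a database."""
    def phase():
        events.append(name)
        return count
    return phase


summary = run_full_sync(
    verify=lambda: events.append("verify"),
    clear_graph=lambda: events.append("clear"),
    phases=[
        ("identities", fake_phase("identities", 3)),
        ("memberships", fake_phase("memberships", 4)),
        ("catalog", fake_phase("catalog", 6)),
        ("permissions", fake_phase("permissions", 5)),
    ],
)
print(events)
print(summary.counts)
```

Because `verify` runs before `clear_graph`, a connectivity failure raises before any existing data is deleted.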

Implementation Plan

Phase 1: Neo4j Foundation ✅ COMPLETED

Objective: Establish Neo4j connectivity and basic schema

Status: Complete

Implementation Details:

  • Added neo4j, pydantic, pydantic-settings, and python-dotenv to project dependencies
  • Created settings.py module with Pydantic settings for Neo4j configuration (loads from .env)
  • Created graph/ package with connection manager and schema modules
  • Implemented Neo4jConnection class with context manager support, connection pooling, and query execution
  • Implemented GraphSchema class with constraint definitions for all node types
  • Added CLI commands: graph-test, graph-init, graph-status
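
As an illustration of what the constraint definitions might look like, the sketch below generates Neo4j 5 uniqueness constraints for each node label. The choice of key property per label is an assumption, since this proposal does not state which property is constrained, and `UNIQUE_KEYS` and `constraint_statements` are hypothetical names.

```python
# Assumed unique key per node label (not specified in the proposal).
UNIQUE_KEYS = {
    "User": "id",
    "Group": "id",
    "ServicePrincipal": "id",
    "Catalog": "name",
    "Schema": "fullName",
    "Table": "fullName",
}


def constraint_statements(unique_keys=UNIQUE_KEYS):
    """Build one idempotent CREATE CONSTRAINT statement per label."""
    statements = []
    for label, prop in unique_keys.items():
        name = f"{label.lower()}_{prop.lower()}_unique"
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return statements


for statement in constraint_statements():
    print(statement)
```

The `IF NOT EXISTS` clause lets schema initialization run repeatedly without failing when the constraints are already present.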

Files Created:

  • src/table_access_audit/settings.py - Pydantic settings for Neo4j and Databricks
  • src/table_access_audit/graph/__init__.py - Graph package exports
  • src/table_access_audit/graph/connection.py - Neo4j connection manager
  • src/table_access_audit/graph/schema.py - Graph schema and constraints

CLI Commands Added:

# Test Neo4j connection
uv run table-access-audit graph-test

# Initialize graph schema (create constraints)
uv run table-access-audit graph-init

# Show graph status and statistics
uv run table-access-audit graph-status

Todo List

  • Add neo4j Python driver to project dependencies
  • Create Neo4j connection manager with environment variable configuration
  • Implement schema initialization to create uniqueness constraints
  • Write basic connection test
  • Code review and testing

Phase 2: Graph Sync Implementation ✅ COMPLETED

Objective: Implement full sync from Databricks to Neo4j

Status: Complete

Implementation Details:

  • Created GraphWriter class with MERGE operations for all node and relationship types
  • Implemented GraphSync orchestrator that coordinates the full sync process
  • Full sync clears graph, then syncs: identities → memberships → catalog structure → permissions
  • Progress logging during each sync phase
  • Support for catalog filtering and system schema exclusion
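
The MERGE-based writes could be batched roughly as follows. This is a sketch under assumptions rather than the GraphWriter implementation: the query text, the row keys (`member_id`, `group_id`), the batch size, and the helper names are all hypothetical. `run` stands for any callable shaped like the driver's `Session.run(query, parameters)`.

```python
# UNWIND sends many rows in one round trip; MERGE keeps the write idempotent.
MERGE_USERS_QUERY = """
UNWIND $rows AS row
MERGE (u:User {id: row.id})
SET u.username = row.username,
    u.displayName = row.displayName,
    u.email = row.email
"""

# Members may be users or groups, so the member label is checked in WHERE.
MERGE_MEMBERSHIPS_QUERY = """
UNWIND $rows AS row
MATCH (m) WHERE (m:User OR m:Group) AND m.id = row.member_id
MATCH (g:Group {id: row.group_id})
MERGE (m)-[:MEMBER_OF]->(g)
"""


def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows from a list."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def write_in_batches(run, query, rows, batch_size=500):
    """Send rows to `run` in batches and return how many rows were sent."""
    written = 0
    for batch in chunked(rows, batch_size):
        run(query, {"rows": batch})
        written += len(batch)
    return written


# Dry run: record batch sizes instead of talking to Neo4j.
sent = []
users = [{"id": str(i), "username": f"user{i}"} for i in range(5)]
total = write_in_batches(
    lambda query, parameters: sent.append(len(parameters["rows"])),
    MERGE_USERS_QUERY,
    users,
    batch_size=2,
)
print(total, sent)  # 5 [2, 2, 1]
```

Since the full sync clears the graph first, MERGE here mainly protects against duplicate rows within a single run.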

Files Created:

  • src/table_access_audit/graph/writer.py - GraphWriter with MERGE operations for nodes and relationships
  • src/table_access_audit/graph/sync.py - GraphSync orchestrator and SyncResult dataclass

CLI Commands Added:

# Full sync from Databricks to Neo4j
uv run table-access-audit sync

# Sync only a specific catalog
uv run table-access-audit sync --catalog <catalog_name>

# Include system catalog and information_schema
uv run table-access-audit sync --include-system

Sync Process:

  1. Initialize Neo4j schema (create constraints if needed)
  2. Clear existing graph data
  3. Sync identities: Users, Groups, Service Principals
  4. Sync group memberships: MEMBER_OF relationships
  5. Sync catalog structure: Catalogs → Schemas → Tables with CONTAINS_SCHEMA and CONTAINS_TABLE relationships
  6. Sync permissions: HAS_PRIVILEGE and OWNS relationships
  7. Report summary with counts

Todo List

  • Create graph writer module with MERGE operations for nodes
  • Implement identity sync (users, groups, service principals)
  • Implement group membership sync (MEMBER_OF relationships)
  • Implement catalog structure sync (catalogs, schemas, tables)
  • Implement permission sync (HAS_PRIVILEGE, OWNS relationships)
  • Create full sync orchestrator that clears graph and runs all syncs
  • Add progress logging during sync
  • Code review and testing

Phase 3: Query Implementation ✅ COMPLETED

Objective: Create essential Cypher queries for access analysis

Status: Complete

Implementation Details:

  • Created QueryExecutor class with parameterized Cypher query support
  • Pydantic models for query results: TableAccess, PrincipalAccess, AccessPath, ImpactedUser, QueryResult
  • Transitive group membership traversal using the MEMBER_OF*1..5 pattern (one to five membership hops)
  • Combined direct and group-based access in unified results
  • Graph statistics query for monitoring

Files Created:

  • src/table_access_audit/graph/queries.py - QueryExecutor and Pydantic result models

Query Methods:

| Method | Description |
|---|---|
| get_user_accessible_tables(username, catalog_filter) | All tables a user can access (direct + via groups) |
| get_table_access_list(table_full_name) | All principals with access to a table |
| get_access_paths(username, table_full_name) | All paths from user to table |
| get_group_impact(group_name) | Users impacted by removing group's access |
| get_graph_statistics() | Node and relationship counts |
| find_users_with_privilege(privilege, securable_type) | Find users with specific privilege |
| execute_custom_query(query, parameters) | Run arbitrary Cypher queries |

Updated local_demo.py with query options:

uv run src/local_demo.py                          # Full sync + demo queries
uv run src/local_demo.py --query-only             # Skip sync, run demo queries
uv run src/local_demo.py --stats                  # Show graph statistics
uv run src/local_demo.py --user-access USER       # Tables accessible by user
uv run src/local_demo.py --table-access TABLE     # Who can access a table
uv run src/local_demo.py --paths USER TABLE       # Access paths from user to table
uv run src/local_demo.py --group-impact GROUP     # Users impacted by group

Todo List

  • Create query executor with parameterized query support
  • Implement "user access" query - all tables a user can access
  • Implement "table access" query - all principals with access to a table
  • Implement "access paths" query - paths from user to table
  • Implement "group impact" query - users affected by group permission change
  • Code review and testing

Phase 4: CLI Integration

Objective: Add graph commands to existing CLI

Todo List

  • Add sync command to run full sync (completed in Phase 2)
  • Add user-access command to show tables accessible by a user
  • Add table-access command to show who can access a table
  • Add paths command to show access paths from user to table
  • Update CLI help documentation
  • Code review and testing

Phase 5: Demo Polish

Objective: Prepare for demonstration

Todo List

  • Create example queries document for Neo4j Browser exploration
  • Write setup instructions for Neo4j Desktop and AuraDB Free
  • Create demo script showing key capabilities (src/local_demo.py)
  • Test end-to-end flow with real Databricks workspace
  • Code review and testing

Dependencies

Required Python Packages

| Package | Purpose |
|---|---|
| neo4j | Official Neo4j Python driver |
| databricks-sdk | Databricks API client (existing) |
| pydantic | Data validation and serialization |
| pydantic-settings | Settings management with environment variable loading |
| python-dotenv | Load environment variables from .env files |

Infrastructure

| Component | Requirement |
|---|---|
| Neo4j | Version 5.x (Desktop, AuraDB Free, or Docker) |
| Python | 3.11+ |

Configuration

Environment Variables

| Variable | Description | Example |
|---|---|---|
| NEO4J_URI | Neo4j connection URI | neo4j://localhost:7687 |
| NEO4J_USERNAME | Neo4j username | neo4j |
| NEO4J_PASSWORD | Neo4j password | password |
| DATABRICKS_HOST | Databricks workspace URL (existing) | |
| DATABRICKS_TOKEN | Databricks PAT (existing) | |
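
To show how the Neo4j variables are consumed, here is a standard-library sketch that reads and validates them. The tool itself is described as using pydantic-settings, so this is a simplified stand-in; `Neo4jConfig` and `load_neo4j_config` are hypothetical names.

```python
import os
from dataclasses import dataclass

REQUIRED_VARIABLES = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


@dataclass(frozen=True)
class Neo4jConfig:
    uri: str
    username: str
    password: str


def load_neo4j_config(env=None):
    """Read the Neo4j settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Neo4jConfig(
        uri=env["NEO4J_URI"],
        username=env["NEO4J_USERNAME"],
        password=env["NEO4J_PASSWORD"],
    )


# Example values taken from the table above, passed explicitly for the demo.
config = load_neo4j_config(
    {
        "NEO4J_URI": "neo4j://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "password",
    }
)
print(config.uri, config.username)
# With the official driver installed, a connection could then be opened with:
# neo4j.GraphDatabase.driver(config.uri, auth=(config.username, config.password))
```

Failing with a message that names the missing variables makes a misconfigured .env file easier to diagnose than a later connection error.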

Success Criteria

The demo is successful when:

  1. Full sync completes and populates the Neo4j graph
  2. Graph can be explored visually in Neo4j Browser
  3. CLI queries return correct access information
  4. Access paths from user to table are discoverable

Future Enhancements

After the demo, potential enhancements include:

  • Incremental sync for efficiency
  • More query types (orphaned permissions, compliance reports)
  • Export to CSV/JSON
  • Temporal snapshots for audit history
  • Integration with Neo4j Bloom for visual exploration
