Skip to content

Latest commit

 

History

History
523 lines (414 loc) · 21.8 KB

File metadata and controls

523 lines (414 loc) · 21.8 KB

Unity Catalog Metadata Synchronization for Neo4j

What Is Metadata Synchronization?

When Databricks refers to "Unity Catalog metadata synchronization," they mean making the schema structure of an external data source (Neo4j) visible as first-class objects within Unity Catalog's three-level namespace (catalog.schema.table). This enables:

  • Discoverability: Users browse Neo4j node labels and relationship types in Catalog Explorer alongside Delta tables
  • Governance: Fine-grained access control (USE_CATALOG, USE_SCHEMA, SELECT) on Neo4j objects
  • Lineage tracking: Data lineage from Neo4j through downstream Spark jobs and dashboards
  • Schema documentation: Column types, descriptions, and tags managed centrally in UC

This is distinct from what the existing JDBC connection provides. The JDBC connection (CREATE CONNECTION ... TYPE JDBC) registers credentials and a driver — it does not expose Neo4j's schema as browsable UC objects.


How This Differs from the Existing JDBC Integration

Aspect JDBC Connection (Current) Metadata Sync (Needed)
What's registered in UC A CONNECTION object only Catalog + schemas + tables + columns as UC objects
Schema browsing in Catalog Explorer No Yes
Access control granularity Connection-level only Table-level and column-level
Lineage tracking No Yes
Query model spark.read.format('jdbc').option('databricks.connection', ...) Could also support SELECT * FROM neo4j_catalog.graph.persons

Why Neo4j Requires a Custom Solution

Databricks provides automatic metadata sync for Lakehouse Federation sources — but only for these databases: MySQL, PostgreSQL, Oracle, Teradata, Redshift, SQL Server, Azure Synapse, BigQuery, Snowflake, Salesforce Data Cloud, and Databricks-to-Databricks.

For federated sources, Unity Catalog syncs the remote schema on every interaction:

"Foreign catalog metadata is synced into Unity Catalog on each interaction with the catalog."

Neo4j is not on the supported list. Additionally, the JDBC connection type explicitly does not support foreign catalogs:

"Foreign catalogs are not supported with JDBC connections."

This means we must build a custom metadata sync pipeline.


Neo4j Schema Introspection

Neo4j provides several methods to extract schema metadata. Here are the recommended approaches, ordered by richness:

Primary: apoc.meta.schema() (APOC Core, Richest Single Call)

CALL apoc.meta.schema({sample: -1})
YIELD value

Returns a nested map containing:

  • Node labels with counts
  • Property names and types (uses Cypher 5 canonical names: STRING, INTEGER, FLOAT)
  • Indexed/unique/existence flags per property
  • Relationship types with direction, connectivity, and counts
  • Relationship properties

Example output structure:

{
  "Aircraft": {
    "type": "node",
    "count": 60,
    "properties": {
      "aircraft_id": {"type": "STRING", "indexed": true, "unique": true, "existence": false},
      "manufacturer": {"type": "STRING", "indexed": false, "unique": false, "existence": false},
      "model": {"type": "STRING", "indexed": false, "unique": false, "existence": false},
      "icao24": {"type": "STRING", "indexed": false, "unique": false, "existence": false}
    },
    "relationships": {
      "DEPARTS_FROM": {
        "direction": "out",
        "count": 2400,
        "labels": ["Airport"],
        "properties": {}
      }
    }
  }
}

Requires: APOC Core plugin installed on the Neo4j instance.

Fallback: Built-in Procedures (No APOC Required)

-- Node label properties
CALL db.schema.nodeTypeProperties()
-- Returns: nodeType, nodeLabels, propertyName, propertyTypes, mandatory

-- Relationship type properties
CALL db.schema.relTypeProperties()
-- Returns: relType, propertyName, propertyTypes, mandatory

These use older internal type names (String, Long, Double) instead of Cypher 5 canonical names.

Note: Some documentation references sourceNodeLabels and targetNodeLabels as output columns for db.schema.relTypeProperties(), but these are not available on Neo4j 5.x Aura (verified on 5.27-aura). To discover relationship source/target patterns, use apoc.meta.schema() or db.schema.visualization() instead.

Constraints and Indexes

-- Constraints (uniqueness, existence, type, key)
SHOW CONSTRAINTS YIELD *

-- Indexes (range, fulltext, text, point, vector)
SHOW INDEXES YIELD *

Recommended Combination

  1. CALL apoc.meta.schema({sample: -1}) — property names, types, indexed/unique flags, counts
  2. SHOW CONSTRAINTS YIELD * — precise constraint types and mandatory enforcement
  3. SHOW INDEXES YIELD * — standalone (non-constraint) indexes

Neo4j to Unity Catalog Type Mapping

Neo4j Type (Internal) Neo4j Cypher 5 Type UC/Spark SQL Type Notes
String STRING STRING No length limit in Neo4j
Long INTEGER BIGINT 64-bit signed
Double FLOAT DOUBLE 64-bit IEEE 754
Boolean BOOLEAN BOOLEAN
Date DATE DATE
LocalDateTime LOCAL DATETIME TIMESTAMP_NTZ
DateTime ZONED DATETIME TIMESTAMP
LocalTime LOCAL TIME STRING Serialize as ISO string
Time ZONED TIME STRING No direct SQL equivalent
Duration DURATION STRING Serialize as ISO 8601 duration
Point POINT STRING Serialize as WKT or JSON
StringArray LIST<STRING> ARRAY<STRING>
LongArray LIST<INTEGER> ARRAY<BIGINT>
DoubleArray LIST<FLOAT> ARRAY<DOUBLE>
BooleanArray LIST<BOOLEAN> ARRAY<BOOLEAN>
ByteArray (byte array) BINARY

Graph-to-Relational Mapping Strategy

Neo4j's graph model (labels, relationships, properties) must be mapped to UC's relational model (catalogs, schemas, tables, columns).

Proposed Mapping

Unity Catalog                  Neo4j
─────────────                  ─────
Catalog: neo4j_catalog    →    Neo4j database
  Schema: nodes           →    Node labels namespace
    Table: aircraft        →    :Aircraft label
      Column: aircraft_id  →    aircraft_id property (STRING)
      Column: manufacturer →    manufacturer property (STRING)
    Table: flight          →    :Flight label
    Table: airport         →    :Airport label
  Schema: relationships   →    Relationship types namespace
    Table: departs_from    →    :DEPARTS_FROM type
      Column: source_id    →    Start node identifier
      Column: target_id    →    End node identifier
      Column: ...          →    Relationship properties

Design Decisions

  1. One table per node label: Each Neo4j label maps to a UC table in the nodes schema
  2. One table per relationship type: Each relationship type maps to a UC table in the relationships schema, with source_id and target_id columns plus any relationship properties
  3. Properties become columns: Neo4j properties map to UC columns with types from the mapping table above
  4. Multi-label nodes: Nodes with multiple labels (e.g., :Person:Actor) appear in tables for each label
  5. Naming convention: UC identifiers are lowercase; Neo4j labels/types are converted accordingly (DEPARTS_FROMdeparts_from)

Implementation Approaches

There are three approaches to register Neo4j metadata in Unity Catalog, each with different tradeoffs.

Approach 1: UC Tables API (Register as External Tables)

Register Neo4j labels/relationships as EXTERNAL or FOREIGN table entries in UC using the Tables REST API.

API Endpoint: POST /api/2.1/unity-catalog/tables

{
  "name": "aircraft",
  "catalog_name": "neo4j_catalog",
  "schema_name": "nodes",
  "table_type": "EXTERNAL",
  "data_source_format": "JSON",
  "columns": [
    {"name": "aircraft_id", "type_name": "STRING", "nullable": false, "position": 0, "comment": "Unique aircraft identifier"},
    {"name": "manufacturer", "type_name": "STRING", "nullable": true, "position": 1},
    {"name": "model", "type_name": "STRING", "nullable": true, "position": 2},
    {"name": "icao24", "type_name": "STRING", "nullable": true, "position": 3}
  ],
  "properties": {
    "neo4j.label": "Aircraft",
    "neo4j.database": "neo4j",
    "neo4j.node_count": "60",
    "neo4j.connection": "neo4j_connection"
  },
  "comment": "Neo4j :Aircraft node label (60 nodes)"
}

Pros:

  • Tables appear in Catalog Explorer and INFORMATION_SCHEMA.TABLES / INFORMATION_SCHEMA.COLUMNS
  • Supports fine-grained access control (GRANT SELECT ON TABLE)
  • Column types, nullability, and comments are fully preserved
  • Can be queried via SQL INFORMATION_SCHEMA

Cons:

  • EXTERNAL tables require a storage_location — would need a workaround or placeholder
  • May not support actual query routing back to Neo4j without additional plumbing
  • Table metadata becomes stale if Neo4j schema changes (requires re-sync)

Prerequisite steps:

# 1. Create catalog
POST /api/2.1/unity-catalog/catalogs
{"name": "neo4j_catalog", "comment": "Neo4j graph database metadata"}

# 2. Create schemas
POST /api/2.1/unity-catalog/schemas
{"name": "nodes", "catalog_name": "neo4j_catalog", "comment": "Neo4j node labels"}

POST /api/2.1/unity-catalog/schemas
{"name": "relationships", "catalog_name": "neo4j_catalog", "comment": "Neo4j relationship types"}

# 3. Register tables (one per label/relationship type)
POST /api/2.1/unity-catalog/tables
{...}  # As shown above

Approach 2: External Metadata API (Public Preview)

Register Neo4j objects as external metadata entries, designed for systems outside the standard relational model.

API Endpoint: POST /api/2.0/lineage-tracking/external-metadata

{
  "name": "Aircraft",
  "system_type": "OTHER",
  "entity_type": "NodeLabel",
  "description": "Aircraft node label in Neo4j (60 nodes, 4 properties)",
  "columns": ["aircraft_id", "manufacturer", "model", "icao24"],
  "url": "neo4j+s://your-host:7687",
  "properties": {
    "neo4j.database": "neo4j",
    "neo4j.label": "Aircraft",
    "neo4j.node_count": "60",
    "neo4j.property.aircraft_id.type": "STRING",
    "neo4j.property.aircraft_id.indexed": "true",
    "neo4j.property.aircraft_id.unique": "true",
    "neo4j.property.manufacturer.type": "STRING",
    "neo4j.property.model.type": "STRING",
    "neo4j.property.icao24.type": "STRING"
  }
}

Pros:

  • Designed specifically for external systems that don't fit the relational model
  • Supports arbitrary entity_type values (e.g., NodeLabel, RelationshipType)
  • Free-form properties map for rich metadata
  • Supports lineage tracking via the External Lineage API

Cons:

  • columns is Array<string> — no type info per column (types must go in properties)
  • Does not appear in INFORMATION_SCHEMA.TABLES — only queryable via REST API
  • No NEO4J enum for system_type — must use OTHER
  • Public Preview — API may change
  • No direct query routing to Neo4j

Approach 3: Materialized Delta Tables (Data Copy)

Use the Neo4j Spark Connector to read data and write it as managed Delta tables in UC. Metadata is registered automatically.

# Read from Neo4j
df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", "neo4j+s://your-host") \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("labels", ":Aircraft") \
    .load()

# Write to UC as a managed Delta table
df.write.mode("overwrite").saveAsTable("neo4j_catalog.nodes.aircraft")

Pros:

  • Full UC integration — tables, columns, types, statistics, lineage all work automatically
  • Data is queryable via standard SQL (SELECT * FROM neo4j_catalog.nodes.aircraft)
  • Delta features (time travel, ACID transactions, optimization) apply
  • No custom sync code for metadata — schema is inferred from the DataFrame

Cons:

  • Copies data — not a live view of Neo4j
  • Requires scheduled refresh jobs to keep data current
  • Storage cost for duplicated data
  • Requires Single user access mode clusters (Neo4j Spark Connector limitation)
  • Loses graph structure — traversals require SQL JOINs

Recommended Architecture

A hybrid approach combining metadata registration with query routing:

┌──────────────────────────────────────────────────────┐
│                    Unity Catalog                      │
│                                                       │
│  neo4j_catalog                                        │
│  ├── nodes (schema)                                   │
│  │   ├── aircraft  ← metadata registered via API      │
│  │   ├── flight    ← metadata registered via API      │
│  │   └── airport   ← metadata registered via API      │
│  └── relationships (schema)                           │
│      ├── departs_from  ← metadata registered via API  │
│      └── arrives_at    ← metadata registered via API  │
│                                                       │
│  neo4j_connection  ← JDBC connection (existing)       │
└──────────────┬───────────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │   Sync Pipeline      │
    │   (Databricks Job)   │
    │                      │
    │  1. Introspect Neo4j │
    │  2. Map to UC schema │
    │  3. Register via API │
    │  4. Optionally       │
    │     materialize data │
    └──────────┬──────────┘
               │
    ┌──────────┴──────────┐
    │      Neo4j           │
    │                      │
    │  :Aircraft           │
    │  :Flight             │
    │  :Airport            │
    │  [:DEPARTS_FROM]     │
    │  [:ARRIVES_AT]       │
    └─────────────────────┘

Sync Pipeline Steps

  1. Introspect Neo4j schema — Call apoc.meta.schema() + SHOW CONSTRAINTS via the Neo4j Python driver or JDBC
  2. Map to UC model — Convert labels → tables, properties → columns, Neo4j types → UC types
  3. Register metadata in UC — Use Tables REST API or External Metadata API
  4. Optionally materialize data — Write high-value labels as Delta tables for SQL querying
  5. Schedule periodic sync — Run as a Databricks job to detect schema changes

Sync Pipeline Implementation Sketch

from neo4j import GraphDatabase
import requests
import json

# --- Step 1: Introspect Neo4j ---

def get_neo4j_schema(driver):
    """Extract full schema from Neo4j using APOC."""
    with driver.session() as session:
        result = session.run("CALL apoc.meta.schema({sample: -1})")
        record = result.single()
        return record["value"]

def get_neo4j_constraints(driver):
    """Get constraints for precise type and uniqueness info."""
    with driver.session() as session:
        result = session.run("SHOW CONSTRAINTS YIELD *")
        return [dict(record) for record in result]

# --- Step 2: Map to UC model ---

NEO4J_TO_UC_TYPE = {
    "STRING": "STRING",
    "INTEGER": "BIGINT",
    "FLOAT": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "LOCAL DATETIME": "TIMESTAMP_NTZ",
    "ZONED DATETIME": "TIMESTAMP",
    "LOCAL TIME": "STRING",
    "ZONED TIME": "STRING",
    "DURATION": "STRING",
    "POINT": "STRING",
    "LIST": "ARRAY<STRING>",
}

def map_neo4j_type(neo4j_type):
    """Map Neo4j property type to UC column type."""
    return NEO4J_TO_UC_TYPE.get(neo4j_type, "STRING")

def schema_to_uc_tables(neo4j_schema, constraints):
    """Convert Neo4j schema to UC table definitions."""
    tables = []
    for name, meta in neo4j_schema.items():
        if meta.get("type") != "node":
            continue
        columns = []
        for i, (prop_name, prop_meta) in enumerate(meta.get("properties", {}).items()):
            columns.append({
                "name": prop_name,
                "type_name": map_neo4j_type(prop_meta.get("type", "STRING")),
                "nullable": not prop_meta.get("existence", False),
                "position": i,
            })
        tables.append({
            "name": name.lower(),
            "table_type": "EXTERNAL",
            "columns": columns,
            "properties": {
                "neo4j.label": name,
                "neo4j.node_count": str(meta.get("count", 0)),
            },
            "comment": f"Neo4j :{name} node label ({meta.get('count', 0)} nodes)",
        })
    return tables

# --- Step 3: Register in UC ---

def register_table(workspace_url, token, catalog, schema, table_def):
    """Register a table in Unity Catalog via REST API."""
    url = f"{workspace_url}/api/2.1/unity-catalog/tables"
    payload = {
        "catalog_name": catalog,
        "schema_name": schema,
        **table_def,
    }
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=payload)
    return resp.json()

Verifying Synced Metadata

After registration, verify metadata visibility:

-- Check registered tables
SELECT table_catalog, table_schema, table_name, table_type, comment
FROM system.information_schema.tables
WHERE table_catalog = 'neo4j_catalog'
ORDER BY table_schema, table_name;

-- Check column definitions
SELECT table_name, column_name, data_type, is_nullable, comment
FROM neo4j_catalog.information_schema.columns
WHERE table_schema = 'nodes'
ORDER BY table_name, ordinal_position;

-- Check connections
SELECT * FROM system.information_schema.connections
WHERE connection_name = 'neo4j_connection';

For External Metadata API objects (if using Approach 2):

# List external metadata objects
GET /api/2.0/lineage-tracking/external-metadata

# These do NOT appear in INFORMATION_SCHEMA — REST API only

Open Questions

  1. Which registration API to use? The Tables API provides richer integration (INFORMATION_SCHEMA, access control) but requires a storage_location for EXTERNAL tables. The External Metadata API is purpose-built for external systems but has limited column type support and no INFORMATION_SCHEMA visibility. Need to clarify with Databricks which they recommend for this use case.

  2. Query routing: Registering metadata makes Neo4j objects visible in UC, but does clicking a table in Catalog Explorer route queries back to Neo4j via the JDBC connection? Or is this metadata-only with queries still requiring explicit spark.read.format('jdbc') calls? If the former, how is the mapping configured?

  3. Sync frequency: How often should the schema be re-synced? Neo4j schemas evolve dynamically (new labels/properties appear when data is written). Options: on-demand, scheduled (hourly/daily), or event-driven (Neo4j triggers).

  4. Multi-label nodes: A node with labels :Person:Actor has properties that span both labels. Should these be represented as separate tables with shared columns, or as a combined table?

  5. Relationship connectivity: Should the relationships schema tables include source_label and target_label columns, or should this be captured only in table metadata/comments?

  6. Delta materialization scope: Which labels/relationships should be materialized as Delta tables vs. metadata-only entries? Materializing everything may be impractical for large graphs.


Related Notebooks

Two notebooks in uc-neo4j-test-suite/ implement the approaches described above:

Notebook Implements What It Does
metadata_sync_delta.ipynb Approach 3 (Materialized Delta Tables) Discovers Neo4j labels and relationship types via db.schema.nodeTypeProperties() and db.schema.relTypeProperties(), reads each via the Spark Connector, writes as managed Delta tables (neo4j_metadata.nodes.* and neo4j_metadata.relationships.*), and verifies metadata in INFORMATION_SCHEMA. Requires a single-user access mode cluster with the Neo4j Spark Connector.
metadata_sync_external.ipynb Approach 2 (External Metadata API) Discovers the same Neo4j schema, then registers each node label and relationship type via the External Metadata REST API. No data is copied — metadata-only registration for discoverability and lineage. Encodes Neo4j property types in the metadata properties map. Includes optional cleanup to delete registered objects.

See METADATA_SYNC_REPORT.md for prototype results from running these notebooks.


References