When Databricks refers to "Unity Catalog metadata synchronization," they mean making the schema structure of an external data source (Neo4j) visible as first-class objects within Unity Catalog's three-level namespace (catalog.schema.table). This enables:
- Discoverability: Users browse Neo4j node labels and relationship types in Catalog Explorer alongside Delta tables
- Governance: Fine-grained access control (USE_CATALOG, USE_SCHEMA, SELECT) on Neo4j objects
- Lineage tracking: Data lineage from Neo4j through downstream Spark jobs and dashboards
- Schema documentation: Column types, descriptions, and tags managed centrally in UC
This is distinct from what the existing JDBC connection provides. The JDBC connection (`CREATE CONNECTION ... TYPE JDBC`) registers credentials and a driver — it does not expose Neo4j's schema as browsable UC objects.
| Aspect | JDBC Connection (Current) | Metadata Sync (Needed) |
|---|---|---|
| What's registered in UC | A `CONNECTION` object only | Catalog + schemas + tables + columns as UC objects |
| Schema browsing in Catalog Explorer | No | Yes |
| Access control granularity | Connection-level only | Table-level and column-level |
| Lineage tracking | No | Yes |
| Query model | `spark.read.format('jdbc').option('databricks.connection', ...)` | Could also support `SELECT * FROM neo4j_catalog.graph.persons` |
Databricks provides automatic metadata sync for Lakehouse Federation sources — but only for these databases: MySQL, PostgreSQL, Oracle, Teradata, Redshift, SQL Server, Azure Synapse, BigQuery, Snowflake, Salesforce Data Cloud, and Databricks-to-Databricks.
For federated sources, Unity Catalog syncs the remote schema on every interaction:
"Foreign catalog metadata is synced into Unity Catalog on each interaction with the catalog."
Neo4j is not on the supported list. Additionally, the JDBC connection type explicitly does not support foreign catalogs:
"Foreign catalogs are not supported with JDBC connections."
This means we must build a custom metadata sync pipeline.
Neo4j provides several methods to extract schema metadata. Here are the recommended approaches, ordered by richness:
```cypher
CALL apoc.meta.schema({sample: -1})
YIELD value
```

Returns a nested map containing:
- Node labels with counts
- Property names and types (uses Cypher 5 canonical names: `STRING`, `INTEGER`, `FLOAT`)
- Indexed/unique/existence flags per property
- Relationship types with direction, connectivity, and counts
- Relationship properties
Example output structure:

```json
{
  "Aircraft": {
    "type": "node",
    "count": 60,
    "properties": {
      "aircraft_id": {"type": "STRING", "indexed": true, "unique": true, "existence": false},
      "manufacturer": {"type": "STRING", "indexed": false, "unique": false, "existence": false},
      "model": {"type": "STRING", "indexed": false, "unique": false, "existence": false},
      "icao24": {"type": "STRING", "indexed": false, "unique": false, "existence": false}
    },
    "relationships": {
      "DEPARTS_FROM": {
        "direction": "out",
        "count": 2400,
        "labels": ["Airport"],
        "properties": {}
      }
    }
  }
}
```

Requires: APOC Core plugin installed on the Neo4j instance.
```cypher
-- Node label properties
CALL db.schema.nodeTypeProperties()
-- Returns: nodeType, nodeLabels, propertyName, propertyTypes, mandatory

-- Relationship type properties
CALL db.schema.relTypeProperties()
-- Returns: relType, propertyName, propertyTypes, mandatory
```

These procedures use older internal type names (`String`, `Long`, `Double`) instead of Cypher 5 canonical names.
Note: Some documentation references `sourceNodeLabels` and `targetNodeLabels` as output columns for `db.schema.relTypeProperties()`, but these are not available on Neo4j 5.x Aura (verified on 5.27-aura). To discover relationship source/target patterns, use `apoc.meta.schema()` or `db.schema.visualization()` instead.
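One workaround: the endpoint labels are recoverable from the `apoc.meta.schema()` map itself. A minimal sketch (the function name is ours; the sample mirrors the example output structure above):

```python
def relationship_endpoints(meta_schema):
    """Derive (source_label, rel_type, target_label) triples from the
    apoc.meta.schema() map, since db.schema.relTypeProperties() does not
    expose endpoints on Aura. Outgoing entries only, to avoid duplicates."""
    triples = []
    for label, meta in meta_schema.items():
        if meta.get("type") != "node":
            continue
        for rel_type, rel_meta in meta.get("relationships", {}).items():
            if rel_meta.get("direction") == "out":
                for target in rel_meta.get("labels", []):
                    triples.append((label, rel_type, target))
    return sorted(triples)

# Shape mirrors the apoc.meta.schema() example output above
sample = {
    "Aircraft": {
        "type": "node",
        "relationships": {"DEPARTS_FROM": {"direction": "out", "labels": ["Airport"]}},
    },
}
print(relationship_endpoints(sample))  # [('Aircraft', 'DEPARTS_FROM', 'Airport')]
```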
```cypher
-- Constraints (uniqueness, existence, type, key)
SHOW CONSTRAINTS YIELD *

-- Indexes (range, fulltext, text, point, vector)
SHOW INDEXES YIELD *
```

Recommended combination:
- `CALL apoc.meta.schema({sample: -1})` — property names, types, indexed/unique flags, counts
- `SHOW CONSTRAINTS YIELD *` — precise constraint types and mandatory enforcement
- `SHOW INDEXES YIELD *` — standalone (non-constraint) indexes
| Neo4j Type (Internal) | Neo4j Cypher 5 Type | UC/Spark SQL Type | Notes |
|---|---|---|---|
| `String` | `STRING` | `STRING` | No length limit in Neo4j |
| `Long` | `INTEGER` | `BIGINT` | 64-bit signed |
| `Double` | `FLOAT` | `DOUBLE` | 64-bit IEEE 754 |
| `Boolean` | `BOOLEAN` | `BOOLEAN` | |
| `Date` | `DATE` | `DATE` | |
| `LocalDateTime` | `LOCAL DATETIME` | `TIMESTAMP_NTZ` | |
| `DateTime` | `ZONED DATETIME` | `TIMESTAMP` | |
| `LocalTime` | `LOCAL TIME` | `STRING` | Serialize as ISO string |
| `Time` | `ZONED TIME` | `STRING` | No direct SQL equivalent |
| `Duration` | `DURATION` | `STRING` | Serialize as ISO 8601 duration |
| `Point` | `POINT` | `STRING` | Serialize as WKT or JSON |
| `StringArray` | `LIST<STRING>` | `ARRAY<STRING>` | |
| `LongArray` | `LIST<INTEGER>` | `ARRAY<BIGINT>` | |
| `DoubleArray` | `LIST<FLOAT>` | `ARRAY<DOUBLE>` | |
| `BooleanArray` | `LIST<BOOLEAN>` | `ARRAY<BOOLEAN>` | |
| `ByteArray` | (byte array) | `BINARY` | |
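Because the `db.schema.*` procedures report internal names while `apoc.meta.schema()` reports canonical ones, a sync pipeline may want to normalize before mapping to UC types. A sketch covering only the rows above (the helper name is ours):

```python
# Internal -> Cypher 5 canonical names, per the mapping table above
INTERNAL_TO_CANONICAL = {
    "String": "STRING",
    "Long": "INTEGER",
    "Double": "FLOAT",
    "Boolean": "BOOLEAN",
    "Date": "DATE",
    "LocalDateTime": "LOCAL DATETIME",
    "DateTime": "ZONED DATETIME",
    "LocalTime": "LOCAL TIME",
    "Time": "ZONED TIME",
    "Duration": "DURATION",
    "Point": "POINT",
    "StringArray": "LIST<STRING>",
    "LongArray": "LIST<INTEGER>",
    "DoubleArray": "LIST<FLOAT>",
    "BooleanArray": "LIST<BOOLEAN>",
}

def canonical_type(name):
    """Return the Cypher 5 canonical name; pass through if already canonical."""
    return INTERNAL_TO_CANONICAL.get(name, name)

print(canonical_type("Long"))     # INTEGER
print(canonical_type("INTEGER"))  # INTEGER
```

With this in front, both introspection paths feed the same Neo4j-to-UC type mapping.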
Neo4j's graph model (labels, relationships, properties) must be mapped to UC's relational model (catalogs, schemas, tables, columns).
```
Unity Catalog                                          Neo4j
─────────────                                          ─────
Catalog: neo4j_catalog              →  Neo4j database
  Schema: nodes                     →  Node labels namespace
    Table: aircraft                 →  :Aircraft label
      Column: aircraft_id           →  aircraft_id property (STRING)
      Column: manufacturer          →  manufacturer property (STRING)
    Table: flight                   →  :Flight label
    Table: airport                  →  :Airport label
  Schema: relationships             →  Relationship types namespace
    Table: departs_from             →  :DEPARTS_FROM type
      Column: source_id             →  Start node identifier
      Column: target_id             →  End node identifier
      Column: ...                   →  Relationship properties
```
- One table per node label: Each Neo4j label maps to a UC table in the `nodes` schema
- One table per relationship type: Each relationship type maps to a UC table in the `relationships` schema, with `source_id` and `target_id` columns plus any relationship properties
- Properties become columns: Neo4j properties map to UC columns with types from the mapping table above
- Multi-label nodes: Nodes with multiple labels (e.g., `:Person:Actor`) appear in tables for each label
- Naming convention: UC identifiers are lowercase; Neo4j labels/types are converted accordingly (`DEPARTS_FROM` → `departs_from`)
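These conventions are small enough to sketch directly. The helpers below are illustrative (names are ours); the camelCase word split and the `STRING` typing of `source_id`/`target_id` are our assumptions beyond the conventions above:

```python
import re

def to_uc_identifier(name):
    """Lowercase a Neo4j label/type for use as a UC identifier,
    e.g. DEPARTS_FROM -> departs_from. The camelCase word split
    (FlightLeg -> flight_leg) is our assumption, not a stated rule."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def relationship_columns(rel_properties):
    """Column list for a relationship table: source_id and target_id
    first, then any relationship properties. Types are left as Neo4j
    types here; a full pipeline would map them per the table above."""
    cols = [
        {"name": "source_id", "type_name": "STRING", "position": 0},
        {"name": "target_id", "type_name": "STRING", "position": 1},
    ]
    for i, (prop, meta) in enumerate(rel_properties.items(), start=2):
        cols.append({"name": prop, "type_name": meta.get("type", "STRING"), "position": i})
    return cols

print(to_uc_identifier("DEPARTS_FROM"))  # departs_from
```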
There are three approaches to register Neo4j metadata in Unity Catalog, each with different tradeoffs.
Register Neo4j labels/relationships as `EXTERNAL` or `FOREIGN` table entries in UC using the Tables REST API.

API Endpoint: `POST /api/2.1/unity-catalog/tables`
```json
{
  "name": "aircraft",
  "catalog_name": "neo4j_catalog",
  "schema_name": "nodes",
  "table_type": "EXTERNAL",
  "data_source_format": "JSON",
  "columns": [
    {"name": "aircraft_id", "type_name": "STRING", "nullable": false, "position": 0, "comment": "Unique aircraft identifier"},
    {"name": "manufacturer", "type_name": "STRING", "nullable": true, "position": 1},
    {"name": "model", "type_name": "STRING", "nullable": true, "position": 2},
    {"name": "icao24", "type_name": "STRING", "nullable": true, "position": 3}
  ],
  "properties": {
    "neo4j.label": "Aircraft",
    "neo4j.database": "neo4j",
    "neo4j.node_count": "60",
    "neo4j.connection": "neo4j_connection"
  },
  "comment": "Neo4j :Aircraft node label (60 nodes)"
}
```

Pros:
- Tables appear in Catalog Explorer and `INFORMATION_SCHEMA.TABLES` / `INFORMATION_SCHEMA.COLUMNS`
- Supports fine-grained access control (`GRANT SELECT ON TABLE`)
- Column types, nullability, and comments are fully preserved
- Can be queried via SQL `INFORMATION_SCHEMA`

Cons:
- `EXTERNAL` tables require a `storage_location` — would need a workaround or placeholder
- May not support actual query routing back to Neo4j without additional plumbing
- Table metadata becomes stale if Neo4j schema changes (requires re-sync)
Prerequisite steps:

```
# 1. Create catalog
POST /api/2.1/unity-catalog/catalogs
{"name": "neo4j_catalog", "comment": "Neo4j graph database metadata"}

# 2. Create schemas
POST /api/2.1/unity-catalog/schemas
{"name": "nodes", "catalog_name": "neo4j_catalog", "comment": "Neo4j node labels"}

POST /api/2.1/unity-catalog/schemas
{"name": "relationships", "catalog_name": "neo4j_catalog", "comment": "Neo4j relationship types"}

# 3. Register tables (one per label/relationship type)
POST /api/2.1/unity-catalog/tables
{...}  # As shown above
```

Register Neo4j objects as external metadata entries, designed for systems outside the standard relational model.
API Endpoint: `POST /api/2.0/lineage-tracking/external-metadata`
```json
{
  "name": "Aircraft",
  "system_type": "OTHER",
  "entity_type": "NodeLabel",
  "description": "Aircraft node label in Neo4j (60 nodes, 4 properties)",
  "columns": ["aircraft_id", "manufacturer", "model", "icao24"],
  "url": "neo4j+s://your-host:7687",
  "properties": {
    "neo4j.database": "neo4j",
    "neo4j.label": "Aircraft",
    "neo4j.node_count": "60",
    "neo4j.property.aircraft_id.type": "STRING",
    "neo4j.property.aircraft_id.indexed": "true",
    "neo4j.property.aircraft_id.unique": "true",
    "neo4j.property.manufacturer.type": "STRING",
    "neo4j.property.model.type": "STRING",
    "neo4j.property.icao24.type": "STRING"
  }
}
```

Pros:
- Designed specifically for external systems that don't fit the relational model
- Supports arbitrary `entity_type` values (e.g., `NodeLabel`, `RelationshipType`)
- Free-form `properties` map for rich metadata
- Supports lineage tracking via the External Lineage API

Cons:
- `columns` is `Array<string>` — no type info per column (types must go in `properties`)
- Does not appear in `INFORMATION_SCHEMA.TABLES` — only queryable via REST API
- No `NEO4J` enum for `system_type` — must use `OTHER`
- Public Preview — API may change
- No direct query routing to Neo4j
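Constructing the payload above from an `apoc.meta.schema()` entry is mostly a flattening exercise. A sketch following the property-key encoding shown in the example (the function name is ours):

```python
def build_external_metadata_payload(label, meta, neo4j_url, database="neo4j"):
    """Flatten one node-label entry from apoc.meta.schema() into an
    External Metadata API payload, encoding per-property detail into
    the free-form properties map (keys follow the example above)."""
    props = {
        "neo4j.database": database,
        "neo4j.label": label,
        "neo4j.node_count": str(meta.get("count", 0)),
    }
    for name, pmeta in meta.get("properties", {}).items():
        props[f"neo4j.property.{name}.type"] = pmeta.get("type", "STRING")
        if pmeta.get("indexed"):
            props[f"neo4j.property.{name}.indexed"] = "true"
        if pmeta.get("unique"):
            props[f"neo4j.property.{name}.unique"] = "true"
    n_props = len(meta.get("properties", {}))
    return {
        "name": label,
        "system_type": "OTHER",
        "entity_type": "NodeLabel",
        "description": f"{label} node label in Neo4j "
                       f"({meta.get('count', 0)} nodes, {n_props} properties)",
        "columns": list(meta.get("properties", {})),
        "url": neo4j_url,
        "properties": props,
    }
```

The returned dict can be POSTed as-is to the endpoint above; relationship types would use the same shape with `entity_type` set to `RelationshipType`.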
Use the Neo4j Spark Connector to read data and write it as managed Delta tables in UC. Metadata is registered automatically.
```python
# Read from Neo4j
df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", "neo4j+s://your-host") \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", user) \
    .option("authentication.basic.password", password) \
    .option("labels", ":Aircraft") \
    .load()

# Write to UC as a managed Delta table
df.write.mode("overwrite").saveAsTable("neo4j_catalog.nodes.aircraft")
```

Pros:
- Full UC integration — tables, columns, types, statistics, lineage all work automatically
- Data is queryable via standard SQL (`SELECT * FROM neo4j_catalog.nodes.aircraft`)
- Delta features (time travel, ACID transactions, optimization) apply
- No custom sync code for metadata — schema is inferred from the DataFrame

Cons:
- Copies data — not a live view of Neo4j
- Requires scheduled refresh jobs to keep data current
- Storage cost for duplicated data
- Requires `Single user` access mode clusters (Neo4j Spark Connector limitation)
- Loses graph structure — traversals require SQL JOINs
A hybrid approach combining metadata registration with query routing:
```
┌──────────────────────────────────────────────────────┐
│                    Unity Catalog                     │
│                                                      │
│  neo4j_catalog                                       │
│  ├── nodes (schema)                                  │
│  │   ├── aircraft      ← metadata registered via API │
│  │   ├── flight        ← metadata registered via API │
│  │   └── airport       ← metadata registered via API │
│  └── relationships (schema)                          │
│      ├── departs_from  ← metadata registered via API │
│      └── arrives_at    ← metadata registered via API │
│                                                      │
│  neo4j_connection      ← JDBC connection (existing)  │
└──────────────┬───────────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │    Sync Pipeline    │
    │  (Databricks Job)   │
    │                     │
    │ 1. Introspect Neo4j │
    │ 2. Map to UC schema │
    │ 3. Register via API │
    │ 4. Optionally       │
    │    materialize data │
    └──────────┬──────────┘
               │
    ┌──────────┴──────────┐
    │        Neo4j        │
    │                     │
    │  :Aircraft          │
    │  :Flight            │
    │  :Airport           │
    │  [:DEPARTS_FROM]    │
    │  [:ARRIVES_AT]      │
    └─────────────────────┘
```
- Introspect Neo4j schema — Call `apoc.meta.schema()` + `SHOW CONSTRAINTS` via the Neo4j Python driver or JDBC
- Map to UC model — Convert labels → tables, properties → columns, Neo4j types → UC types
- Register metadata in UC — Use the Tables REST API or the External Metadata API
- Optionally materialize data — Write high-value labels as Delta tables for SQL querying
- Schedule periodic sync — Run as a Databricks job to detect schema changes
```python
from neo4j import GraphDatabase
import requests

# --- Step 1: Introspect Neo4j ---

def get_neo4j_schema(driver):
    """Extract the full schema from Neo4j using APOC."""
    with driver.session() as session:
        result = session.run("CALL apoc.meta.schema({sample: -1})")
        record = result.single()
        return record["value"]

def get_neo4j_constraints(driver):
    """Get constraints for precise type and uniqueness info."""
    with driver.session() as session:
        result = session.run("SHOW CONSTRAINTS YIELD *")
        return [dict(record) for record in result]

# --- Step 2: Map to UC model ---

NEO4J_TO_UC_TYPE = {
    "STRING": "STRING",
    "INTEGER": "BIGINT",
    "FLOAT": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "LOCAL DATETIME": "TIMESTAMP_NTZ",
    "ZONED DATETIME": "TIMESTAMP",
    "LOCAL TIME": "STRING",
    "ZONED TIME": "STRING",
    "DURATION": "STRING",
    "POINT": "STRING",
    "LIST": "ARRAY<STRING>",
}

def map_neo4j_type(neo4j_type):
    """Map a Neo4j property type to a UC column type."""
    return NEO4J_TO_UC_TYPE.get(neo4j_type, "STRING")

def schema_to_uc_tables(neo4j_schema, constraints):
    """Convert the Neo4j schema to UC table definitions."""
    tables = []
    for name, meta in neo4j_schema.items():
        if meta.get("type") != "node":
            continue
        columns = []
        for i, (prop_name, prop_meta) in enumerate(meta.get("properties", {}).items()):
            columns.append({
                "name": prop_name,
                "type_name": map_neo4j_type(prop_meta.get("type", "STRING")),
                "nullable": not prop_meta.get("existence", False),
                "position": i,
            })
        tables.append({
            "name": name.lower(),
            "table_type": "EXTERNAL",
            "columns": columns,
            "properties": {
                "neo4j.label": name,
                "neo4j.node_count": str(meta.get("count", 0)),
            },
            "comment": f"Neo4j :{name} node label ({meta.get('count', 0)} nodes)",
        })
    return tables
```
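The sample implementation above registers node labels only. Relationship types from the same `apoc.meta.schema()` map can be converted analogously, following the `source_id`/`target_id` convention. A hedged sketch (function and helper names are ours; a trimmed type map is inlined so the sketch stands alone, but in practice `map_neo4j_type` above would be reused):

```python
# Trimmed type map so the sketch is self-contained; reuse
# NEO4J_TO_UC_TYPE / map_neo4j_type from the script above in practice.
_TYPES = {"STRING": "STRING", "INTEGER": "BIGINT", "FLOAT": "DOUBLE"}

def _uc_type(t):
    return _TYPES.get(t, "STRING")

def rels_to_uc_tables(neo4j_schema):
    """Convert relationship entries from apoc.meta.schema() into UC table
    definitions for the relationships schema. Typing source_id/target_id
    as STRING is our assumption."""
    tables = []
    for name, meta in neo4j_schema.items():
        if meta.get("type") != "relationship":
            continue
        columns = [
            {"name": "source_id", "type_name": "STRING", "nullable": False, "position": 0},
            {"name": "target_id", "type_name": "STRING", "nullable": False, "position": 1},
        ]
        for i, (prop, pmeta) in enumerate(meta.get("properties", {}).items(), start=2):
            columns.append({
                "name": prop,
                "type_name": _uc_type(pmeta.get("type", "STRING")),
                "nullable": True,
                "position": i,
            })
        tables.append({
            "name": name.lower(),
            "table_type": "EXTERNAL",
            "columns": columns,
            "properties": {"neo4j.relationship_type": name},
            "comment": f"Neo4j [:{name}] relationship type",
        })
    return tables
```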
```python
# --- Step 3: Register in UC ---

def register_table(workspace_url, token, catalog, schema, table_def):
    """Register a table in Unity Catalog via the REST API."""
    url = f"{workspace_url}/api/2.1/unity-catalog/tables"
    payload = {
        "catalog_name": catalog,
        "schema_name": schema,
        **table_def,
    }
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=payload)
    resp.raise_for_status()
    return resp.json()
```

After registration, verify metadata visibility:
```sql
-- Check registered tables
SELECT table_catalog, table_schema, table_name, table_type, comment
FROM system.information_schema.tables
WHERE table_catalog = 'neo4j_catalog'
ORDER BY table_schema, table_name;

-- Check column definitions
SELECT table_name, column_name, data_type, is_nullable, comment
FROM neo4j_catalog.information_schema.columns
WHERE table_schema = 'nodes'
ORDER BY table_name, ordinal_position;

-- Check connections
SELECT * FROM system.information_schema.connections
WHERE connection_name = 'neo4j_connection';
```

For External Metadata API objects (if using Approach 2):
```
# List external metadata objects
GET /api/2.0/lineage-tracking/external-metadata
# These do NOT appear in INFORMATION_SCHEMA — REST API only
```

- Which registration API to use? The Tables API provides richer integration (INFORMATION_SCHEMA, access control) but requires a `storage_location` for EXTERNAL tables. The External Metadata API is purpose-built for external systems but has limited column type support and no INFORMATION_SCHEMA visibility. Need to clarify with Databricks which they recommend for this use case.
- Query routing: Registering metadata makes Neo4j objects visible in UC, but does clicking a table in Catalog Explorer route queries back to Neo4j via the JDBC connection? Or is this metadata-only, with queries still requiring explicit `spark.read.format('jdbc')` calls? If the former, how is the mapping configured?
- Sync frequency: How often should the schema be re-synced? Neo4j schemas evolve dynamically (new labels/properties appear when data is written). Options: on-demand, scheduled (hourly/daily), or event-driven (Neo4j triggers).
- Multi-label nodes: A node with labels `:Person:Actor` has properties that span both labels. Should these be represented as separate tables with shared columns, or as a combined table?
- Relationship connectivity: Should the `relationships` schema tables include `source_label` and `target_label` columns, or should this be captured only in table metadata/comments?
- Delta materialization scope: Which labels/relationships should be materialized as Delta tables vs. metadata-only entries? Materializing everything may be impractical for large graphs.
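On the sync-frequency question, a scheduled job can at least detect when a re-sync is warranted by diffing two `apoc.meta.schema()` snapshots. A minimal sketch (the function name is ours):

```python
def diff_schemas(old, new):
    """Compare two apoc.meta.schema() snapshots and report label/property
    additions and removals, to decide whether a re-sync is needed."""
    changes = []
    for label in new.keys() - old.keys():
        changes.append(("added_label", label))
    for label in old.keys() - new.keys():
        changes.append(("removed_label", label))
    for label in old.keys() & new.keys():
        old_props = set(old[label].get("properties", {}))
        new_props = set(new[label].get("properties", {}))
        for p in new_props - old_props:
            changes.append(("added_property", f"{label}.{p}"))
        for p in old_props - new_props:
            changes.append(("removed_property", f"{label}.{p}"))
    return sorted(changes)
```

An empty result means the existing UC registrations are still current and the job can skip the re-registration step.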
Two notebooks in `uc-neo4j-test-suite/` implement the approaches described above:

| Notebook | Implements | What It Does |
|---|---|---|
| `metadata_sync_delta.ipynb` | Approach 3 (Materialized Delta Tables) | Discovers Neo4j labels and relationship types via `db.schema.nodeTypeProperties()` and `db.schema.relTypeProperties()`, reads each via the Spark Connector, writes them as managed Delta tables (`neo4j_metadata.nodes.*` and `neo4j_metadata.relationships.*`), and verifies metadata in INFORMATION_SCHEMA. Requires a single-user access mode cluster with the Neo4j Spark Connector. |
| `metadata_sync_external.ipynb` | Approach 2 (External Metadata API) | Discovers the same Neo4j schema, then registers each node label and relationship type via the External Metadata REST API. No data is copied — metadata-only registration for discoverability and lineage. Encodes Neo4j property types in the metadata `properties` map. Includes optional cleanup to delete registered objects. |

See `METADATA_SYNC_REPORT.md` for prototype results from running these notebooks.