[epic] multi-catalog support #107

@zfarrell

Background

DataFusion uses a three-level catalog hierarchy (catalog → schema → table). DuckLake 1.0 is two-level (schema → table). Today this extension collapses them: one DuckLake metadata database corresponds to one DataFusion catalog. This blocks use cases that require multiple catalogs without the operational burden of maintaining parallel metadata infrastructure (for example, DuckDB's pattern of attaching the same metadata database multiple times under different METADATA_SCHEMA values).

Goals

  • Refactor the extension's metadata provider trait so catalog becomes a first-class dimension.
  • Preserve compatibility with vanilla DuckLake 1.0 metadata. When using a standard metadata provider against a standard metadata database, the extension exposes a single implicit default catalog and behaves exactly as it does today.
  • Enable a specialized metadata provider, backed by a metadata schema that is a superset of DuckLake 1.0, to expose multiple catalogs from a single metadata database.
  • Longer term, help drive an upstream DuckLake spec change for first-class catalogs. Not an immediate goal.
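A minimal sketch of how the first three goals could fit together, assuming a hypothetical trait named `CatalogAwareMetadataProvider` and a hypothetical default catalog name `"default"`. None of these names come from the extension's current code; the real trait will have a different shape and will likely be async.

```rust
use std::collections::BTreeMap;

/// Hypothetical name for the implicit catalog exposed over vanilla DuckLake 1.0 metadata.
pub const DEFAULT_CATALOG: &str = "default";

/// Hypothetical provider trait with catalog as an explicit dimension.
pub trait CatalogAwareMetadataProvider {
    /// Names of all catalogs visible through this provider.
    fn list_catalogs(&self) -> Vec<String>;
    /// Schemas in the given catalog, or `None` if the catalog does not exist.
    fn list_schemas(&self, catalog: &str) -> Option<Vec<String>>;
}

/// Standard provider: one metadata database, one implicit catalog.
pub struct StandardProvider {
    pub schemas: Vec<String>,
}

impl CatalogAwareMetadataProvider for StandardProvider {
    fn list_catalogs(&self) -> Vec<String> {
        vec![DEFAULT_CATALOG.to_string()]
    }

    fn list_schemas(&self, catalog: &str) -> Option<Vec<String>> {
        // Only the implicit default catalog resolves.
        (catalog == DEFAULT_CATALOG).then(|| self.schemas.clone())
    }
}

/// Specialized provider: several catalogs stored in one metadata database.
/// The in-memory map stands in for the superset metadata schema.
pub struct MultiCatalogProvider {
    pub catalogs: BTreeMap<String, Vec<String>>,
}

impl CatalogAwareMetadataProvider for MultiCatalogProvider {
    fn list_catalogs(&self) -> Vec<String> {
        self.catalogs.keys().cloned().collect()
    }

    fn list_schemas(&self, catalog: &str) -> Option<Vec<String>> {
        self.catalogs.get(catalog).cloned()
    }
}
```

With this shape, callers always pass a catalog name, and the standard provider keeps today's behavior by answering only for the single default catalog.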

Non-goals

  • Cross-engine interoperability of the multi-catalog functionality. A catalog-aware metadata database is not expected to be readable by DuckDB or other DuckLake implementations.
  • Standard DuckLake compatibility for every provider. Compatibility with standard DuckLake metadata applies only when using a compatible metadata provider; the multi-catalog provider is not required to also read or write standard DuckLake metadata.
  • Upstream spec changes as part of delivering this.

Design note: metadata changes are not cosmetic

A naive approach would add a ducklake_catalog table and a catalog_id column on ducklake_schema, letting scoping flow transitively through schema_id. On review, this is not sufficient.

schema_version on ducklake_snapshot is a single global counter for the entire instance. Any DDL anywhere bumps it, and cache invalidation keys off this number:

ducklake_snapshot
├── snapshot_id
├── schema_version   (one number, entire instance)
├── next_catalog_id
└── next_file_id

With multiple catalogs in one metadata database, an ALTER TABLE on a small staging table in catalog B would invalidate cached metadata for every schema in catalog A. ducklake_snapshot_lineage helps with time travel and conflict detection but does not scope schema_version per catalog.

The implication: catalog_id needs to propagate down through the metadata, at minimum into snapshot-level versioning, so cache invalidation and snapshot evolution can be scoped per catalog rather than instance-wide. The exact shape is part of this work.
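The difference between instance-wide and per-catalog invalidation can be illustrated with a small sketch. The types and function names here (`CachedSchema`, `is_stale_global`, `is_stale_scoped`) are hypothetical, and the per-catalog version map is only one possible shape for the scoped versioning described above, not a settled design.

```rust
use std::collections::HashMap;

/// Hypothetical cache entry: records the catalog it belongs to and the
/// schema version it was loaded at.
pub struct CachedSchema {
    pub catalog_id: u64,
    pub loaded_at_version: u64,
}

/// Today's model: a single schema_version for the whole instance.
/// Any DDL anywhere changes it, so every cached entry becomes stale.
pub fn is_stale_global(entry: &CachedSchema, current_schema_version: u64) -> bool {
    entry.loaded_at_version != current_schema_version
}

/// Scoped model (assumption): one version counter per catalog_id.
/// An entry is stale only if its own catalog's counter has moved,
/// or if the catalog is no longer known.
pub fn is_stale_scoped(entry: &CachedSchema, versions: &HashMap<u64, u64>) -> bool {
    versions.get(&entry.catalog_id) != Some(&entry.loaded_at_version)
}
```

In the scoped model, DDL in catalog B advances only B's counter, so entries cached for catalog A stay valid.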
