Background
DataFusion uses a three-level catalog hierarchy (catalog → schema → table). DuckLake 1.0 is two-level (schema → table). Today this extension collapses them: one DuckLake metadata database corresponds to one DataFusion catalog. This blocks use cases that need multiple catalogs but cannot take on the operational burden of maintaining parallel metadata infrastructure, such as DuckDB's pattern of attaching the same metadata database multiple times under different METADATA_SCHEMA values.
Goals
- Refactor the extension's metadata provider trait so catalog becomes a first-class dimension.
- Preserve compatibility with vanilla DuckLake 1.0 metadata. When using a standard metadata provider against a standard metadata database, the extension exposes a single implicit default catalog and behaves exactly as it does today.
- Enable a specialized metadata provider, backed by a metadata schema that is a superset of DuckLake 1.0, to expose multiple catalogs from a single metadata database.
- Longer term, help drive an upstream DuckLake spec change for first-class catalogs. Not an immediate goal.
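As a rough illustration of the first three goals, the sketch below shows one possible shape for a catalog-aware provider trait. Every name in it (MetadataProvider, StandardProvider, MultiCatalogProvider, DEFAULT_CATALOG, and the catalog name "default") is hypothetical and not taken from the extension, and both providers are in-memory stand-ins rather than real metadata queries.

```rust
use std::collections::BTreeMap;

/// Hypothetical name of the implicit catalog exposed over vanilla DuckLake 1.0 metadata.
pub const DEFAULT_CATALOG: &str = "default";

/// Hypothetical catalog-aware provider trait: every lookup takes a catalog name.
pub trait MetadataProvider {
    /// Names of all catalogs this provider exposes.
    fn catalog_names(&self) -> Vec<String>;

    /// Schemas in the given catalog, or `None` if the catalog is unknown.
    fn schema_names(&self, catalog: &str) -> Option<Vec<String>>;
}

/// Stand-in for a provider over a standard metadata database:
/// it exposes exactly one implicit catalog.
pub struct StandardProvider {
    schemas: Vec<String>,
}

impl MetadataProvider for StandardProvider {
    fn catalog_names(&self) -> Vec<String> {
        vec![DEFAULT_CATALOG.to_string()]
    }

    fn schema_names(&self, catalog: &str) -> Option<Vec<String>> {
        // Only the implicit default catalog resolves; anything else is unknown.
        (catalog == DEFAULT_CATALOG).then(|| self.schemas.clone())
    }
}

/// Stand-in for a provider over a superset metadata schema:
/// several catalogs served from one metadata database.
pub struct MultiCatalogProvider {
    catalogs: BTreeMap<String, Vec<String>>,
}

impl MetadataProvider for MultiCatalogProvider {
    fn catalog_names(&self) -> Vec<String> {
        // BTreeMap keys iterate in sorted order, so the result is deterministic.
        self.catalogs.keys().cloned().collect()
    }

    fn schema_names(&self, catalog: &str) -> Option<Vec<String>> {
        self.catalogs.get(catalog).cloned()
    }
}
```

Under this shape, callers always pass a catalog name, and the standard provider simply answers for one fixed name. That is one way to keep today's single-catalog behavior unchanged while letting a specialized provider return more than one catalog.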
Non-goals
- Cross-engine interoperability of the multi-catalog functionality. A catalog-aware metadata database is not expected to be readable by DuckDB or other DuckLake implementations.
- Standard-metadata support in the multi-catalog provider. Compatibility with standard DuckLake metadata applies only when using a compatible metadata provider; the multi-catalog provider is not required to also read or write standard DuckLake metadata.
- Upstream spec changes as part of delivering this.
Design note: metadata changes are not cosmetic
A naive approach would add a ducklake_catalog table and a catalog_id column on ducklake_schema, letting scoping flow transitively through schema_id. On review, this is not sufficient.
schema_version on ducklake_snapshot is a single global counter for the entire instance. Any DDL anywhere bumps it, and cache invalidation keys off this number:
ducklake_snapshot
├── snapshot_id
├── schema_version (one number, entire instance)
├── next_catalog_id
└── next_file_id
With multiple catalogs in one metadata database, an ALTER TABLE on a small staging table in catalog B would invalidate cached metadata for every schema in catalog A. ducklake_snapshot_lineage helps with time travel and conflict detection but does not scope schema_version per catalog.
The implication: catalog_id needs to propagate down through the metadata, at minimum into snapshot-level versioning, so cache invalidation and snapshot evolution can be scoped per catalog rather than instance-wide. The exact shape is part of this work.
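The invalidation problem can be made concrete with a small model. The sketch below contrasts a single instance-wide counter with a hypothetical per-catalog counter. The types GlobalVersion, PerCatalogVersion, and CachedCatalog are illustrative only and do not reflect the extension's actual cache or any proposed metadata schema.

```rust
use std::collections::HashMap;

/// Models DuckLake 1.0 behavior: one schema_version for the entire instance.
#[derive(Default)]
pub struct GlobalVersion {
    schema_version: u64,
}

impl GlobalVersion {
    /// Any DDL, in any catalog, bumps the single counter.
    pub fn apply_ddl(&mut self, _catalog: &str) {
        self.schema_version += 1;
    }

    /// Every catalog observes the same version.
    pub fn version_for(&self, _catalog: &str) -> u64 {
        self.schema_version
    }
}

/// Hypothetical alternative: a schema version tracked per catalog.
#[derive(Default)]
pub struct PerCatalogVersion {
    versions: HashMap<String, u64>,
}

impl PerCatalogVersion {
    /// DDL bumps only the counter of the catalog it touched.
    pub fn apply_ddl(&mut self, catalog: &str) {
        *self.versions.entry(catalog.to_string()).or_insert(0) += 1;
    }

    /// Catalogs that have seen no DDL report version 0.
    pub fn version_for(&self, catalog: &str) -> u64 {
        self.versions.get(catalog).copied().unwrap_or(0)
    }
}

/// Cached metadata for one catalog, tagged with the version it was loaded at.
pub struct CachedCatalog {
    loaded_at: u64,
}

impl CachedCatalog {
    /// The entry is stale as soon as the observed version differs.
    pub fn is_stale(&self, current: u64) -> bool {
        self.loaded_at != current
    }
}
```

With GlobalVersion, a cache entry loaded for catalog A becomes stale after apply_ddl("b"), even though nothing in A changed. With PerCatalogVersion, the same sequence leaves A's entry valid and invalidates only B's.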