[epic] Full feature parity with DuckLake v1.0 #114

@zfarrell

Summary

The DuckLake format reached v1.0 (production-ready, backward-compat guaranteed) on 2026-04-13 alongside DuckDB v1.5.2. The v1.0 catalog schema introduces new metadata tables and features beyond what datafusion-ducklake currently understands. This epic tracks closing that gap so a v1.0 catalog written by the ducklake DuckDB extension is fully readable and writable through DataFusion.

Already covered

INSERT/DELETE writes, S3 write support, SQLite/MySQL/Postgres metadata backends, list/array types, PME-encrypted Parquet reads, byte-size statistics, snapshot isolation, basic schema evolution (type promotion, column rename). Data inlining and the async metadata provider are on a branch in #106; land that rather than rebuild.

Outstanding work

Sub-tasks to split out:

  • Catalog schema version validation. Read the v1.0 schema version from ducklake_metadata; reject unknown versions with a clear error. Today the provider doesn't check at all.
  • Partitioning. Honor partition_column / partition_info / file_partition_value for partition pruning. Support v1.0's bucket(N, column) transform (murmur3, Iceberg-compatible) alongside identity/year/month/day/hour. Flagged as TODO in CLAUDE.md.
  • Sorted tables. Read sort_info / sort_expression so DataFusion can exploit ordering for filter and limit pushdown; preserve sort on writes.
  • Deletion vectors. Read v1.0's Iceberg-v3-compatible deletion vectors (roaring bitmaps in Puffin files) in addition to the existing positional-delete-file path in delete_filter.rs. (Note: marked experimental in the v1.0 announcement.)
  • Column-level statistics. Surface file_column_stats / table_column_stats / table_stats into DataFusion's Statistics. Builds on #112 (feat(table): implement TableProvider::statistics() with byte-size aggregate).
  • Struct and map types. Both currently produce an error in types.rs. Precondition for nested GEOMETRY and VARIANT.
  • Geometry type. Replace today's Binary mapping with proper GEOMETRY. Read bounding-box stats from file_column_stats for spatial filter pushdown. Support nesting inside structs/lists/maps.
  • Variant type. Map VARIANT to Arrow; read file_variant_stats for shredded-sub-field file skipping.
  • Column mapping / name mapping. Use column_mapping and name_mapping to resolve renamed/dropped/re-added columns across snapshots; verify field-id-based Parquet reads.
  • Time travel. Expose user-facing snapshot selection (read at snapshot ID / timestamp). Snapshot pinning exists internally but isn't surfaced.
  • Views and macros. Surface view, macro, macro_impl, macro_parameters in DataFusion's catalog (at minimum, list and read views).
  • Tags. Read snapshot/table/column tags from tag / column_tag and expose via metadata APIs.
  • File lifecycle. Respect files_scheduled_for_deletion so the read path skips soft-deleted files; honor on the write path during compaction.
  • Add existing Parquet without copy. v1.0 supports registering pre-existing Parquet into the catalog without rewriting; expose on the write path.
  • UPDATE and ALTER TABLE. Today's writes cover INSERT and DELETE; UPDATE and schema mutations (add/drop/rename column, etc.) are missing.
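
To make the time-travel item concrete, here is a minimal sketch of how a user-facing snapshot selector could resolve to one pinned snapshot. Everything named here (`SnapshotRow`, `SnapshotSelector`, `resolve_snapshot`, the microsecond timestamp unit) is hypothetical and is not the crate's existing API; it assumes the snapshot rows have already been loaded from the catalog.

```rust
/// Hypothetical in-memory view of one snapshot row loaded from the catalog.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SnapshotRow {
    pub snapshot_id: i64,
    /// Commit time in microseconds since the Unix epoch (assumed unit).
    pub commit_micros: i64,
}

/// Hypothetical user-facing selector: latest, explicit ID, or "as of" a timestamp.
#[derive(Debug, Clone, Copy)]
pub enum SnapshotSelector {
    Latest,
    Id(i64),
    AsOf(i64),
}

/// Resolve a selector to exactly one snapshot, or return a descriptive error.
pub fn resolve_snapshot(
    snapshots: &[SnapshotRow],
    selector: SnapshotSelector,
) -> Result<SnapshotRow, String> {
    match selector {
        // Highest snapshot ID wins.
        SnapshotSelector::Latest => snapshots
            .iter()
            .copied()
            .max_by_key(|s| s.snapshot_id)
            .ok_or_else(|| "catalog has no snapshots".to_string()),
        // Exact match on the ID; anything else is an error.
        SnapshotSelector::Id(id) => snapshots
            .iter()
            .copied()
            .find(|s| s.snapshot_id == id)
            .ok_or_else(|| format!("snapshot {id} not found")),
        // Most recent snapshot committed at or before the requested time;
        // ties on commit time are broken by the higher snapshot ID.
        SnapshotSelector::AsOf(ts) => snapshots
            .iter()
            .copied()
            .filter(|s| s.commit_micros <= ts)
            .max_by_key(|s| (s.commit_micros, s.snapshot_id))
            .ok_or_else(|| format!("no snapshot at or before timestamp {ts}")),
    }
}
```

The resolved `SnapshotRow` could then feed the snapshot pinning that already exists internally, so the new surface area would be limited to the selector itself.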

Definition of done

  • Round-trip parity tests against catalogs produced by the ducklake DuckDB extension at v1.0 (DuckDB ≥ 1.5.2), exercising each feature above.
  • Reverse round-trip: catalogs written by datafusion-ducklake are readable by the DuckDB extension.
  • README and CLAUDE.md updated to reflect v1.0 parity (drop the "Current Limitations" list; refresh the "DuckDB only" note).
  • Catalog schema version negotiation: unknown versions are explicitly rejected with a clear error.
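
A rough sketch of what "explicitly rejected with a clear error" could look like. It assumes the version has already been read from ducklake_metadata as an optional string; the accepted list, the `VersionError` type, and `validate_catalog_version` are illustrative assumptions rather than the crate's current code.

```rust
use std::fmt;

/// Catalog schema versions this sketch accepts. The exact strings stored in
/// ducklake_metadata are an assumption here and should be checked against
/// real v1.0 catalogs.
const SUPPORTED_VERSIONS: &[&str] = &["1.0"];

/// Hypothetical error type for version negotiation.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionError {
    /// No version entry was found in the metadata table.
    Missing,
    /// A version was found, but this provider does not understand it.
    Unsupported(String),
}

impl fmt::Display for VersionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VersionError::Missing => {
                write!(f, "catalog has no schema version in ducklake_metadata")
            }
            VersionError::Unsupported(v) => write!(
                f,
                "unsupported DuckLake catalog schema version '{}'; supported: {}",
                v,
                SUPPORTED_VERSIONS.join(", ")
            ),
        }
    }
}

impl std::error::Error for VersionError {}

/// Validate the version string read from the catalog (None if the row is absent).
pub fn validate_catalog_version(stored: Option<&str>) -> Result<(), VersionError> {
    // Treat an absent or blank value as missing.
    let version = stored
        .map(str::trim)
        .filter(|v| !v.is_empty())
        .ok_or(VersionError::Missing)?;
    if SUPPORTED_VERSIONS.iter().any(|v| *v == version) {
        Ok(())
    } else {
        Err(VersionError::Unsupported(version.to_string()))
    }
}
```

Running this check once when the catalog is opened, before any other metadata table is read, would keep an unknown layout from surfacing later as a confusing query error.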
