Partitioned INSERT + write-path expansion (stats, footer-size, partition layout) #25

@zfarrell

Context

This is one ticket in a series carrying forward #12 foundation work. Read #12 first for repo context.

Upstream has a basic src/insert_exec.rs and src/table_writer.rs. The fork expanded these substantially: insert_exec.rs +965 lines, table_writer.rs +1875 lines. The expansion covers partitioned INSERT (Hive layout: partition_col=value/...), column-stats collection during write, multi-file INSERTs, footer-size capture for read-side optimization, encryption-feature gating, and the UploadCleanupGuard orphan-cleanup pattern.
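
As an illustration of the Hive layout mentioned above, the following is a minimal sketch of how a partition directory could be derived from a row's partition values. The names `encode_segment` and `partition_dir` are hypothetical and are not taken from the fork. The escape set used here (everything outside the RFC 3986 unreserved characters) is an assumption; the exact encoding rules should be checked against the DuckLake spec.

```rust
/// Percent-encode one partition key or value for use as a path segment.
/// Assumption: every byte outside the RFC 3986 "unreserved" set is escaped.
/// The set the DuckLake spec actually requires may differ.
fn encode_segment(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for &b in raw.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char);
            }
            // Multi-byte UTF-8 characters are escaped byte by byte.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the Hive-style relative directory for one set of partition values,
/// in the order the partition columns are given.
fn partition_dir(parts: &[(&str, &str)]) -> String {
    let mut dir = String::new();
    for (col, val) in parts {
        dir.push_str(&encode_segment(col));
        dir.push('=');
        dir.push_str(&encode_segment(val));
        dir.push('/');
    }
    dir
}
```

With this sketch, `partition_dir(&[("region", "US")])` yields `region=US/`, and a value such as `a/b=c d` becomes `a%2Fb%3Dc%20d`, so separators inside a value cannot be mistaken for directory or key/value boundaries.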

Reference branch

ducklake-features/integration:

  • src/insert_exec.rs — physical exec, expanded
  • src/table_writer.rs — shared write helpers, substantially expanded
  • Tests: tests/write_partition_tests.rs (400 lines, partitioned INSERT round-trips with filesystem inspection of region=US/region=EU/ layout), tests/write_tests.rs (+324 lines), tests/stats_tests.rs (447 lines, column-stats round-trips)

Scope

  1. Port the partitioned INSERT path. Partition columns identified from the table's catalog metadata; each output batch routed to the right partition_col=value/ directory; partition values URL-encoded as DuckLake spec requires.
  2. Port column-stats collection. During Parquet write, capture Statistics (min/max/null_count, distinct_count where cheap) and register them in ducklake_column_stats, or wherever the spec puts them. Verify the location against the current upstream schema after the rebase in #12 (Foundation: rebase integration onto upstream, drop pass-throughs, triage SLT failures).
  3. Port footer-size capture and persistence. The fork stores Parquet footer size in metadata for read-side with_metadata_size_hint() optimization; this is already exploited on the read path. Ensure the write path persists it.
  4. Port the UploadCleanupGuard pattern: uploaded files are cleaned up on commit failure. The pattern is already partially referenced in #17 (DELETE physical execution, MOR delete files) and #18 (UPDATE physical execution, MOR delete + insert), but the source-of-truth implementation lives here.
  5. Encryption gating: writes honor the encryption feature flag the same way reads do. Do not break the build when the flag is off.
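
The orphan-cleanup idea in item 4 can be sketched as a drop guard. This is not the fork's UploadCleanupGuard, whose API is not shown in this ticket; `CleanupGuard`, `track`, and `commit` are hypothetical names, and the sketch works on local filesystem paths with synchronous deletes, whereas the real implementation presumably targets an object store.

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical stand-in for the fork's UploadCleanupGuard.
/// Tracks files written during an INSERT and deletes them on drop
/// unless the transaction was marked as committed.
struct CleanupGuard {
    paths: Vec<PathBuf>,
    committed: bool,
}

impl CleanupGuard {
    fn new() -> Self {
        Self { paths: Vec::new(), committed: false }
    }

    /// Register a file that must be removed if the commit never happens.
    fn track(&mut self, path: PathBuf) {
        self.paths.push(path);
    }

    /// Consume the guard after a successful catalog commit; files are kept.
    fn commit(mut self) {
        self.committed = true;
    }
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        if !self.committed {
            for p in &self.paths {
                // Best effort: a failed delete must not panic inside drop.
                let _ = fs::remove_file(p);
            }
        }
    }
}
```

Because cleanup runs in `Drop`, it also fires on early returns via `?` and on panics that unwind, which is what makes the pattern suitable for the "commit failure mid-INSERT" case.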

Acceptance criteria

  • tests/write_partition_tests.rs passes — verify region=US/, region=EU/ Hive directories actually contain the right Parquet files with the right rows
  • tests/stats_tests.rs passes: TableProvider::statistics() returns correct min/max/null_count after multi-append (the test uses Precision::Inexact(ScalarValue::Int32(Some(10))) style assertions)
  • Partition values containing special characters (/, =, spaces, unicode) are correctly URL-encoded
  • Concurrent INSERT into the same partition from two transactions: both commit successfully (additive)
  • Commit failure mid-INSERT: no orphan files left on disk (verify filesystem inspection in test)
  • No duckdb crate imports
  • Footer-size is captured during write and read back during scan, eliminating the second-read on Parquet open
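
For the footer-size criterion, a minimal sketch of where the number comes from: a Parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic. The helper name `footer_size` is hypothetical, and the choice to include the 8-byte trailer in the returned value is an assumption; the ticket does not state whether the fork stores the length with or without the trailer. Files with an encrypted footer end in a different magic and are not handled here.

```rust
/// Given the trailing bytes of a Parquet file, return the number of bytes
/// a reader must fetch from the end of the file to get the whole footer:
/// the serialized metadata plus the 8-byte trailer (length + magic).
/// Returns None if the slice is too short or the magic does not match.
fn footer_size(tail: &[u8]) -> Option<u64> {
    if tail.len() < 8 {
        return None;
    }
    let trailer = &tail[tail.len() - 8..];
    let (len_bytes, magic) = trailer.split_at(4);
    if magic != b"PAR1" {
        return None;
    }
    // The metadata length is stored as a little-endian u32.
    let meta_len =
        u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]);
    Some(u64::from(meta_len) + 8)
}
```

On the write path the writer already knows this value when it closes the file, so re-parsing the trailer as above is mainly useful in a test that checks the persisted number against the bytes on disk.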

Dependencies

Out of scope

  • Bucketing / hash-partitioning beyond Hive value-partitioning — DuckLake spec doesn't define this
  • Adaptive file-size targeting (the compaction question) — separate concern

Notes
