Skip to content

CDC table functions (with cdc_common duplicate-column bug fix) #21

@zfarrell

Description

@zfarrell

Context

This is one ticket in a series carrying forward #12 foundation work. Read #12 first for repo context.

DuckLake's CDC surface is exposed as table functions: ducklake_table_changes(table, from_snap, to_snap), ducklake_table_insertions, ducklake_table_deletions, ducklake_current_snapshot, ducklake_last_committed_snapshot. Upstream has placeholder implementations of table_changes.rs and table_deletions.rs; the fork has substantially expanded these (+340 lines and +364 lines respectively) and added cdc_common.rs plus table_insertions.rs plus expanded table_functions.rs (+873).

Reference branch

ducklake-features/integration:

  • src/cdc_common.rs — shared CDC projection logic (⚠️ has a known bug — see below)
  • src/table_insertions.rs — new TableProvider for insertion CDC
  • src/table_changes.rs — expanded
  • src/table_deletions.rs — expanded
  • src/table_functions.rs — registration and dispatch
  • Tests: tests/time_travel_tests.rs (14 tests), tests/table_function_tests.rs (10 tests)

Scope

  1. Port cdc_common.rs, table_insertions.rs, the expanded table_changes.rs / table_deletions.rs, and the new entries in table_functions.rs.
  2. Fix the audit-identified bug in cdc_common.rs (lines ~107–114). Duplicate-column projections (SELECT col, col) collapse to a single source position because table_indices.iter().position(|&ti| ti == idx) returns the first match. The accompanying test test_cdc_projection_duplicate_columns asserts the buggy behavior (assert_eq!(reorder[1], 0)) — fix the test too. The correct behavior: each output slot maps to its own batch column index.
  3. Verify the snapshot-bounded predicates from R11-S-001/004/005/006/018/022/025 (referenced in commit messages) are in place — these were a hardening round on snapshot-awareness.
  4. ducklake_current_snapshot and ducklake_last_committed_snapshot are intentionally simple — port them verbatim.
  5. Ensure none of the table functions reach for duckdb::Connection. All CDC work should be native — operate on the MetadataProvider and Parquet I/O directly.

Acceptance criteria

  • tests/time_travel_tests.rs all 14 tests pass
  • tests/table_function_tests.rs all 10 tests pass
  • New test added covering SELECT col, col FROM ducklake_table_changes(...) that confirms each output column is independent (the bug fix)
  • test_cdc_projection_duplicate_columns updated to assert the correct mapping (each output slot has its own source index)
  • No duckdb crate imports in any of the affected files
  • Schema-shape parity with DuckDB's reference implementation (column names + types) for each table function — table_function_tests covers this; preserve it

Dependencies

Out of scope

  • Streaming CDC (SUBSCRIBE semantics) — DuckLake spec does not require it
  • Schema-evolution-aware CDC where the column set differs across the requested range — error out clearly; full support is a separate effort

Notes

  • Audit verdict on cdc_common.rs: "concerning — has a real bug in duplicate-column projection handling." This is the only correctness defect found across the entire fork audit; fixing it is the primary correctness goal of this ticket.
  • Audit verdict on table_insertions.rs: "solid — straightforward TableProvider mirroring upstream table.rs patterns."

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions