Foundation: rebase integration onto upstream, drop pass-throughs, triage SLT failures #12

@zfarrell

Context

This repo is a fork of datafusion-contrib/datafusion-ducklake, a DataFusion extension that provides read access to DuckLake catalogs.

A few months ago an agent team did substantial work on the ducklake-features/integration branch (265 commits, +23k net src LOC) to make this repo "1.0" feature-compliant with the DuckLake spec — DML write path, cross-backend metadata writers (PG / MySQL), CDC table functions, virtual columns, ALTER/DROP DDL, type-system fixes, and many correctness fixes.

The work was audited in 2026-05 and judged genuinely high-quality, but it cannot be merged as-is: it branched from upstream commit 59eb3da (PR datafusion-contrib#79) and upstream has since landed datafusion 53 / arrow 58, TableProvider::statistics(), and an embedded rowid design that conflicts with the fork's virtual-column work.

This ticket is the foundation — the prerequisite work that unblocks every other ticket in the set. Until this is done, no individual feature can be cleanly upstreamed.

Reference branch

ducklake-features/integration — read this branch's docs/ directory first, especially:

  • docs/project-status.md
  • docs/2026-03-05-retrospective.md
  • docs/2026-03-07-snapshot-awareness-audit.md

Scope

  1. Rebase ducklake-features/integration onto current upstream main (datafusion-contrib/datafusion-ducklake:main). Conflicts are expected in:
       • src/metadata_provider_*.rs: upstream PR datafusion-contrib/datafusion-ducklake#116 ("fix(writer): align ducklake_column / ducklake_data_file schema with the DuckLake spec") reshaped the schema.
       • src/row_id.rs vs src/virtual_column_exec.rs: upstream PR datafusion-contrib/datafusion-ducklake#115 ("feat: DuckLake row lineage (rowid virtual column)") added a competing design. Leave the conflict in place for ticket #11 ("fix: preserve review-52 delete-file error handling follow-up") to resolve.
       • src/types.rs: partial overlap with upstream PRs datafusion-contrib/datafusion-ducklake#82 ("fix: normalize type aliases and add promotion rules for schema evolution") and datafusion-contrib/datafusion-ducklake#89 ("feat: support list/array column types in DuckLake type mapping").
     For each conflict, prefer keeping the fork's superset behavior but adopt the upstream type/schema shape so the rest of upstream still compiles against it.
  2. Delete empty test scaffolds. The branch has ~30 test files that compile but contain zero #[test] functions: edge_case_tests.rs, merge_tests.rs, update_tests.rs, sql_dml_tests.rs, interop_type_tests.rs, cross_engine_*.rs, etc. Find them with the following command, then delete each file it prints:

       for f in tests/*.rs; do n=$(rg -c "^#\[(tokio::)?test\]" "$f"); [ "${n:-0}" -eq 0 ] && echo "$f"; done

     (Note: any test that genuinely covers a feature already has a populated file with the same scope. tests/update_tests.rs IS populated on the branch; the empty ones tend to be alternate-named stubs.)
  3. Fix the parity_basic_crud_after_insert test. It fails on a decimal formatting parity issue between DuckDB and DataFusion ("20" vs "20.000000"). Either align the formatter or scope the assertion to the value, not the string.
  4. Triage the 96 SLT failures in tests/sqllogictest_runner.rs. Categorize each by root cause (real conformance gap, harness quirk, type-format mismatch, unsupported syntax). Open follow-up tickets per category if needed — do not try to fix them all here. The branch's own docs/slt-failure-report.md is a starting point.
  5. Drop src/compaction_functions.rs entirely. It is a DuckDB pass-through that shells out to the ducklake DuckDB extension via an in-memory ATTACH. Per project direction, no feature should depend on libduckdb being present at runtime for the actual logic. Compaction will be reintroduced later as a native implementation.
  6. Push the rebased branch as a fresh integration target for the downstream tickets. Suggested name: upstream-port/integration.
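
For step 3, one way to scope the assertion to the value rather than the string is to normalize both sides before comparing. The sketch below is only an illustration: normalize_decimal and decimal_text_eq are hypothetical helpers, not functions that exist on the branch, and the sketch assumes plain decimal text (optional sign, digits, optional fraction) with no exponent notation. Non-numeric cells fall through to an ordinary string comparison.

```rust
/// Normalize plain decimal text so that "20" and "20.000000" compare equal.
/// Hypothetical helper for illustration; not part of the branch.
/// Assumes no exponent notation (e.g. "2e1" is left untouched).
fn normalize_decimal(s: &str) -> String {
    let s = s.trim();
    // Split off an optional sign.
    let (sign, digits) = match s.strip_prefix('-') {
        Some(rest) => ("-", rest),
        None => ("", s.strip_prefix('+').unwrap_or(s)),
    };
    // Split into integer and fractional parts.
    let (int_part, frac_part) = match digits.split_once('.') {
        Some((i, f)) => (i, f),
        None => (digits, ""),
    };
    // Drop leading zeros of the integer part and trailing zeros of the fraction.
    let int_part = int_part.trim_start_matches('0');
    let frac_part = frac_part.trim_end_matches('0');
    let int_part = if int_part.is_empty() { "0" } else { int_part };

    let is_zero = int_part == "0" && frac_part.is_empty();
    let mut out = String::new();
    if !is_zero {
        // Avoid producing "-0".
        out.push_str(sign);
    }
    out.push_str(int_part);
    if !frac_part.is_empty() {
        out.push('.');
        out.push_str(frac_part);
    }
    out
}

/// Compare two rendered cells by decimal value instead of by exact string.
fn decimal_text_eq(a: &str, b: &str) -> bool {
    normalize_decimal(a) == normalize_decimal(b)
}
```

With this, decimal_text_eq("20", "20.000000") holds while decimal_text_eq("20", "20.01") does not. Aligning the formatter instead is the other option named in step 3; this sketch only covers the assertion-side approach.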

Acceptance criteria

  • cargo build --all-features clean on the rebased branch
  • cargo test passes at least the active (non-SLT) test surface (target: parity with audit — 827+ passing, 0 ignored outside external-service tests, 0 failures outside the cataloged SLT triage)
  • cargo clippy --all-targets produces no new warnings vs upstream main
  • No empty test scaffold files remain
  • src/compaction_functions.rs is removed (and its tests, if any)
  • The only remaining duckdb crate references are in src/metadata_provider_duckdb.rs, src/error.rs, and src/lib.rs (catalog backend, not feature implementations)
  • SLT failures categorized in a follow-up doc/issue
  • Rebased branch pushed and linked from this issue

Out of scope

  • Implementing any of the individual feature workstreams (separate tickets)
  • Reconciling the virtual_column_exec vs row_id design (separate ticket)
  • Filling in genuinely-missing test coverage (separate tickets per feature)

Notes

  • Audit-identified correctness bug to watch for during the rebase, if it survives: src/cdc_common.rs:107-114. Duplicate-column projection collapses (SELECT col, col returns only one source position), and the test test_cdc_projection_duplicate_columns asserts the buggy behavior. Do not fix it here; fix it in the CDC ticket. Just don't let the behavior get worse during the rebase.
  • The branch's docs/ directory (56 files) consists of agent process artifacts (R1–R11 review cycles, retrospectives, gap analyses). It should NOT be carried into upstream PRs. Move it to a separate archive branch or .audit/ directory after rebase.
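
To make the duplicate-column note concrete, here is a generic illustration of how a projection can collapse duplicates. It does not reproduce the code in src/cdc_common.rs; both functions are hypothetical and only contrast a per-request resolution with a name-keyed one.

```rust
use std::collections::BTreeMap;

/// Resolve every requested column to its source index, one entry per request.
/// `SELECT col, col` therefore yields two positions.
fn resolve_projection(source_cols: &[&str], requested: &[&str]) -> Option<Vec<usize>> {
    requested
        .iter()
        .map(|name| source_cols.iter().position(|c| c == name))
        .collect()
}

/// Name-keyed resolution: a repeated name overwrites its earlier entry,
/// so `SELECT col, col` collapses to a single position.
/// Hypothetical illustration of the failure mode, not the branch's code.
fn resolve_projection_collapsing(
    source_cols: &[&str],
    requested: &[&str],
) -> Option<Vec<usize>> {
    let mut by_name = BTreeMap::new();
    for name in requested {
        let idx = source_cols.iter().position(|c| c == name)?;
        by_name.insert(*name, idx);
    }
    Some(by_name.into_values().collect())
}
```

For a source schema of ["id", "col"] and a request of ["col", "col"], the first function returns two indices and the second returns one. A regression check for the CDC ticket could assert on the number of output columns rather than on the current collapsed result.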
