A few months ago an agent team did substantial work on the ducklake-features/integration branch (265 commits, +23k net src LOC) to make this repo "1.0" feature-compliant with the DuckLake spec — DML write path, cross-backend metadata writers (PG / MySQL), CDC table functions, virtual columns, ALTER/DROP DDL, type-system fixes, and many correctness fixes.
The work was audited in 2026-05 and judged genuinely high-quality, but it cannot be merged as-is: it branched from upstream commit 59eb3da (PR datafusion-contrib#79) and upstream has since landed datafusion 53 / arrow 58, TableProvider::statistics(), and an embedded rowid design that conflicts with the fork's virtual-column work.
This ticket is the foundation — the prerequisite work that unblocks every other ticket in the set. Until this is done, no individual feature can be cleanly upstreamed.
Reference branch
ducklake-features/integration: read this branch's docs/ directory first.
Delete empty test scaffolds. The branch has ~30 test files that compile but contain zero #[test] functions — edge_case_tests.rs, merge_tests.rs, update_tests.rs, sql_dml_tests.rs, interop_type_tests.rs, cross_engine_*.rs, etc. Find them with for f in tests/*.rs; do n=$(rg -c "^#\[(tokio::)?test\]" "$f"); [ "${n:-0}" -eq 0 ] && echo "$f"; done and delete each. (Note: any test that genuinely covers a feature already has a populated file with the same scope — tests/update_tests.rs IS populated on the branch; the empty ones tend to be alternate-named stubs.)
Fix the parity_basic_crud_after_insert test. It fails on a decimal formatting parity issue between DuckDB and DataFusion ("20" vs "20.000000"). Either align the formatter or scope the assertion to the value, not the string.
Triage the 96 SLT failures in tests/sqllogictest_runner.rs. Categorize each by root cause (real conformance gap, harness quirk, type-format mismatch, unsupported syntax). Open follow-up tickets per category if needed — do not try to fix them all here. The branch's own docs/slt-failure-report.md is a starting point.
Drop src/compaction_functions.rs entirely. It is a DuckDB pass-through that shells out to the ducklake DuckDB extension via an in-memory ATTACH. Per project direction, no feature should depend on libduckdb being present at runtime for the actual logic. Compaction will be reintroduced later as a native implementation.
Push the rebased branch as a fresh integration target for the downstream tickets. Suggested name: upstream-port/integration.
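For the decimal parity item above, one way to scope the assertion to the value rather than the string is to normalize both renderings before comparing them. The following is a minimal sketch, assuming the test compares plain decimal strings with no exponent notation. `normalize_decimal` is a hypothetical helper, not something that exists on the branch.

```rust
/// Hypothetical helper: canonicalize a plain decimal string so that
/// "20" and "20.000000" compare equal. Assumes no exponent notation.
fn normalize_decimal(s: &str) -> String {
    let s = s.trim();
    // Only strings with a fractional part can carry insignificant zeros.
    if !s.contains('.') {
        return s.to_string();
    }
    // Drop trailing zeros, then a dangling decimal point.
    let trimmed = s.trim_end_matches('0').trim_end_matches('.');
    // Collapse an emptied-out value or a negative zero to "0".
    match trimmed {
        "" | "-" | "-0" => "0".to_string(),
        other => other.to_string(),
    }
}

fn main() {
    // The two renderings quoted in the ticket.
    let lhs = "20";
    let rhs = "20.000000";
    assert_eq!(normalize_decimal(lhs), normalize_decimal(rhs));
    // Values that genuinely differ must still differ after normalization.
    assert_ne!(normalize_decimal("20.5"), normalize_decimal("20.05"));
}
```

Normalizing in the test keeps both engines' formatters untouched. Aligning the formatter instead would be the better choice if the string form is itself part of the contract being tested.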
Acceptance criteria
cargo build --all-features is clean on the rebased branch
cargo test passes at least the active (non-SLT) test surface (target: parity with audit — 827+ passing, 0 ignored outside external-service tests, 0 failures outside the cataloged SLT triage)
cargo clippy --all-targets produces no new warnings vs upstream main
No empty test scaffold files remain
src/compaction_functions.rs is removed (and its tests, if any)
The only remaining duckdb crate references are in src/metadata_provider_duckdb.rs, src/error.rs, and src/lib.rs (catalog backend, not feature implementations)
SLT failures categorized in a follow-up doc/issue
Rebased branch pushed and linked from this issue
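The duckdb-reference criterion above can be checked mechanically. Here is a rough sketch, assuming that a substring match on `duckdb::` is an adequate proxy for "references the duckdb crate" (it would miss a bare `use duckdb;` and would flag mentions in comments). The function name, the in-memory file list, and `src/table_writer.rs` are illustrative only.

```rust
/// Files the ticket allows to reference the duckdb crate.
const ALLOWED: [&str; 3] = [
    "src/metadata_provider_duckdb.rs",
    "src/error.rs",
    "src/lib.rs",
];

/// Hypothetical check: return the paths that mention `duckdb::` but are
/// not on the allow-list. `files` holds (path, source text) pairs.
fn disallowed_duckdb_refs<'a>(files: &[(&'a str, &str)]) -> Vec<&'a str> {
    let mut offenders = Vec::new();
    for &(path, src) in files {
        if src.contains("duckdb::") && !ALLOWED.contains(&path) {
            offenders.push(path);
        }
    }
    offenders
}

fn main() {
    // Made-up inputs; a real check would read the files under src/.
    let files = [
        ("src/metadata_provider_duckdb.rs", "use duckdb::Connection;"),
        ("src/error.rs", "impl From<duckdb::Error> for MyError {}"),
        ("src/table_writer.rs", "use duckdb::Connection;"), // hypothetical file
        ("src/types.rs", "pub enum Kind { Int }"),
    ];
    for path in disallowed_duckdb_refs(&files) {
        println!("unexpected duckdb reference: {path}");
    }
}
```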
Out of scope
Implementing any of the individual feature workstreams (separate tickets)
Reconciling the virtual_column_exec vs row_id design (separate ticket)
Filling in genuinely-missing test coverage (separate tickets per feature)
Notes
Audit-identified correctness bug to watch during the rebase, if it survives: src/cdc_common.rs:107-114. Duplicate-column projection collapses (SELECT col, col returns only one source position), and the test test_cdc_projection_duplicate_columns asserts the buggy behavior. Do not fix it here; fix it in the CDC ticket. Just don't let it get worse during the rebase.
The branch's docs/ directory (56 files) is agent process artifacts (R1–R11 review cycles, retrospectives, gap analyses). It should NOT be carried into upstream PRs. Move it to a separate archive branch or .audit/ directory after rebase.
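To make the duplicate-column note above concrete: the described symptom (SELECT col, col yielding only one source position) is what happens when a projection is resolved through a name-keyed map instead of position by position. The sketch below is illustrative only and is not the code at src/cdc_common.rs. Both function names are hypothetical, and whether the real code uses a map is an assumption.

```rust
use std::collections::BTreeMap;

/// Buggy shape: keying by column name collapses duplicates, so a
/// request for ["col", "col"] resolves to a single source position.
fn project_collapsing(schema: &[&str], requested: &[&str]) -> Vec<usize> {
    let mut by_name: BTreeMap<&str, usize> = BTreeMap::new();
    for &name in requested {
        if let Some(idx) = schema.iter().position(|c| *c == name) {
            // A repeated name overwrites the earlier entry.
            by_name.insert(name, idx);
        }
    }
    by_name.into_values().collect()
}

/// Position-preserving shape: one output index per requested column,
/// in request order, duplicates included.
fn project_preserving(schema: &[&str], requested: &[&str]) -> Vec<usize> {
    requested
        .iter()
        .filter_map(|name| schema.iter().position(|c| c == name))
        .collect()
}

fn main() {
    let schema = ["id", "col", "ts"];
    let requested = ["col", "col"];
    assert_eq!(project_collapsing(&schema, &requested), vec![1]);
    assert_eq!(project_preserving(&schema, &requested), vec![1, 1]);
}
```

A regression guard for the rebase only needs to pin whichever behavior the existing test asserts. Switching to the position-preserving form belongs in the CDC ticket.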
Context
This repo is a fork of datafusion-contrib/datafusion-ducklake, a DataFusion extension that provides read access to DuckLake catalogs.
In the reference branch's docs/ directory, read these first: docs/project-status.md, docs/2026-03-05-retrospective.md, docs/2026-03-07-snapshot-awareness-audit.md.
Scope
Rebase ducklake-features/integration onto current upstream main (datafusion-contrib/datafusion-ducklake:main). Conflicts are expected in:
src/metadata_provider_*.rs (upstream PR datafusion-contrib/datafusion-ducklake#116, "fix(writer): align ducklake_column / ducklake_data_file schema with the DuckLake spec", reshaped the schema)
src/row_id.rs vs src/virtual_column_exec.rs (upstream PR datafusion-contrib/datafusion-ducklake#115, "feat: DuckLake row lineage (rowid virtual column)", added a competing design; leave the conflict in place for ticket #11, "fix: preserve review-52 delete-file error handling follow-up", to resolve)
src/types.rs (partial overlap with upstream PRs datafusion-contrib/datafusion-ducklake#82, "fix: normalize type aliases and add promotion rules for schema evolution", and datafusion-contrib/datafusion-ducklake#89, "feat: support list/array column types in DuckLake type mapping")
For each conflict, prefer keeping the fork's superset behavior but adopting the upstream type/schema shape so the rest of upstream still compiles against it.