Context
This is one ticket in a series carrying forward #12 foundation work. Read #12 first for repo context.
This is the highest-risk ticket in the set because it requires a design reconciliation, not just a port. The fork added src/virtual_column_exec.rs (5 virtual columns: rowid, snapshot_id, file_index, filename, file_row_number). Independently, upstream landed src/row_id.rs in PR datafusion-contrib#115, providing a single rowid virtual column via a different mechanism (an embedded _ducklake_internal_row_id field-id protocol).
We need one cohesive design, not two. The fork's surface is a superset, but upstream's design choice (embedded field-id) is more elegant. The right answer probably involves keeping the upstream mechanism and extending it to cover the remaining four columns.
Reference branches
Scope
- Read both implementations end-to-end. Spend time on this — do not implement before understanding the design intent of each.
- Read the DuckLake spec section on virtual columns to confirm which columns are spec-mandated vs nice-to-have.
- Propose a unified design in a comment on this issue before implementing. The likely shape:
- Keep upstream's embedded-field-id mechanism for
rowid (it interoperates with the DuckLake spec's row-lineage convention)
- Add
snapshot_id, file_index, filename, file_row_number as additional virtual columns under the same mechanism
- Or, if the mechanisms genuinely cannot unify, document why and pick one cleanly
- Implement the agreed design. Replace both
src/virtual_column_exec.rs and the supersedable parts of src/row_id.rs.
- Port over the fork's tests (
virtual_column_tests.rs, virtual_column_extended_tests.rs) — adjust assertions to match the chosen mechanism, but preserve the behavioral coverage (multi-file rowid contiguity, WHERE-clause filtering on virtual columns, NULL handling when row_id_start/snapshot_id is missing in the catalog).
Acceptance criteria
Dependencies
Out of scope
- New virtual columns beyond the 5-column union —
_change_type for CDC views is a separate ticket
- Writeable virtual columns — not a thing in DuckLake
Notes
- Audit verdict on the fork's
virtual_column_exec.rs: "solid — correctly forces single partition when row-number-dependent columns are requested via CoalescePartitionsExec, uses checked arithmetic for rowid overflow, handles missing row_id_start/snapshot_id as null."
- The reconciliation matters because wrong choice here means downstream consumers either break or have to migrate twice. This is the one ticket where "going slower" is the right answer.
- If reconciliation reveals upstream's embedded-field-id design has a fundamental gap (e.g., can't express
filename without breaking the embedding), that's evidence the upstream design needs to evolve — raise it as an upstream issue rather than silently working around it.
Context
This is one ticket in a series carrying forward #12 foundation work. Read #12 first for repo context.
This is the highest-risk ticket in the set because it requires a design reconciliation, not just a port. The fork added
src/virtual_column_exec.rs(5 virtual columns:rowid,snapshot_id,file_index,filename,file_row_number). Independently, upstream landedsrc/row_id.rsin PR datafusion-contrib#115, providing a singlerowidvirtual column via a different mechanism (an embedded_ducklake_internal_row_idfield-id protocol).We need one cohesive design, not two. The fork's surface is a superset, but upstream's design choice (embedded field-id) is more elegant. The right answer probably involves keeping the upstream mechanism and extending it to cover the remaining four columns.
Reference branches
ducklake-features/integration—src/virtual_column_exec.rs,tests/virtual_column_tests.rs(6 tests, 7-column schema assertions),tests/virtual_column_extended_tests.rs(13 tests, multi-file catalogs, contiguity assertions)datafusion-contrib/datafusion-ducklake:main—src/row_id.rs, PR feat: DuckLake row lineage (rowid virtual column) datafusion-contrib/datafusion-ducklake#115 discussionScope
rowid(it interoperates with the DuckLake spec's row-lineage convention)snapshot_id,file_index,filename,file_row_numberas additional virtual columns under the same mechanismsrc/virtual_column_exec.rsand the supersedable parts ofsrc/row_id.rs.virtual_column_tests.rs,virtual_column_extended_tests.rs) — adjust assertions to match the chosen mechanism, but preserve the behavioral coverage (multi-file rowid contiguity, WHERE-clause filtering on virtual columns, NULL handling whenrow_id_start/snapshot_idis missing in the catalog).Acceptance criteria
rowidsemantics preserved (no regression for downstream consumers)tests/virtual_column_tests.rsandtests/virtual_column_extended_tests.rsadapted and passingSELECT * FROM t WHERE rowid = Ndoes the right thing (filter pushdown is correct, no full scan when avoidable)file_row_numberorrowidis requested (fork usesCoalescePartitionsExecfor this)duckdbcrate importsDependencies
Out of scope
_change_typefor CDC views is a separate ticketNotes
virtual_column_exec.rs: "solid — correctly forces single partition when row-number-dependent columns are requested via CoalescePartitionsExec, uses checked arithmetic for rowid overflow, handles missing row_id_start/snapshot_id as null."filenamewithout breaking the embedding), that's evidence the upstream design needs to evolve — raise it as an upstream issue rather than silently working around it.