Skip to content

Reconcile virtual_column_exec with upstream row_id (DESIGN required first) #22

@zfarrell

Description

@zfarrell

Context

This is one ticket in a series carrying forward #12 foundation work. Read #12 first for repo context.

This is the highest-risk ticket in the set because it requires a design reconciliation, not just a port. The fork added src/virtual_column_exec.rs (5 virtual columns: rowid, snapshot_id, file_index, filename, file_row_number). Independently, upstream landed src/row_id.rs in PR datafusion-contrib#115, providing a single rowid virtual column via a different mechanism (an embedded _ducklake_internal_row_id field-id protocol).

We need one cohesive design, not two. The fork's surface is a superset, but upstream's design choice (embedded field-id) is more elegant. The right answer probably involves keeping the upstream mechanism and extending it to cover the remaining four columns.

Reference branches

Scope

  1. Read both implementations end-to-end. Spend time on this — do not implement before understanding the design intent of each.
  2. Read the DuckLake spec section on virtual columns to confirm which columns are spec-mandated vs nice-to-have.
  3. Propose a unified design in a comment on this issue before implementing. The likely shape:
    • Keep upstream's embedded-field-id mechanism for rowid (it interoperates with the DuckLake spec's row-lineage convention)
    • Add snapshot_id, file_index, filename, file_row_number as additional virtual columns under the same mechanism
    • Or, if the mechanisms genuinely cannot unify, document why and pick one cleanly
  4. Implement the agreed design. Replace both src/virtual_column_exec.rs and the supersedable parts of src/row_id.rs.
  5. Port over the fork's tests (virtual_column_tests.rs, virtual_column_extended_tests.rs) — adjust assertions to match the chosen mechanism, but preserve the behavioral coverage (multi-file rowid contiguity, WHERE-clause filtering on virtual columns, NULL handling when row_id_start/snapshot_id is missing in the catalog).

Acceptance criteria

  • Design proposal posted as a comment and approved before implementation begins
  • All 5 virtual columns work via a single mechanism
  • Existing upstream rowid semantics preserved (no regression for downstream consumers)
  • tests/virtual_column_tests.rs and tests/virtual_column_extended_tests.rs adapted and passing
  • WHERE-clause pushdown: SELECT * FROM t WHERE rowid = N does the right thing (filter pushdown is correct, no full scan when avoidable)
  • Single-partition coercion preserved when file_row_number or rowid is requested (fork uses CoalescePartitionsExec for this)
  • Checked arithmetic for rowid overflow preserved
  • No duckdb crate imports

Dependencies

Out of scope

  • New virtual columns beyond the 5-column union — _change_type for CDC views is a separate ticket
  • Writeable virtual columns — not a thing in DuckLake

Notes

  • Audit verdict on the fork's virtual_column_exec.rs: "solid — correctly forces single partition when row-number-dependent columns are requested via CoalescePartitionsExec, uses checked arithmetic for rowid overflow, handles missing row_id_start/snapshot_id as null."
  • The reconciliation matters because wrong choice here means downstream consumers either break or have to migrate twice. This is the one ticket where "going slower" is the right answer.
  • If reconciliation reveals upstream's embedded-field-id design has a fundamental gap (e.g., can't express filename without breaking the embedding), that's evidence the upstream design needs to evolve — raise it as an upstream issue rather than silently working around it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions