Reconcile virtual_column_exec with upstream row_id (DESIGN required first)

## Context

This is one ticket in a series carrying forward [#12 foundation work](https://github.com/zfarrell/datafusion-ducklake/issues/12). **Read #12 first** for repo context.

This is the **highest-risk** ticket in the set because it requires a design reconciliation, not just a port. The fork added `src/virtual_column_exec.rs` (5 virtual columns: `rowid`, `snapshot_id`, `file_index`, `filename`, `file_row_number`). Independently, upstream landed `src/row_id.rs` in PR #115, providing a single `rowid` virtual column via a different mechanism (an embedded `_ducklake_internal_row_id` field-id protocol).

We need one cohesive design, not two. The fork's surface is a superset, but upstream's design choice (embedded field-id) is more elegant. The right answer probably involves keeping the upstream mechanism and extending it to cover the remaining four columns.

## Reference branches

- Fork: `ducklake-features/integration` — `src/virtual_column_exec.rs`, `tests/virtual_column_tests.rs` (6 tests, 7-column schema assertions), `tests/virtual_column_extended_tests.rs` (13 tests, multi-file catalogs, contiguity assertions)
- Upstream: `datafusion-contrib/datafusion-ducklake:main` — `src/row_id.rs`, PR #115 discussion

## Scope

1. Read both implementations end-to-end. Spend time on this — do not implement before understanding the design intent of each.
2. Read the DuckLake spec section on virtual columns to confirm which columns are spec-mandated vs nice-to-have.
3. **Propose a unified design** in a comment on this issue before implementing. The likely shape:
   - Keep upstream's embedded-field-id mechanism for `rowid` (it interoperates with the DuckLake spec's row-lineage convention)
   - Add `snapshot_id`, `file_index`, `filename`, `file_row_number` as additional virtual columns under the same mechanism
   - Or, if the mechanisms genuinely cannot unify, document why and pick one cleanly
4. Implement the agreed design. Replace both `src/virtual_column_exec.rs` and the supersedable parts of `src/row_id.rs`.
5. Port over the fork's tests (`virtual_column_tests.rs`, `virtual_column_extended_tests.rs`) — adjust assertions to match the chosen mechanism, but preserve the behavioral coverage (multi-file rowid contiguity, WHERE-clause filtering on virtual columns, NULL handling when `row_id_start`/`snapshot_id` is missing in the catalog).

## Acceptance criteria

- [ ] Design proposal posted as a comment and approved before implementation begins
- [ ] All 5 virtual columns work via a single mechanism
- [ ] Existing upstream `rowid` semantics preserved (no regression for downstream consumers)
- [ ] `tests/virtual_column_tests.rs` and `tests/virtual_column_extended_tests.rs` adapted and passing
- [ ] WHERE-clause pushdown: `SELECT * FROM t WHERE rowid = N` does the right thing (filter pushdown is correct, no full scan when avoidable)
- [ ] Single-partition coercion preserved when `file_row_number` or `rowid` is requested (fork uses `CoalescePartitionsExec` for this)
- [ ] Checked arithmetic for rowid overflow preserved
- [ ] No `duckdb` crate imports

## Dependencies

- Depends on: #12 (foundation)

## Out of scope

- New virtual columns beyond the 5-column union — `_change_type` for CDC views is a separate ticket
- Writeable virtual columns — not a thing in DuckLake

## Notes

- Audit verdict on the fork's `virtual_column_exec.rs`: "solid — correctly forces single partition when row-number-dependent columns are requested via CoalescePartitionsExec, uses checked arithmetic for rowid overflow, handles missing row_id_start/snapshot_id as null."
- The reconciliation matters because **wrong choice here means downstream consumers either break or have to migrate twice.** This is the one ticket where "going slower" is the right answer.
- If reconciliation reveals upstream's embedded-field-id design has a fundamental gap (e.g., can't express `filename` without breaking the embedding), that's evidence the upstream design needs to evolve — raise it as an upstream issue rather than silently working around it.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconcile virtual_column_exec with upstream row_id (DESIGN required first) #22

Context

Reference branches

Scope

Acceptance criteria

Dependencies

Out of scope

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Reconcile virtual_column_exec with upstream row_id (DESIGN required first) #22

Description

Context

Reference branches

Scope

Acceptance criteria

Dependencies

Out of scope

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions