Skip to content

Fix incorrect data when flushing inlined deletes with sorting#1166

Open
Alex-Monahan wants to merge 5 commits into
duckdb:v1.5-variegatafrom
Alex-Monahan:inline-flush-sort-deletes-minimal
Open

Fix incorrect data when flushing inlined deletes with sorting#1166
Alex-Monahan wants to merge 5 commits into
duckdb:v1.5-variegatafrom
Alex-Monahan:inline-flush-sort-deletes-minimal

Conversation

@Alex-Monahan
Copy link
Copy Markdown
Contributor

This PR fixes #999 .

However, there is a previously silently failing condition that this now causes to error, so it is a FIXME for a follow up PR. It will fail if a single transaction contains:

  1. A CREATE MACRO for a DuckLake macro
  2. ASET SORTED BY that uses that macro
  3. An inline flush will fail with an error message that the function is missing.

This is because the connection to the metadata DB does not have the user's DuckLake transaction context where that macro exists.

I wanted to split it apart since the PR was growing in scope and I think I need some help on how to cleanly handle within-transaction DuckLake macros in this case. This PR at least removes the silent corruption issue and throws an error in the very specific case above instead of a silent failure.

Alex-Monahan and others added 4 commits May 15, 2026 15:41
…kdb#999)

Issue: ducklake_flush_inlined_data marks the wrong rows as deleted when
the table's SET SORTED BY clause uses an expression (e.g. (id + 0) ASC)
and inlined deletes from prior UPDATEs are pending. The data file is
written sorted by the user expression, but the deletes-position query
falls back to ordering by row_id, begin_snapshot only -- so positions
diverge from the file's order and the wrong rows get marked deleted.

Two minimal cooperating fixes:

1. BuildSortOrderSQL (src/storage/ducklake_sort_data.cpp): replace the
   non-simple-column short-circuit with parse -> optionally rewrite
   ColumnRef leaves via the rename map -> ToString. Non-bare-column
   expressions now land in the deletes-position query's ORDER BY,
   matching the file's sort. Parser::ParseExpressionList rejects
   malformed input; ParsedExpression::ToString is the canonical
   re-serializer.

2. Tiebreakers in InsertSort (src/functions/ducklake_compaction_functions.*
   + src/functions/ducklake_flush_inlined_data.cpp): InsertSort gains an
   optional add_tiebreakers parameter. When true, scans the underlying
   LogicalGet's column_ids for COLUMN_IDENTIFIER_ROW_ID and
   COLUMN_IDENTIFIER_SNAPSHOT_ID symbolic identifiers, builds
   BoundColumnRefExpressions at the corresponding plan binding positions,
   and appends them as ASC NULLS LAST tiebreakers. Position-by-scan
   (not by latest_table.PhysicalColumnCount()) handles ADD COLUMN in
   the same transaction as the flush. The deletes-position query is
   tightened to use explicit ASC NULLS LAST for the tiebreakers so the
   user's default_null_order setting cannot desync them from the file.
   The flush call site passes add_tiebreakers=true; compaction passes
   nothing (default false).

Out of scope (intentionally): DuckLake-stored macros referenced in sort
expressions. With this change such expressions are emitted verbatim
into the deletes-position query and the metadata connection's binder
errors at flush time with "Catalog Error: Scalar Function with name X
does not exist!". This is strictly better than v1.5's silent wrong
delete positions, but it is a behavior change for the existing test
data_inlining_flush_sorted_macro_expression_transaction.test (the only
v1.5 test exercising this path with no UPDATEs, where the silent
short-circuit produced "passing" results coincidentally). That test
is updated to expect the clear error and roll back. Proper support
for macros in sort expressions would require cross-ClientContext
catalog visibility plumbing and is a separate concern.

New tests:
- data_inlining_flush_sorted_expression_deletes_999.test
- data_inlining_flush_sorted_expression_deletes_999_transaction.test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Alex-Monahan Alex-Monahan marked this pull request as ready for review May 15, 2026 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant