Skip to content

Add fields for $files table delete file deduplication#28933

Open
kaveti wants to merge 2 commits intotrinodb:masterfrom
kaveti:files-table-dedup
Open

Add fields for $files table delete file deduplication#28933
kaveti wants to merge 2 commits intotrinodb:masterfrom
kaveti:files-table-dedup

Conversation

@kaveti
Copy link
Copy Markdown
Contributor

@kaveti kaveti commented Mar 30, 2026

Add testFilesTableDeleteFileDeduplication to BaseIcebergSystemTables that verifies the $files table shows each delete file exactly once, with no duplicate entries from FileScanTask expansion.

Follow-up to #28911 as requested by findinpath.

New Fields

    • content_offset
    • BIGINT
    • The offset in the file where the content starts. Applicable for deletion
      vectors stored in shared Puffin files.
    • content_size_in_bytes
    • BIGINT
    • The length of referenced content stored in the file. Applicable for deletion
      vectors stored in shared Puffin files.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 30, 2026
@github-actions github-actions bot added the iceberg Iceberg connector label Mar 30, 2026
@kaveti kaveti force-pushed the files-table-dedup branch 2 times, most recently from 3174be5 to 2399782 Compare March 31, 2026 02:25
pageBuilder.declarePosition();
long start = System.nanoTime();
ContentFile<?> contentFile = contentIterator.next();
long start = System.nanoTime();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the unrelated change

}

@Test
public void testFilesTableDeleteFileDeduplication()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the test in v3?
I initially thought the request was to include offset and length to describe the delete file, so it would be good to have a test that demonstrates we currently don't support this
cc @findinpath

@kaveti kaveti force-pushed the files-table-dedup branch 3 times, most recently from fa89958 to a2e9b9e Compare April 2, 2026 07:13
public static final String EQUALITY_IDS_COLUMN_NAME = "equality_ids";
public static final String SORT_ORDER_ID_COLUMN_NAME = "sort_order_id";
public static final String CONTENT_OFFSET_COLUMN_NAME = "content_offset";
public static final String CONTENT_SIZE_IN_BYTES_COLUMN_NAME = "content_size_in_bytes";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update $files section in iceberg.md.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done @ebyhr

.map(row -> (String) row.getField(0)))
.doesNotContain("content_offset", "content_size_in_bytes");
// Each DV entry has distinct content_offset within the shared Puffin file
assertQuery(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertQuery isn't recommend. Use assertThat(query(...)) instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done @ebyhr

kaveti added 2 commits April 7, 2026 11:10
Add testFilesTableDeleteFileDeduplication to BaseIcebergSystemTables
that verifies the $files table shows each delete file exactly once,
with no duplicate entries (v2 position + equality deletes).

Add testFilesTableDeletionVectors that verifies v3 deletion vector
behavior: multiple DV entries share the same Puffin file_path in the
$files table. Currently there are no content_offset/content_size_in_bytes
columns to distinguish individual DVs within the shared Puffin file.

Follow-up to trinodb#28911 as requested by findinpath.
@kaveti kaveti force-pushed the files-table-dedup branch from a2e9b9e to 499496c Compare April 7, 2026 05:58
@github-actions github-actions bot added the docs label Apr 7, 2026
@kaveti kaveti changed the title Add test for $files table delete file deduplication Add fields for $files table delete file deduplication Apr 9, 2026
@oskar-szwajkowski
Copy link
Copy Markdown
Contributor

Could we move forward with: #29044 ?

It adds the same fields + some more, how do folks feel about it?

@kaveti
Copy link
Copy Markdown
Contributor Author

kaveti commented Apr 10, 2026

@oskar-szwajkowski we can break it two PRs or have it in one & close this one . i'm good with anything

@oskar-szwajkowski
Copy link
Copy Markdown
Contributor

lets see how @chenjian2664 sees it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

4 participants