Add fields for $files table delete file deduplication #28933
kaveti wants to merge 2 commits into trinodb:master from
3174be5 to 2399782
pageBuilder.declarePosition();
long start = System.nanoTime();
ContentFile<?> contentFile = contentIterator.next();
long start = System.nanoTime();
Revert the unrelated change
}

@Test
public void testFilesTableDeleteFileDeduplication()
Could you add the test in v3?
I initially thought the request was to include offset and length to describe the delete file, so it would be good to have a test demonstrating that we currently don't support this.
cc @findinpath
fa89958 to a2e9b9e
public static final String EQUALITY_IDS_COLUMN_NAME = "equality_ids";
public static final String SORT_ORDER_ID_COLUMN_NAME = "sort_order_id";
public static final String CONTENT_OFFSET_COLUMN_NAME = "content_offset";
public static final String CONTENT_SIZE_IN_BYTES_COLUMN_NAME = "content_size_in_bytes";
Please update the $files section in iceberg.md.
.map(row -> (String) row.getField(0)))
.doesNotContain("content_offset", "content_size_in_bytes");
// Each DV entry has distinct content_offset within the shared Puffin file
assertQuery(
assertQuery isn't recommended. Use assertThat(query(...)) instead.
Add testFilesTableDeleteFileDeduplication to BaseIcebergSystemTables that verifies the $files table shows each delete file exactly once, with no duplicate entries (v2 position + equality deletes).

Add testFilesTableDeletionVectors that verifies v3 deletion vector behavior: multiple DV entries share the same Puffin file_path in the $files table. Currently there are no content_offset/content_size_in_bytes columns to distinguish individual DVs within the shared Puffin file.

Follow-up to trinodb#28911 as requested by findinpath.
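To illustrate why file_path alone cannot tell deletion vectors apart, here is a minimal sketch in plain Java. It is not Trino or Iceberg code: the `DvEntry` record, the helper names, the path, and the offset and size values are all made up for illustration. It only assumes, as the commit message states, that several DV entries can point at the same Puffin file.

```java
import java.util.List;

public class DeletionVectorKeySketch {
    // Hypothetical row shape; the field names mirror the content_offset and
    // content_size_in_bytes columns discussed in this PR.
    record DvEntry(String filePath, long contentOffset, long contentSizeInBytes) {}

    // Number of entries that remain distinguishable when keyed by file_path only.
    static long distinctByPath(List<DvEntry> entries) {
        return entries.stream().map(DvEntry::filePath).distinct().count();
    }

    // Number of entries that remain distinguishable when keyed by file_path plus content_offset.
    static long distinctByPathAndOffset(List<DvEntry> entries) {
        return entries.stream()
                .map(entry -> entry.filePath() + "@" + entry.contentOffset())
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        // Two illustrative DVs stored in one shared Puffin file (values are invented).
        List<DvEntry> entries = List.of(
                new DvEntry("deletes/shared.puffin", 4L, 56L),
                new DvEntry("deletes/shared.puffin", 60L, 48L));
        System.out.println("distinct by path: " + distinctByPath(entries));
        System.out.println("distinct by path and offset: " + distinctByPathAndOffset(entries));
    }
}
```

With the invented values above, keying by path collapses both entries into one, while adding the offset keeps them separate.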
a2e9b9e to 499496c
Could we move forward with #29044? It adds the same fields plus some more. How do folks feel about it?

@oskar-szwajkowski We can split it into two PRs, or keep it in one and close this one. I'm good with either.

Let's see how @chenjian2664 sees it.
Add testFilesTableDeleteFileDeduplication to BaseIcebergSystemTables that verifies the $files table shows each delete file exactly once, with no duplicate entries from FileScanTask expansion.
Follow-up to #28911 as requested by findinpath.
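The duplication being tested for can be sketched in plain Java as follows. This is a simplified model, not the connector's implementation: `ScanTask` is a hypothetical stand-in for a scan task that carries one data file and the delete files applying to it, and the file names are invented. The point is that a delete file attached to several tasks shows up once per task unless entries are deduplicated.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DeleteFileDedupSketch {
    // Hypothetical stand-in for a scan task: one data file plus its applicable delete files.
    record ScanTask(String dataFilePath, List<String> deleteFilePaths) {}

    // Collect every delete file path exactly once, preserving first-seen order.
    static List<String> distinctDeleteFiles(List<ScanTask> tasks) {
        Set<String> seen = new LinkedHashSet<>();
        for (ScanTask task : tasks) {
            seen.addAll(task.deleteFilePaths());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // The same equality delete file applies to both data files (names are invented).
        List<ScanTask> tasks = List.of(
                new ScanTask("data-1.parquet", List.of("pos-delete-1.parquet", "eq-delete-1.parquet")),
                new ScanTask("data-2.parquet", List.of("eq-delete-1.parquet")));
        System.out.println(distinctDeleteFiles(tasks));
    }
}
```

A `LinkedHashSet` is used here only so the output order is stable; any set keyed on the delete file path would remove the duplicates.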
New Fields
| Field | Type | Description |
| --- | --- | --- |
| content_offset | BIGINT | … vectors stored in shared Puffin files. |
| content_size_in_bytes | BIGINT | … vectors stored in shared Puffin files. |
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: