Conversation

Member

@dain dain commented Dec 30, 2025

Description

This PR adds support for Iceberg format v3 deletion vectors in the Trino Iceberg connector.

Read path:

  • Recognizes v3 deletion vectors stored as Puffin deletion-vector-v1 blobs referenced from delete file entries.
  • Applies deletion vectors during reads using the row position column, integrating with the existing delete filtering pipeline.

Write path:

  • Enables row-level operations (DELETE/UPDATE/MERGE) for v3 tables by writing deletion vectors rather than legacy position delete files.
  • Supports v2 → v3 upgrade scenarios by merging existing v2 position delete files into deletion vectors once the table is upgraded.
  • Maintains a single deletion vector per data file and prefers deletion vectors over position delete files when present.

Testing:

  • Adds coverage for writing/reading deletion vectors in v3 tables (including multiple Puffin files mid-stream and convergence).
  • Adds coverage for v2 tables with deletes upgraded to v3, ensuring existing deletes remain effective and new deletes use deletion vectors.
  • Updates v3 “updates blocked” tests (DELETE/UPDATE/MERGE) to validate the operations now succeed.

Release notes

(X) Release notes are required, with the following suggested text:

## Iceberg
* Add support for Iceberg format v3 deletion vectors to enable DELETE/UPDATE/MERGE on v3 tables.

@cla-bot cla-bot bot added the cla-signed label Dec 30, 2025
@github-actions github-actions bot added the iceberg Iceberg connector label Dec 30, 2025
@findepi
Member

findepi commented Dec 30, 2025

Are the build failures in iceberg related?

@dain
Member Author

dain commented Dec 30, 2025

Are the build failures in iceberg related?

Yes. I have a trivial mistake in here. Fixing :D

Member

@ebyhr ebyhr left a comment


The implementation is still broken. DefaultDeletionVectorWriter will throw a duplicate-key error if you run TestIcebergParquetConnectorTest with v3.

@findepi
Member

findepi commented Dec 31, 2025

if you run TestIcebergParquetConnectorTest with v3.

The iceberg module tests are green. Are we missing some test coverage?
@ebyhr what would you recommend testing?

@dain
Member Author

dain commented Dec 31, 2025

I patched the problem. The core issue is that in some cases you get duplicate delete entries for the same file. The fix was trivial, but I'm not sure how you would reliably trigger this in a test at scale. Maybe we should add a copy of the larger smoke test that runs on v3 with the ~35 tests that call optimize disabled. In the long run I expect we may want a hard-coded v2 test as everything moves to v3.

@dain dain force-pushed the deletion-vector branch 2 times, most recently from ee3a9d0 to d6f077e on January 2, 2026 at 22:27
public static final class Builder
{
// key = (int) (pos >>> 32), value bitmap contains (int) pos low bits
private final Int2ObjectMap<RoaringBitmap> deletedRows = new Int2ObjectOpenHashMap<>();
Member


Can we avoid using the Map by using logic similar to org.apache.iceberg.deletes.RoaringPositionBitmap#set?

Member Author


I rewrote this a bunch of times. I don't think it matters much either way at this point, but I'll take a look at going back to an array here also. BTW, I think the Iceberg version is a fork of the RoaringBitmap long-bitmap code.

Member


JFYI there is io.trino.plugin.deltalake.delete.RoaringBitmapArray in delta for a similar use case of storing position deletes in 32-bit roaring bitmaps (better efficiency than 64-bit roaring bitmaps) while still allowing a large positions range that is big enough for practical purposes.
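To make the partitioning scheme under discussion concrete, here is a minimal sketch. The class and method names are hypothetical, and java.util.BitSet stands in for RoaringBitmap, so the sketch assumes the low 32 bits of a position stay below 2^31 (BitSet cannot index a negative int):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the PR's actual class: partition 64-bit row
// positions by their high 32 bits, keeping one small bitmap per partition.
// Partitions with no deletions simply have no map entry, so nothing is
// allocated for empty ranges.
class PartitionedPositions
{
    private final Map<Integer, BitSet> deletedRows = new HashMap<>();

    public void set(long position)
    {
        int high = (int) (position >>> 32); // partition key
        int low = (int) position;           // offset within the partition
        deletedRows.computeIfAbsent(high, key -> new BitSet()).set(low);
    }

    public boolean contains(long position)
    {
        BitSet bitmap = deletedRows.get((int) (position >>> 32));
        return bitmap != null && bitmap.get((int) position);
    }
}
```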

{
for (int key = 0; key < deletedRows.length; key++) {
RoaringBitmap bitmap = deletedRows[key];
if (bitmap != null && !bitmap.isEmpty()) {
Member


I'm a bit confused to see this check in multiple places; how would we end up with a null or empty bitmap?
Could we just eliminate those once in the constructor?

Member Author


The array position is null for any section that does not have a deletion. This avoids having to fill the array with empty bitmaps.

@dain dain force-pushed the deletion-vector branch 6 times, most recently from d562376 to fa2c669 on January 11, 2026 at 02:13
@dain dain changed the title from "Deletion vector" to "[Iceberg v3] Deletion vector" on Jan 11, 2026
}

@Test
void testV2ToV3MigrationWithDeletes()
Contributor


Could you also please add a case for a table with existing equality deletes?

Member Author

@dain dain Jan 13, 2026


I don't think we need this. Equality deletes and position deletes have always been separate, independent systems; I don't think we need a complex test to show that.

@dain
Member Author

dain commented Jan 13, 2026

I responded to or applied all comments.

@dain dain requested a review from raunaqmorarka January 13, 2026 19:14
})
.toList();

deletionVectorWriter.writeDeletionVectors(session, icebergTable, table, deletionVectorInfos, rowDelta);
Member


@dain could you clarify why we're writing the deletion vectors from the coordinator?
I think this can incur reads of previous deletes and require significant resources.
Can the "single deletion vector per data file" requirement not be met if we write them from the worker nodes?
cc: @chenjian2664 @ebyhr

Member Author


I don't think it can be met. I tested this idea before by running the test suite with a requirement that the coordinator only sees one DV per data file, and I ended up with multiple DVs. That said, I think this is the right approach. The DVs are quite small, even for large deletes, and they must be combined into a single DV per data file, along with any preexisting DVs (or position delete files).

This PR works by transporting the DVs from the workers to the coordinator via the fragments. It is possible that we may want to transport these via storage, but the latency cost would be quite high. I considered this and decided we should wait for production feedback.

Another thing we should consider is the cost during the switch-over from v2 position deletes to v3 DVs. This will cause some extra load on coordinators during the transition. IMO the best mitigation here is to add an optimize-deletes table procedure, and I think we should document that as the preferred approach.

Contributor


Could you share in what situations you have encountered multiple deletion vectors for the same data file?

If all delete files are available to the worker while writing the deletion vector, we could merge them into a single DV. Does that approach make sense to you?

Member Author


Sure. Imagine you are deleting from a table that has only one file, and the delete is expressed as "delete any row that exists in another table". The other table is larger, so you get a distributed join, and the rows from that one file will be distributed to every machine.
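The coordinator-side merge described in this thread can be sketched as follows. This is a hypothetical illustration, not the PR's DefaultDeletionVectorWriter: several workers may each report deleted positions for the same data file, and the coordinator folds them into one vector per file. Plain Long sets stand in for roaring bitmaps, and the record and method names are invented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical coordinator-side merge: union the per-worker delete fragments
// so that each data file ends up with exactly one set of deleted positions.
class DeletionVectorMerger
{
    record FileDeletes(String dataFilePath, Set<Long> positions) {}

    static Map<String, Set<Long>> mergePerDataFile(List<FileDeletes> fragments)
    {
        Map<String, Set<Long>> merged = new HashMap<>();
        for (FileDeletes fragment : fragments) {
            // union the positions reported by every worker for this file
            merged.computeIfAbsent(fragment.dataFilePath(), path -> new TreeSet<>())
                    .addAll(fragment.positions());
        }
        return merged;
    }
}
```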


public Optional<DeletionVector> build()
{
if (Arrays.stream(deletedRows).allMatch(bitmap -> bitmap == null || bitmap.isEmpty())) {
Contributor

@chenjian2664 chenjian2664 Jan 15, 2026


How about adding a bitmapCount, so we can check it directly? We could maintain it in getOrCreateBitmap and restore it in deserialize; that would also simplify the serialize method.

Member Author


I skipped this one. I think the current design is good enough.


private static IntConsumer intToLongAdapter(int keyHigh, LongConsumer consumer)
{
return keyLow -> consumer.accept(((long) keyHigh << 32) | (keyLow & 0xFFFFFFFFL));
Contributor


nit: ((long) keyHigh << 32) -> (((long) keyHigh) << 32) to make the intent obvious
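A tiny round-trip sketch of the split/recombine arithmetic, using the explicit parenthesization from the nit above. The helper names are illustrative, not from the PR:

```java
// Illustrative helpers: a 64-bit row position is split into a 32-bit high key
// plus its low 32 bits, then recombined as in intToLongAdapter.
class PositionCodec
{
    static int high(long position)
    {
        return (int) (position >>> 32);
    }

    static int low(long position)
    {
        return (int) position;
    }

    static long recombine(int keyHigh, int keyLow)
    {
        // masking the low part prevents sign extension of negative ints
        return (((long) keyHigh) << 32) | (keyLow & 0xFFFFFFFFL);
    }
}
```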


public Builder addAll(Builder other)
{
for (int i = 0; i < other.deletedRows.length; i++) {
Contributor


Since the keys are ordered, we could iterate in reverse so we don't need to expand the array in getOrCreateBitmap multiple times.
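A hypothetical sketch of that idea: because the incoming keys are sorted, the destination array can be grown once to fit the largest key before filling, instead of resizing inside a per-key getOrCreateBitmap. A long[] of bit masks stands in for the per-key bitmaps, and none of these names come from the PR:

```java
import java.util.Arrays;

// Sketch: size the array once from the largest (last) key, then merge entries.
class GrowOnce
{
    private long[] slots = new long[0];

    void addAll(int[] orderedKeys, long[] masks)
    {
        if (orderedKeys.length == 0) {
            return;
        }
        int maxKey = orderedKeys[orderedKeys.length - 1]; // largest key is last
        if (maxKey >= slots.length) {
            slots = Arrays.copyOf(slots, maxKey + 1); // single resize
        }
        for (int i = 0; i < orderedKeys.length; i++) {
            slots[orderedKeys[i]] |= masks[i]; // merge into any existing entry
        }
    }

    long get(int key)
    {
        return key < slots.length ? slots[key] : 0;
    }
}
```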


public Builder addAll(DeletionVector deletionVector)
{
for (int i = 0; i < deletionVector.deletedRows.length; i++) {
Contributor


nit: same as above, we could iterate in reverse

import io.airlift.json.JsonCodec;
import io.airlift.log.Logger;
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;
Member


When do we update OPTIMIZE_MAX_SUPPORTED_TABLE_VERSION and CLEANING_UP_PROCEDURES_MAX_SUPPORTED_TABLE_VERSION to 3? After merging row lineage PR?

Member Author


Yes, it is in the row lineage PR.

@dain dain force-pushed the deletion-vector branch 2 times, most recently from b37b736 to 68da3c0 on January 16, 2026 at 05:52
dain added 3 commits January 15, 2026 22:29
Refactors the existing v2 position delete code to use DeletionVector
instead of directly using RoaringBitmap. This simplifies the code and
prepares for v3 deletion vector support.
@dain dain merged commit 0eb94ac into trinodb:master Jan 16, 2026
54 of 56 checks passed
@dain dain deleted the deletion-vector branch January 16, 2026 07:34
@github-actions github-actions bot added this to the 480 milestone Jan 16, 2026
toPartitionData(partitionSpec, schema, task.partitionDataJson()));
})
.toList();

Contributor


@dain

The outcome of this PR is that we have a single Puffin file per snapshot that contains blobs for all deletes in the snapshot.

Would it make sense to write temporarily smaller delete files on the workers and consolidate them into one (if needed) on the coordinator side (deleting the small DV files) in IcebergMetadata.finishWrite?

This approach would address the concern of having small delete files scattered in the metadata and would still keep the functionality that this PR provides, at the expense of potentially doing additional temporary writes to storage. The concern of putting memory pressure on the coordinator would be addressed as well.
