-
Notifications
You must be signed in to change notification settings - Fork 3.4k
[Iceberg v3] Deletion vector #27788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Iceberg v3] Deletion vector #27788
Conversation
|
Are the build failures in iceberg related? |
Yes. I have a tivial mistake in here. fixing :D |
e9c66e6 to
24fda38
Compare
ebyhr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is still broken. DefaultDeletionVectorWriter will throw duplicate key error if you run TestIcebergParquetConnectorTest with v3.
iceberg module test is green. Do we miss some test coverage? |
24fda38 to
4b76aa3
Compare
|
I patched the problem. The core issue is in some cases you get duplicate delete entries for the same file. The fix was trivial, but I'm not sure how you would test this reliabily at scale to trigger it. Maybe we should add a copy of the larger smoke test that runs on v3 with the ~35 tests that call optimize disabled. In the long run I expect we may want a hard coded v2 test as everything moves to v3 |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteFilter.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
ee3a9d0 to
d6f077e
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
| public static final class Builder | ||
| { | ||
| // key = (int) (pos >>> 32), value bitmap contains (int) pos low bits | ||
| private final Int2ObjectMap<RoaringBitmap> deletedRows = new Int2ObjectOpenHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid using the Map by using similar logic as org.apache.iceberg.deletes.RoaringPositionBitmap#set ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote this a bunch of times. I don't think it matters either way at this point, but I'll take a look at going back to an array here also. BTW, I think the iceberg version is a fork of the roaring bitmap long bitmap code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JFYI there is io.trino.plugin.deltalake.delete.RoaringBitmapArray in delta for a similar use case of storing position deletes in 32-bit roaring bitmaps (better efficiency than 64-bit roaring bitmaps) while still allowing a large positions range that is big enough for practical purposes.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
| { | ||
| for (int key = 0; key < deletedRows.length; key++) { | ||
| RoaringBitmap bitmap = deletedRows[key]; | ||
| if (bitmap != null && !bitmap.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused to see this check in multiple places, how would we end up with null or empty bitmap ?
Could we just eliminate those once in constructor ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array position is null for any section that does not have a deletion. This avoids having to fill the array with empty bitmaps.
d562376 to
fa2c669
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeleteManager.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVectorWriter.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
| } | ||
|
|
||
| @Test | ||
| void testV2ToV3MigrationWithDeletes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a case for the table existing equality deletes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this. equality deletes and positions delets have always been separate independent systems. I don't think we don't need to write a complex test to show that.
|
I responded to or applied all comments. |
| }) | ||
| .toList(); | ||
|
|
||
| deletionVectorWriter.writeDeletionVectors(session, icebergTable, table, deletionVectorInfos, rowDelta); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dain could you clarify why we're writing the deletion vectors from the coordinator ?
I think this can incur reads of previous deletes and require significant resources.
Can the "single deletion vector per data file" requirement not be met if we write this from the worker nodes ?
cc: @chenjian2664 @ebyhr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it can be met. I tested this idea before by running the test suite with a requirement that we only get one DV on the coordinator pre data file, and I ended up with multiple DVs. That said, I think this is the right approach. The DVs are quite small, even for large deletes. They must be combined into a single DV per datafile, along with the any preexisting DVs (or position delete files).
This PR works by transporting the DVs from the workers to the coordinator via the fragments. It is possible that we may want to transport these via storage, but the latency cost would be quite high. I condiered this and decided we should wait for production feedback.
Another, thing we should consider is the cost during switch over from v2 position deletes, to v3 DV. This will cause some more load on coordinators during the transitions. IMO the best mitigation here is to add an optimize deletes table prodcedure. I also think we should document that as the preferred approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you share in what situations you have encountered multiple deletion vectors for the same data file?
If all delete files are available to the worker while writing the deletion vector, we could merge them into a single DV. Does that approach make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Imagine you are deletting from a table that has only one file. The delete is written as delete any row that exists in another table. The other table is larger so you get a distributed join. The rows from that one file will be distributed to every machine.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
|
|
||
| public Optional<DeletionVector> build() | ||
| { | ||
| if (Arrays.stream(deletedRows).allMatch(bitmap -> bitmap == null || bitmap.isEmpty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about add a bitmapCount, so we can use it check directly and we can maintain it in the method getOrCreateBitmap and get it in deserialize, also will simplify the serialize method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skipped this one. I thinkt he current design is good enoug
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
|
|
||
| private static IntConsumer intToLongAdapter(int keyHigh, LongConsumer consumer) | ||
| { | ||
| return keyLow -> consumer.accept(((long) keyHigh << 32) | (keyLow & 0xFFFFFFFFL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ((long) keyHigh << 32) -> (((long) keyHigh) << 32)
make intent obvious
|
|
||
| public Builder addAll(Builder other) | ||
| { | ||
| for (int i = 0; i < other.deletedRows.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the key is ordered, we could do it in a reverse way so we don't needs to expands the array in getOrCreateBitmap multi times
|
|
||
| public Builder addAll(DeletionVector deletionVector) | ||
| { | ||
| for (int i = 0; i < deletionVector.deletedRows.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same as blow, we could do it in a reverse way
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/PositionDeleteReader.java
Show resolved
Hide resolved
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVector.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeleteManager.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/delete/TestDeletionVector.java
Show resolved
Hide resolved
...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java
Show resolved
Hide resolved
| import io.airlift.json.JsonCodec; | ||
| import io.airlift.log.Logger; | ||
| import io.airlift.slice.Slice; | ||
| import io.airlift.slice.Slices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we update OPTIMIZE_MAX_SUPPORTED_TABLE_VERSION and CLEANING_UP_PROCEDURES_MAX_SUPPORTED_TABLE_VERSION to 3? After merging row lineage PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is in the row lineage PR.
.../trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DefaultDeletionVectorWriter.java
Show resolved
Hide resolved
b37b736 to
68da3c0
Compare
Refactors the existing v2 position delete code to use DeletionVector instead of directly using RoaringBitmap. This simplifies the code and prepares for v3 deletion vector support.
| toPartitionData(partitionSpec, schema, task.partitionDataJson())); | ||
| }) | ||
| .toList(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outcome of this PR is that have a single puffin file per snapshot that contains blobs for all deletes in the snapshot.
Would it make senses to write temporarily smaller delete files on the workers and consolidate them into one (if needed) on the coordinator side (and delete the small DV files) in IcebergMetadata.finishWrite ?
This approach would solve the concerns of having small delete files scattered in the metadata and would still keep the functionality that this PR provides at the expense of potentially doing additional temporary writes on the storage.
Also concerns of putting memory pressure on the coordinator would addressed as well.
Description
This PR adds support for Iceberg format v3 deletion vectors in the Trino Iceberg connector.
Read path:
deletion-vector-v1blobs referenced from delete file entries.Write path:
Testing:
Release notes
(X) Release notes are required, with the following suggested text: