Description
We’re upgrading from quite an old Trino version to the latest v479 and observed what looks like a regression.
In short: when a `delete from t ...` or `merge into t ...` statement is executed on an Iceberg table partitioned by raw columns with no transformation function, e.g. `array['date_col']`, Trino creates a lot of positional delete files for each data file.
It’s fairly simple to reproduce on the current master with the dev-server (an `iceberg` catalog backed by an actual Hive Metastore and using the native S3 filesystem):
```sql
create table faker.default.source (
    uid uuid not null,
    id integer not null with (min = '0', max = '9')
);

create table iceberg.default.test
with (partitioning = array['id'])
as
select *
from faker.default.source
limit 1000;

delete from iceberg.default.test
where uid in (
    select uid
    from iceberg.default.test
    order by rand()
    limit 300
);
```

Locally, I’m getting 11-23 delete files per partition:
```sql
select partition, count() as count
from iceberg.default."test$files"
where content = 1
group by 1
order by 2 desc;
```

| partition | count |
|---|---|
| {id=7} | 23 |
| {id=3} | 21 |
| {id=4} | 20 |
| {id=0} | 20 |
| {id=8} | 20 |
| {id=2} | 19 |
| {id=1} | 19 |
| {id=5} | 18 |
| {id=9} | 17 |
| {id=6} | 11 |
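
For context, a variant of the same metadata query (relying on the usual `content` encoding of the `$files` table: 0 = data files, 1 = position delete files) makes the ratio visible directly; the query below is just an illustrative sketch, not part of the original reproduction:

```sql
-- delete-to-data file ratio per partition
select
    partition,
    count(*) filter (where content = 0) as data_files,
    count(*) filter (where content = 1) as delete_files
from iceberg.default."test$files"
group by 1
order by delete_files desc;
```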
In one of our production cases, a table with ~3k partitions (`partitioning = array['date_created']`) and 5.9k data files ended up with 5.1M delete files after Trino v479 deleted 8M rows, making the table quite slow to read afterwards.
Switching to any transformation function, e.g. `bucket(id, 10)` in the toy case or `day(date_created)` in production, resolves the problem: Trino then creates 1 delete file per partition/data file.
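
As a sketch of that workaround on the toy example above (the `test_bucketed` table name is just illustrative), the only change is the partitioning spec:

```sql
-- same as the reproduction table, but with a bucket transform instead of identity partitioning
create table iceberg.default.test_bucketed
with (partitioning = array['bucket(id, 10)'])
as
select *
from faker.default.source
limit 1000;
```

The same delete and `$files` queries can then be rerun against `test_bucketed` for comparison.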
Bisecting the git history, I ended up with 7007b462dad as the first bad commit, from #26104, released with Trino v477.
Frankly, I don’t understand a lot here, so the puzzle in my mind is incomplete, but here is what I’ve gathered:
- `IcebergMetadata#getWriteLayout()` has a special case for tables with only `identity()` transformations in the `PartitionSpec`, which returns a `ConnectorTableLayout` instance with its `partitioning` field set to `Optional.empty()` (trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java, lines 1483 to 1486 in b29013d):

  ```java
  if (!forceRepartitioning && partitionSpec.fields().stream().allMatch(field -> field.transform().isIdentity())) {
      // Do not set partitioningHandle, to let engine determine whether to repartition data or not, on stat-based basis.
      return Optional.of(new ConnectorTableLayout(partitioningColumnNames));
  }
  ```

- That `ConnectorTableLayout` is then returned from `IcebergMetadata#getInsertLayout()`, but also `IcebergMetadata#getUpdateLayout()` returns `Optional.empty()`, as it depends on the `partitioning` from `getInsertLayout()` (trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java, lines 3346 to 3352 in b29013d):

  ```java
  public Optional<ConnectorPartitioningHandle> getUpdateLayout(ConnectorSession session, ConnectorTableHandle tableHandle)
  {
      return getInsertLayout(session, tableHandle)
              .flatMap(ConnectorTableLayout::getPartitioning)
              .map(IcebergPartitioningHandle.class::cast)
              .map(IcebergPartitioningHandle::forUpdate);
  }
  ```

- magic happens
- Presumably, the missing update layout ends up causing Trino to create as many concurrent writers for delete files as allowed… or something (a toy sketch of the `Optional` chain follows this list)
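
To make the last bullets concrete, here is a minimal, self-contained toy model of that `Optional` chain (hypothetical `Layout` record and strings, not the real Trino classes): when the insert layout exists but carries no partitioning handle, the whole chain collapses to `Optional.empty()`.

```java
import java.util.Optional;

// Toy model of IcebergMetadata#getUpdateLayout(): if the insert layout has no
// partitioning handle, flatMap short-circuits and no update layout is produced.
public class UpdateLayoutSketch
{
    // Stand-in for ConnectorTableLayout: the partitioning handle is optional.
    record Layout(Optional<String> partitioning) {}

    static Optional<String> getUpdateLayout(Optional<Layout> insertLayout)
    {
        return insertLayout
                .flatMap(Layout::partitioning)           // empty for the identity-only special case
                .map(handle -> handle + " (forUpdate)"); // mirrors IcebergPartitioningHandle::forUpdate
    }

    public static void main(String[] args)
    {
        // Identity-only partition spec: a layout exists, but its partitioning handle is absent.
        System.out.println(getUpdateLayout(Optional.of(new Layout(Optional.empty()))));
        // prints: Optional.empty

        // Non-identity transform, e.g. bucket(id, 10): handle present, update layout derived from it.
        System.out.println(getUpdateLayout(Optional.of(new Layout(Optional.of("IcebergPartitioningHandle")))));
        // prints: Optional[IcebergPartitioningHandle (forUpdate)]
    }
}
```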
The special case for `identity()` partitioning in `IcebergMetadata#getWriteLayout()` looks the most obviously suspicious to me: the comment there states that the “engine can determine whether to repartition data or not, on stat-based basis”, but I don’t think that’s what happens in the case of deletes. And, yes, simply removing this `if` resolves the problem, but performance is hard and I don’t know what that would break.
Given that 7007b462dad, among other things, updates how `bucketCount` is calculated in `LocalExchange`, I initially suspected that change: after it, `bucketCount` for `MergePartitioningHandle` is set to 8192 (the result of `NodePartitioningManager#getDefaultBucketCount()`), while before it was 1 (because `LocalExchange` was calling `NodePartitioningManager#getBucketNodeMap()`). But while writing this up I found that removing the special case in `IcebergMetadata` doesn’t reduce `bucketCount`: it’s still set to 8192 and yet it works. So that was a dead end.
After spending the last 4 days investigating this, I need help to avoid wasting more time learning Trino’s insides the hard way (fun though it is). 🙏