Description
We’re upgrading from quite an old Trino version to the latest v479 and observed what looks like a regression.
In short: when a `delete from t ...` or `merge into t ...` statement is executed on an Iceberg table partitioned by raw columns with no transformation function, e.g. `array['date_col']`, Trino creates a lot of positional delete files for each data file.
It’s fairly simple to reproduce on the current master with the dev-server (an `iceberg` catalog backed by an actual Hive Metastore and using the native S3 filesystem):
```sql
create table faker.default.source (
    uid uuid not null,
    id integer not null with (min = '0', max = '9')
);

create table iceberg.default.test
with (partitioning = array['id'])
as
select *
from faker.default.source
limit 1000;

delete from iceberg.default.test
where uid in (
    select uid
    from iceberg.default.test
    order by rand()
    limit 300
);
```

Locally, I’m getting 11-23 delete files per partition:
```sql
select partition, count() as count
from iceberg.default."test$files"
where content = 1
group by 1
order by 2 desc;
```

| partition | count |
|---|---|
| {id=7} | 23 |
| {id=3} | 21 |
| {id=4} | 20 |
| {id=0} | 20 |
| {id=8} | 20 |
| {id=2} | 19 |
| {id=1} | 19 |
| {id=5} | 18 |
| {id=9} | 17 |
| {id=6} | 11 |
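
For context, a variant of the same metadata query (relying on the usual `content` encoding of the `$files` table: 0 = data files, 1 = position delete files) makes the ratio visible directly; the query below is just an illustrative sketch, not part of the original reproduction:

```sql
-- delete-to-data file ratio per partition
select
    partition,
    count(*) filter (where content = 0) as data_files,
    count(*) filter (where content = 1) as delete_files
from iceberg.default."test$files"
group by 1
order by delete_files desc;
```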
In one of our production cases, a table with ~3k partitions (`partitioning = array['date_created']`) and 5.9k data files ended up with 5.1M delete files after Trino v479 deleted 8M rows, making the table quite slow to read afterwards.
Switching to any transformation function, e.g. `bucket(id, 10)` in the toy case or `day(date_created)` in production, resolves the problem: Trino then creates 1 delete file per partition/data file.
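
As a sketch of that workaround on the toy example above (the `test_bucketed` table name is just illustrative), the only change is the partitioning spec:

```sql
-- same as the reproduction table, but with a bucket transform instead of identity partitioning
create table iceberg.default.test_bucketed
with (partitioning = array['bucket(id, 10)'])
as
select *
from faker.default.source
limit 1000;
```

The same delete and `$files` queries can then be rerun against `test_bucketed` for comparison.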
Bisecting the git history, I ended up with 7007b462dad as the first bad commit, from #26104, released with Trino v477.
Frankly, I don’t understand a lot here, so the puzzle in my mind is incomplete, but here is what I’ve gathered:
- `IcebergMetadata#getWriteLayout()` has a special case for tables with only `identity()` transformations in the `PartitionSpec`, which returns a `ConnectorTableLayout` instance with its `partitioning` field set to `Optional.empty()` (trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java, lines 1483 to 1486 in b29013d):

  ```java
  if (!forceRepartitioning && partitionSpec.fields().stream().allMatch(field -> field.transform().isIdentity())) {
      // Do not set partitioningHandle, to let engine determine whether to repartition data or not, on stat-based basis.
      return Optional.of(new ConnectorTableLayout(partitioningColumnNames));
  }
  ```

- That `ConnectorTableLayout` is then returned from `IcebergMetadata#getInsertLayout()`, but also `IcebergMetadata#getUpdateLayout()` returns `Optional.empty()`, as it depends on the `partitioning` from `getInsertLayout()` (trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java, lines 3346 to 3352 in b29013d):

  ```java
  public Optional<ConnectorPartitioningHandle> getUpdateLayout(ConnectorSession session, ConnectorTableHandle tableHandle)
  {
      return getInsertLayout(session, tableHandle)
              .flatMap(ConnectorTableLayout::getPartitioning)
              .map(IcebergPartitioningHandle.class::cast)
              .map(IcebergPartitioningHandle::forUpdate);
  }
  ```

- magic happens
- Presumably, the missing update layout ends up causing Trino to create as many concurrent writers for delete files as allowed… or something (a toy sketch of the `Optional` chain follows this list)
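
To make the last bullets concrete, here is a minimal, self-contained toy model of that `Optional` chain (hypothetical `Layout` record and strings, not the real Trino classes): when the insert layout exists but carries no partitioning handle, the whole chain collapses to `Optional.empty()`.

```java
import java.util.Optional;

// Toy model of IcebergMetadata#getUpdateLayout(): if the insert layout has no
// partitioning handle, flatMap short-circuits and no update layout is produced.
public class UpdateLayoutSketch
{
    // Stand-in for ConnectorTableLayout: the partitioning handle is optional.
    record Layout(Optional<String> partitioning) {}

    static Optional<String> getUpdateLayout(Optional<Layout> insertLayout)
    {
        return insertLayout
                .flatMap(Layout::partitioning)           // empty for the identity-only special case
                .map(handle -> handle + " (forUpdate)"); // mirrors IcebergPartitioningHandle::forUpdate
    }

    public static void main(String[] args)
    {
        // Identity-only partition spec: a layout exists, but its partitioning handle is absent.
        System.out.println(getUpdateLayout(Optional.of(new Layout(Optional.empty()))));
        // prints: Optional.empty

        // Non-identity transform, e.g. bucket(id, 10): handle present, update layout derived from it.
        System.out.println(getUpdateLayout(Optional.of(new Layout(Optional.of("IcebergPartitioningHandle")))));
        // prints: Optional[IcebergPartitioningHandle (forUpdate)]
    }
}
```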
The special case for `identity()` partitioning in `IcebergMetadata#getWriteLayout()` looks the most obviously suspicious to me: the comment there states that the “engine can determine whether to repartition data or not, on stat-based basis”, but I don’t think that’s what happens in the case of deletes. And, yes, simply removing this `if` resolves the problem, but performance is hard and I don’t know what that would break.
Given that 7007b462dad, among other things, updates how `bucketCount` is calculated in `LocalExchange`, I initially suspected that change: after it, `bucketCount` for `MergePartitioningHandle` is set to 8192 (the result of `NodePartitioningManager#getDefaultBucketCount()`), while before it was 1 (because `LocalExchange` was calling `NodePartitioningManager#getBucketNodeMap()`). But while writing this up I found that removing the special case in `IcebergMetadata` doesn’t reduce `bucketCount`: it’s still set to 8192 and yet it works. So that was a dead end.
After spending the last 4 days investigating this, I need help to avoid wasting more time learning Trino’s insides the hard way (fun though it is). 🙏