@sopel39
Member

@sopel39 sopel39 commented Nov 18, 2025

The iceberg.default-new-tables-gc.enabled config property sets the gc.enabled flag for new tables created by Trino.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Honor the `gc.enabled` table property and allow setting its default value for new tables. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 18, 2025
@github-actions github-actions bot added the iceberg Iceberg connector label Nov 18, 2025
@sopel39 sopel39 force-pushed the oss/default_new_tables_gc_enabled branch from 9b10773 to c52b655 Compare November 18, 2025 15:13
iceberg.default-new-tables-gc.enabled will set gc.enabled flag for new
tables created by Trino.
@sopel39 sopel39 force-pushed the oss/default_new_tables_gc_enabled branch from c52b655 to 94dcc5d Compare November 18, 2025 15:59
return defaultNewTablesGcEnabled;
}

@Config("iceberg.default-new-tables-gc.enabled")
Member

iceberg.default-new-tables.gc.enabled


// If user doesn't set gc.enabled, we need to set it to defaultNewTablesGcEnabled value
if (!extraProperties.containsKey(GC_ENABLED) && defaultNewTablesGcEnabled != GC_ENABLED_DEFAULT) {
propertiesBuilder.put(GC_ENABLED, Boolean.toString(defaultNewTablesGcEnabled));
Member

One thing to note about relying on the GC_ENABLED property of Iceberg is that it will also disallow expire_snapshots.
Is that desirable?
I can imagine that some users just want protection on DROP, but with this they can't run the regular maintenance operation of expiring old snapshots.

Member Author

@sopel39 sopel39 Nov 18, 2025

I guess they could enable GC for expire_snapshots. Users often run Spark on the side anyway. Unless there is another property that limits only DROP, I think it's reasonable to expect data admins to know how to handle this situation.

Member

There is a check in the Iceberg library, in org.apache.iceberg.RemoveSnapshots#RemoveSnapshots:

ValidationException.check(
        PropertyUtil.propertyAsBoolean(base.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
        "Cannot expire snapshots: GC is disabled (deleting files may corrupt other tables)");

Using Spark or something else won't help with that.
I'm not sure how an admin would deal with this: adding and removing the property before and after every expire_snapshots is cumbersome and removes the DROP protection for the duration of the command. Also, most people have automated maintenance services, and we wouldn't want them to have to modify those.
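
To make the coupling concrete, here is a minimal, self-contained sketch. It is not the Iceberg or Trino source: the class GcFlagSketch, its helper methods, and the use of IllegalStateException in place of ValidationException are all assumptions made for illustration. It only models the point raised above, that one flag gates both the drop path and snapshot expiration.

```java
import java.util.Map;

// Illustrative stand-in only; names and exception type are assumptions, not library code.
final class GcFlagSketch
{
    static final String GC_ENABLED = "gc.enabled";

    private GcFlagSketch() {}

    // Hypothetical helper with the same shape as a "property as boolean" lookup.
    static boolean propertyAsBoolean(Map<String, String> properties, String key, boolean defaultValue)
    {
        String value = properties.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Modeled on the quoted check: expiring snapshots is rejected when GC is disabled.
    static void expireSnapshots(Map<String, String> properties)
    {
        if (!propertyAsBoolean(properties, GC_ENABLED, true)) {
            throw new IllegalStateException("Cannot expire snapshots: GC is disabled");
        }
    }

    // Modeled on the drop path in this PR: data files are removed only when GC is enabled.
    static boolean deletesDataOnDrop(Map<String, String> properties)
    {
        return propertyAsBoolean(properties, GC_ENABLED, true);
    }

    public static void main(String[] args)
    {
        Map<String, String> protectedTable = Map.of(GC_ENABLED, "false");
        System.out.println("deletes data on drop: " + deletesDataOnDrop(protectedTable));
        try {
            expireSnapshots(protectedTable);
        }
        catch (IllegalStateException e) {
            System.out.println("expire_snapshots rejected: " + e.getMessage());
        }
    }
}
```

In this model, a table that is protected on DROP is also unable to expire snapshots, which is the trade-off under discussion.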

Contributor

It looks like a breaking change to expire_snapshots?

But if another engine disables the gc flag, that is also a problem for Trino: users cannot expire snapshots because of a property that Trino neither supports nor exposes.

.put("iceberg.file-format", format.name())
// Only allow some extra properties. Add "sorted_by" so that we can test that the property is disallowed by the connector explicitly.
.put("iceberg.allowed-extra-properties", "extra.property.one,extra.property.two,extra.property.three,sorted_by")
.put("iceberg.allowed-extra-properties", "extra.property.one,extra.property.two,extra.property.three,sorted_by,gc.enabled")
Member

We use BaseIcebergConnectorSmokeTest or BaseTrinoCatalogTest when we verify catalog behavior. I recommend adding at least one positive test case to BaseIcebergConnectorSmokeTest.

}

@Test
public void testDefaultGcEnabledTablePropertyOnCreateAndCtas() throws IOException
Member

nit: Put throws IOException on a new line. Same for other tests.

return defaultNewTablesGcEnabled;
}

@Config("iceberg.default-new-tables-gc.enabled")
Member

Could you document this property in iceberg.md?

private int planningThreads = Math.min(Runtime.getRuntime().availableProcessors(), 16);
private int fileDeleteThreads = Runtime.getRuntime().availableProcessors() * 2;
private List<String> allowedExtraProperties = ImmutableList.of();
private boolean defaultNewTablesGcEnabled = GC_ENABLED_DEFAULT;
Contributor

The common approach is to avoid referencing values from dependent libraries in the config classes, because a library update may silently change the default configuration of the connector.
Please consider using true instead.

}

@Config("iceberg.default-new-tables-gc.enabled")
@ConfigDescription("Default value for Iceberg property gc.enabled when creating new tables")
Contributor

@findinpath findinpath Nov 19, 2025

Let's also add corresponding documentation in iceberg.md.
I noticed that documentation for the gc.enabled table property (introduced by apache/iceberg#1796) is unfortunately also missing from apache/iceberg. As a reference point you could use apache/iceberg#9231:

Property to disable garbage collection operations such as expiring snapshots or removing orphan files

It seems that in the meantime the table property also affects whether a dropped table's files are deleted.

LOG.warn(e, "Failed to delete table data referenced by metadata");
}
deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, table.location());
if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) {
Contributor

Please replace GC_ENABLED_DEFAULT with true; we have no control over the value of GC_ENABLED_DEFAULT when the Iceberg library is updated.

log.warn(e, "Failed to delete table data referenced by metadata");
}
deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, metastoreTable.getStorage().getLocation());
if (propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) {
Contributor

Suggested change
if (propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) {
if (propertyAsBoolean(metadata.properties(), GC_ENABLED, true)) {

LOG.warn(e, "Failed to delete table data referenced by metadata");
}
deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, table.location());
if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) {
Contributor

Suggested change
if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) {
if (propertyAsBoolean(table.properties(), GC_ENABLED, true)) {

PartitionSpec partitionSpec = parsePartitionFields(schema, getPartitioning(materializedViewProperties));
SortOrder sortOrder = parseSortFields(schema, getSortOrder(materializedViewProperties));
Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false);
Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, GC_ENABLED_DEFAULT);
Contributor

Suggested change
Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, GC_ENABLED_DEFAULT);
Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, true);

String tableLocation = getTableLocation(tableMetadata.getProperties())
.orElseGet(() -> defaultTableLocation(session, tableMetadata.getTable()));
Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false);
Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, GC_ENABLED_DEFAULT);
Contributor

Suggested change
Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, GC_ENABLED_DEFAULT);
Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, true);

}

public static Transaction newCreateTableTransaction(TrinoCatalog catalog, ConnectorTableMetadata tableMetadata, ConnectorSession session, boolean replace, String tableLocation, Predicate<String> allowedExtraProperties)
public static Transaction newCreateTableTransaction(TrinoCatalog catalog, ConnectorTableMetadata tableMetadata, ConnectorSession session, boolean replace, String tableLocation, Predicate<String> allowedExtraProperties, boolean defaultNewTablesGcEnabled)
Contributor

nit: Please separate this long list of parameters on new lines for increased readability.

newDirectExecutorService(),
newDirectExecutorService());
newDirectExecutorService(),
GC_ENABLED_DEFAULT);
Contributor

Suggested change
GC_ENABLED_DEFAULT);
true);

newDirectExecutorService(),
newDirectExecutorService());
newDirectExecutorService(),
GC_ENABLED_DEFAULT);
Contributor

Suggested change
GC_ENABLED_DEFAULT);
true);

@findinpath
Contributor

findinpath commented Nov 19, 2025

iceberg.default-new-tables-gc.enabled will set gc.enabled flag for new tables created by Trino.

Why should the Trino Iceberg connector be proactive in setting the gc.enabled field on the tables created?

Assuming we find a good rationale for the question above, I suggest implementing the config property differently.
Take hive.max-splits-per-second as a reference:


    @Nullable
    public Boolean getDefaultGarbageCollection()
    {
        return defaultGarbageCollection;
    }

    @Config("iceberg.default-garbage-collection")
    @ConfigDescription("Default value for Iceberg property gc.enabled when creating new tables")
    public IcebergConfig setDefaultGarbageCollection(Boolean defaultGarbageCollection)
    {
        this.defaultGarbageCollection = defaultGarbageCollection;
        return this;
    }

Note that iceberg.object-store-layout.enabled, although it applies to new tables, does not have a new-tables token in its property name.

iceberg.default-garbage-collection can be true, false or not specified.
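
A rough sketch of how such a three-state setting could be applied when building table properties follows. The class TriStateGcDefaultSketch and its tableProperties method are hypothetical names for illustration, and Optional<Boolean> stands in for the nullable Boolean getter above; this is not the connector's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: class and method names are assumptions, not part of Trino.
final class TriStateGcDefaultSketch
{
    static final String GC_ENABLED = "gc.enabled";

    private TriStateGcDefaultSketch() {}

    // userProperties: properties given explicitly at table creation.
    // configuredDefault: empty when the catalog config property is not specified.
    static Map<String, String> tableProperties(Map<String, String> userProperties, Optional<Boolean> configuredDefault)
    {
        Map<String, String> result = new HashMap<>(userProperties);
        if (!result.containsKey(GC_ENABLED)) {
            // Write the property only when the administrator configured a value;
            // otherwise leave it unset so the library default applies.
            configuredDefault.ifPresent(value -> result.put(GC_ENABLED, Boolean.toString(value)));
        }
        return result;
    }

    public static void main(String[] args)
    {
        System.out.println(tableProperties(Map.of(), Optional.empty()));
        System.out.println(tableProperties(Map.of(), Optional.of(false)));
        System.out.println(tableProperties(Map.of(GC_ENABLED, "true"), Optional.of(false)));
    }
}
```

With this shape the connector never needs to compare against a library constant: "not specified" means the property is simply not written.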

@sopel39
Member Author

sopel39 commented Nov 19, 2025

Why should the Trino Iceberg connector be proactive in setting the gc.enabled field on the tables created?

It's not proactive. It only sets the property when the config toggle differs from GC_ENABLED_DEFAULT.
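
For readers following along, the behaviour described here can be sketched as below. The class NonDefaultOnlySketch is a hypothetical stand-in, and ASSUMED_LIBRARY_DEFAULT is hard-coded to true as an assumption about the library default; the condition itself mirrors the one in the diff under review.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: mirrors the condition in the diff, with assumed names.
final class NonDefaultOnlySketch
{
    static final String GC_ENABLED = "gc.enabled";
    // Assumption for this sketch: the library default for gc.enabled is true.
    static final boolean ASSUMED_LIBRARY_DEFAULT = true;

    private NonDefaultOnlySketch() {}

    static Map<String, String> tableProperties(Map<String, String> extraProperties, boolean defaultNewTablesGcEnabled)
    {
        Map<String, String> result = new HashMap<>(extraProperties);
        // Write gc.enabled only if the user did not set it and the toggle differs from the default.
        if (!extraProperties.containsKey(GC_ENABLED) && defaultNewTablesGcEnabled != ASSUMED_LIBRARY_DEFAULT) {
            result.put(GC_ENABLED, Boolean.toString(defaultNewTablesGcEnabled));
        }
        return result;
    }
}
```

Under this logic a catalog left at the default produces table metadata with no gc.enabled entry at all.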

assertThat(ctasMetadata.properties().getOrDefault("gc.enabled", "true")).isEqualTo("true");

// collect data files and verify they are removed on drop
var ctasDataFiles = getAllDataFilesFromTableDirectory(ctasTable);
Contributor

Can we avoid using var?

String ctasTableLocation = getTableLocation(ctasTable);
String ctasMetadataLocation = getLatestMetadataLocation(fileSystem, ctasTableLocation);
TableMetadata ctasMetadata = TableMetadataParser.read(FILE_IO_FACTORY.create(fileSystem), ctasMetadataLocation);
assertThat(ctasMetadata.properties().getOrDefault("gc.enabled", "")).isEqualTo("");
Contributor

use isEmpty()

@sopel39
Member Author

sopel39 commented Nov 19, 2025

After discussion with @raunaqmorarka and @ebyhr we figured that gc.enabled might not be the right approach, because it prevents both data deletion on table drop and maintenance operations (e.g. expire snapshots).

Alternatives are:

  1. A Trino-specific property that prevents data deletion on table drop in Trino only, but allows maintenance operations.
  2. An Iceberg-native property that prevents data deletion on table drop across all engines, but allows maintenance operations.
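
As a sketch of what either alternative implies, the snippet below separates drop protection from maintenance. Everything in it is hypothetical: the property name example.delete-data-on-drop is invented for this illustration and does not exist in Trino or Iceberg, and DropProtectionSketch is not connector code.

```java
import java.util.Map;

// Illustrative only: the second property name is invented for this sketch.
final class DropProtectionSketch
{
    static final String GC_ENABLED = "gc.enabled";
    // Hypothetical property; no such property is defined by Trino or Iceberg.
    static final String DELETE_DATA_ON_DROP = "example.delete-data-on-drop";

    private DropProtectionSketch() {}

    private static boolean flag(Map<String, String> properties, String key, boolean defaultValue)
    {
        String value = properties.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Drop deletes data only when both GC and the dedicated drop flag allow it.
    static boolean deletesDataOnDrop(Map<String, String> properties)
    {
        return flag(properties, GC_ENABLED, true) && flag(properties, DELETE_DATA_ON_DROP, true);
    }

    // Maintenance stays governed by gc.enabled alone.
    static boolean allowsExpireSnapshots(Map<String, String> properties)
    {
        return flag(properties, GC_ENABLED, true);
    }

    public static void main(String[] args)
    {
        Map<String, String> protectedTable = Map.of(DELETE_DATA_ON_DROP, "false");
        System.out.println("deletes data on drop: " + deletesDataOnDrop(protectedTable));
        System.out.println("allows expire_snapshots: " + allowsExpireSnapshots(protectedTable));
    }
}
```

The two alternatives differ only in who honors the dedicated flag: Trino alone, or every engine through the Iceberg library.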

Labels

cla-signed iceberg Iceberg connector

5 participants