
DeltaLake support optimize output metrics in result #29040

Open
Max-Cheng wants to merge 2 commits into trinodb:master from Max-Cheng:deltalake_optimize_metrics

Conversation

Contributor

@Max-Cheng Max-Cheng commented Apr 8, 2026

Description

Allow distributed table procedures to output metrics in the result. The Delta Lake OPTIMIZE procedure now returns the output below:

trino> ALTER TABLE deltalake.tpch.region EXECUTE optimize;
ALTER TABLE EXECUTE
        metric_name         | metric_value
----------------------------+--------------
 rewritten_data_files_count |            2
 removed_delete_files_count |            0
 added_data_files_count     |            1
(3 rows)

Release notes

## Delta Lake
* Add support for execution metrics while running the `optimize` procedure. ({issue}`28992`)

@cla-bot

cla-bot bot commented Apr 8, 2026

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Ray Cheng.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Apr 8, 2026
@Max-Cheng Max-Cheng changed the title DeltaLake support optimzie output metrics in result DeltaLake support optimize output metrics in result Apr 8, 2026
@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from ae8ceb4 to 1bbdd9b Compare April 8, 2026 07:25
@cla-bot cla-bot bot added the cla-signed label Apr 8, 2026
@Max-Cheng Max-Cheng marked this pull request as ready for review April 8, 2026 07:29
@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from 1bbdd9b to c6e9dfd Compare April 8, 2026 07:29
@Max-Cheng
Contributor Author

@ebyhr OMG, you respond so fast!

@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from c6e9dfd to c544c15 Compare April 8, 2026 08:26
@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from c544c15 to 1eebb67 Compare April 8, 2026 10:56
}

-    private void finishOptimize(ConnectorSession session, DeltaLakeTableExecuteHandle executeHandle, Collection<Slice> fragments, List<Object> splitSourceInfo)
+    private Map<String, Long> finishOptimize(ConnectorSession session, DeltaLakeTableExecuteHandle executeHandle, Collection<Slice> fragments, List<Object> splitSourceInfo)
Contributor

nit: I'd recommend returning OptimizeResult and doing the toMap in the calling context instead.
If you take the suggestion, please do a preparatory commit for IcebergMetadata as well.
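For illustration, a minimal sketch of the shape this suggestion implies: a record carries the counts out of the optimize path, and the caller flattens it into the metrics map. Class and method names here are hypothetical; the record fields mirror the OptimizeResult record quoted later in this thread, and the metric names mirror the output in the PR description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptimizeMetricsSketch
{
    // Mirrors the OptimizeResult record shown later in this review thread
    record OptimizeResult(long rewrittenDataFiles, long removedDeleteFiles, long addedDataFiles) {}

    // Hypothetical helper: the "toMap in the calling context" part of the suggestion.
    // LinkedHashMap keeps the metric ordering stable for display.
    static Map<String, Long> toMetricsMap(OptimizeResult result)
    {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("rewritten_data_files_count", result.rewrittenDataFiles());
        metrics.put("removed_delete_files_count", result.removedDeleteFiles());
        metrics.put("added_data_files_count", result.addedDataFiles());
        return metrics;
    }

    public static void main(String[] args)
    {
        // Values taken from the example output in the PR description
        System.out.println(toMetricsMap(new OptimizeResult(2, 0, 1)));
    }
}
```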

Member

Let's avoid touching unrelated modules in this PR. There is no need to modify IcebergMetadata in my opinion.

Contributor Author

@ebyhr Maybe it's another task?

    for (int i = 0; i < 3; i++) {
        Set<String> initialFiles = getActiveFiles(tableName);
-       computeActual("ALTER TABLE " + tableName + " EXECUTE OPTIMIZE");
+       computeActual(withSingleWriterPerTask(getSession()), "ALTER TABLE " + tableName + " EXECUTE OPTIMIZE");
Contributor

Let's rather do a preparatory commit that makes use of withSingleWriterPerTask to explain that we're aiming for stable results.

Contributor

But this seems unrelated to the change; is it required by this PR?

Contributor Author

@chenjian2664 @findinpath yep, this is an unrelated change to avoid flaky test results; I'll split it out into a separate commit.

Contributor

@findinpath findinpath left a comment

LGTM % nits

        metric_name         | metric_value
----------------------------+--------------
 rewritten_data_files_count |            2
 removed_delete_files_count |            0
Contributor

We can remove it if it is guaranteed to always be 0


MaterializedResult optimizeResult = computeActual(withSingleWriterPerTask(getSession()), "ALTER TABLE " + tableName + " EXECUTE OPTIMIZE");

Map<String, Long> metrics = optimizeResult.getMaterializedRows().stream()
.collect(ImmutableMap.toImmutableMap(
Contributor

static import
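The nit refers to statically importing the collector so the call site reads toImmutableMap(...) without the class prefix. A minimal sketch of the same style using only the JDK, since the PR's actual code collects Trino MaterializedResult rows with Guava's ImmutableMap.toImmutableMap; the Row record here is a hypothetical stand-in for a result row.

```java
// Static import keeps the collect(...) call short, as the review nit asks.
import static java.util.stream.Collectors.toMap;

import java.util.List;
import java.util.Map;

public class StaticImportSketch
{
    // Hypothetical stand-in for a (metric_name, metric_value) result row
    record Row(String name, long value) {}

    public static void main(String[] args)
    {
        List<Row> rows = List.of(
                new Row("rewritten_data_files_count", 2),
                new Row("removed_delete_files_count", 0),
                new Row("added_data_files_count", 1));

        // With the static import this reads toMap(...), not Collectors.toMap(...)
        Map<String, Long> metrics = rows.stream()
                .collect(toMap(Row::name, Row::value));

        System.out.println(metrics.get("added_data_files_count"));
    }
}
```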

@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from 1eebb67 to 3abfa15 Compare April 9, 2026 10:57
@Max-Cheng Max-Cheng force-pushed the deltalake_optimize_metrics branch from 3abfa15 to 5665182 Compare April 9, 2026 10:59
return new OptimizeResult(scannedDataFiles.size(), filesToDelete.size(), dataFileInfos.size());
}

private record OptimizeResult(long rewrittenDataFiles, long removedDeleteFiles, long addedDataFiles)
Member

@ebyhr ebyhr Apr 9, 2026

We don't need this record in this connector. Please remove.

The Iceberg connector uses a record class because its finishOptimize method has two return statements.

}

@Test
public void testOptimizeWithDeletionVectors()
Member

BaseDeltaLakeConnectorSmokeTest is used for running tests on several storages. Please consider reverting changes in this class, and updating TestDeltaLakeConnectorTest instead.



Successfully merging this pull request may close these issues.

Return execution metrics while running the OPTIMIZE command in Delta Lake

4 participants