Add lakehouse connector #25347


Open · wants to merge 1 commit into master from lakehouse

Conversation

@electrum (Member) commented Mar 18, 2025:

Description

The Lakehouse connector unifies the Hive, Iceberg, Delta Lake, and Hudi connectors.

This doesn't yet implement callable procedures or table procedures.
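As a rough sketch of what a catalog configuration might look like (the host name is made up; `hive.metastore.uri` follows the existing Hive connector convention, and any lakehouse-specific properties, such as the default table type from `LakehouseConfig`, are defined by this PR and not shown):

```properties
# Minimal hypothetical catalog file for the new connector.
# The metastore host is made up; hive.metastore.uri follows the
# existing Hive connector convention for a Thrift metastore.
connector.name=lakehouse
hive.metastore.uri=thrift://metastore.example.com:9083
```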

Release notes

(x) Release notes are required, with the following suggested text:

```
## General
* Add Lakehouse connector. ({issue}`25347`)
```

@cla-bot added the cla-signed label on Mar 18, 2025
@github-actions added the docs label on Mar 18, 2025
@electrum requested review from @ebyhr and @dain on Mar 19, 2025
@electrum force-pushed the lakehouse branch 3 times, most recently from eef6588 to 09aae45 on Mar 19, 2025
@github-actions added the iceberg (Iceberg connector) and hive (Hive connector) labels on Mar 19, 2025
@electrum force-pushed the lakehouse branch 3 times, most recently from c0e4fa3 to 29f5c5e on Mar 19, 2025
@sajjoseph (Contributor) commented:

I've been waiting for this feature for a long time. Thanks @electrum for this contribution.
This will eliminate the need to maintain multiple catalogs based on different table formats.

Does the new connector support different partitioning approaches for different table formats?
In other words, can we supply hidden partitioning for Iceberg tables?

@electrum (Member, Author) replied:
Yes, the partitioning and all other features are native per table format.
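For instance, Iceberg hidden partitioning should use the Iceberg connector's existing `partitioning` table property. A hedged sketch (the catalog and schema names are made up, and the `type` table property for selecting the format is an assumption about this PR's table properties):

```sql
-- Hypothetical example: catalog/schema names and the 'type' property
-- are assumptions; 'partitioning' is the Iceberg connector's existing
-- hidden-partitioning syntax.
CREATE TABLE lakehouse.sales.events (
    event_id BIGINT,
    event_time TIMESTAMP(6),
    payload VARCHAR
)
WITH (
    type = 'ICEBERG',
    partitioning = ARRAY['day(event_time)']
);
```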

Review thread on a docs snippet listing the object storage file system pages:

* [](/object-storage/file-system-azure)
* [](/object-storage/file-system-gcs)
* [](/object-storage/file-system-s3)
* [](/object-storage/file-system-hdfs)

A reviewer (Member) commented:

add local as well

@electrum (Member, Author) replied:
We don't have that for other connectors. Should we?

@xkrogen (Member) commented Mar 24, 2025:

Love to see this.

One question for you -- do you envision this supporting catalogs other than HMS-compatible ones in the future? Currently I see it's only Thrift HMS and Glue. If you are envisioning this for HMS-compatible catalogs only, I might name it something more like hive-lakehouse or hms-lakehouse. Lots of lakehouses don't use HMS these days :)

@electrum (Member, Author) commented Mar 25, 2025:

The entire purpose of this is to support multiple table formats that share the same metastore. If you have an Iceberg catalog, then you can use the Iceberg connector directly.

@xkrogen (Member) commented Mar 25, 2025:

@electrum I am thinking of other catalogs supporting multiple table formats. I'm not sure if any others exist today, but you could envision Nessie or Unity Catalog adding support for other table formats beyond their native Iceberg/Delta, for example, though perhaps not the Hive format.

@sajjoseph (Contributor) commented:
@daniel-pan-trino - we can definitely discuss the details. Please reach out to me on Slack.
Thanks!

@electrum (Member, Author) commented:
@xkrogen We don't support any other such formats today, and it seems unlikely that any format would surpass Iceberg or Delta in popularity. More formats means more work for everyone and fragments the community, so it's in our interest to discourage that.

@xkrogen (Member) commented Mar 28, 2025:

> @xkrogen We don't support any other such formats today, and it seems unlikely that any format would surpass Iceberg or Delta in popularity. More formats means more work for everyone and fragments the community, so it's in our interest to discourage that.

I think you misunderstood my comment -- I'm not talking about new table formats. What I'm saying is that today we have, roughly, a situation that looks like this:

  1. There are 3 relevant table formats (Iceberg/Delta/Hudi), or 4 if you count the Hive "table format"
  2. HMS can be used as a catalog that "supports" all 4 of these formats
  3. Other catalogs (AFAIK) only support their native format, i.e. Nessie supports Iceberg, Unity supports Delta, Hudi is its own catalog, etc.

What you're currently proposing, AFAICT, is a way for users leveraging the Hive Metastore to use all 4 table formats seamlessly from a single connector.

Currently, other catalogs are 1:1 with a table format, so this cross-table-format connector isn't necessary. (Though I think there may be an argument to be made that it still adds value from a semantics perspective -- most users that I work with would expect to think of their connector in terms of the catalog (Nessie/Unity/etc) rather than the table format (Iceberg/Delta/etc), and the 'lakehouse' connector helps move this closer to what users perceive.)

What if Nessie were to add support for Delta, or Unity for Iceberg? Now presumably we would want a way to bring the same cross-table-format logic to these catalogs. Now that I think about it, I think the same is true for Starburst and the Galaxy catalog?

In theory, in a lakehouse, the catalog and the table format should be decoupled choices. Right now we have table-format-specific connectors that have customizable catalogs, and this PR seems to be adding a catalog-specific connector (HMS-only) that supports customizable table formats. I am asking about whether we can move slightly further to a more generic lakehouse connector where the catalog is configurable, and the table formats supported are whichever ones are supported by the configured catalog.
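To make that concrete, a purely hypothetical catalog file illustrating the proposed decoupling (none of these lakehouse.catalog.* properties exist in this PR or in Trino today; they only sketch the idea):

```properties
# Purely hypothetical: illustrates a catalog-configurable lakehouse
# connector. These property names do not exist in this PR or in Trino.
connector.name=lakehouse
lakehouse.catalog.type=NESSIE
lakehouse.catalog.uri=https://nessie.example.com/api/v2
# The enabled table formats would then be whichever ones the
# configured catalog supports.
```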

@electrum (Member, Author) commented:
I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim.

Do you have suggestions on a better name? We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc.

From a documentation perspective, users will need to refer to the individual connector page to see how to create tables, what properties apply, etc.

@xkrogen (Member) commented Mar 28, 2025:

> Do you have suggestions on a better name? We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc.

Naming depends on whether we plan for this to be HMS-specific long-term. "lakehouse" is a great name if we intend for it to truly become the one-stop shop for all lakehouse catalogs and table formats. But if it's tied to HMS specifically, then that should be reflected in the name. Many people have lakehouses that don't use HMS!

> I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim.

This is what is a little concerning to me. Why should we use the HMS interface as the "standard" interface for what Trino considers lakehouse catalogs? Yes, it is widely used today, but I'm hesitant to bet on it being the long-term de facto standard for lakehouse catalogs.

That said -- as long as we are conceptually aligned on the lakehouse connector supporting a variety of catalogs, regardless of the underlying implementation (HMS API shims vs. a more native approach), then we can of course evolve this over time.

Review thread on the metadata delegation code:

```java
@Override
public Optional<ConnectorTableHandle> applyPartitioning(ConnectorSession session, ConnectorTableHandle tableHandle, Optional<ConnectorPartitioningHandle> partitioningHandle, List<ColumnHandle> columns)
{
    return forHandle(tableHandle).applyPartitioning(session, tableHandle, partitioningHandle, columns);
}
```
@weijiii (Member) commented Mar 30, 2025:

Should we verify that tableHandle and partitioningHandle are for the same format, as in getCommonPartitioningHandle, or is that not expected to happen because this API is applied only during table scanning? (If I understand the new API correctly.)

@electrum (Member, Author) replied:

Looking at getCommonPartitioningHandle again, the verification seems potentially incorrect, as it might be called for, say, an Iceberg table and a Hive table, in which case they should be treated as having incompatible partitioning.

For this method, the partitioning always comes from the table, so it shouldn't be possible for them to mismatch.
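A minimal sketch of the kind of check being discussed, assuming each format's ConnectorPartitioningHandle has its own concrete type and a dispatch helper analogous to forHandle (here called forPartitioningHandle, a made-up name); this is illustrative, not the PR's actual code:

```java
@Override
public Optional<ConnectorPartitioningHandle> getCommonPartitioningHandle(
        ConnectorSession session,
        ConnectorPartitioningHandle left,
        ConnectorPartitioningHandle right)
{
    // Handles from different table formats (e.g. Hive vs. Iceberg) have
    // different concrete types: treat them as incompatible partitioning
    // instead of delegating to one format's metadata.
    if (!left.getClass().equals(right.getClass())) {
        return Optional.empty();
    }
    return forPartitioningHandle(left).getCommonPartitioningHandle(session, left, right);
}
```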

@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR introduces a new Lakehouse connector that unifies the functionalities of the Hive, Iceberg, Delta Lake, and Hudi connectors. It includes the implementation of core connector classes, module bindings, session/table properties aggregation, updated documentation, and CI configuration changes.

Reviewed Changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSplitManager.java | Implements split management using the underlying connectors. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSessionProperties.java | Aggregates session properties from Hive, Iceberg, Delta Lake, and Hudi connectors. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePlugin.java | Registers the Lakehouse connector via the plugin interface. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSourceProviderFactory.java | Provides page source support based on table handle types. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSinkProvider.java | Implements page sink support for different insert, merge, and execute handles. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseNodePartitioningProvider.java | Offers bucket mapping and partitioning support by delegating to underlying providers. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseModule.java | Configures dependency injection for core Lakehouse connector components. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseIcebergModule.java | Sets up bindings specific to Iceberg integration. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHudiModule.java | Sets up bindings specific to Hudi integration. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHiveModule.java | Configures Hive-specific bindings and integrations. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseFileSystemModule.java | Integrates file system support for the connector. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseDeltaModule.java | Configures Delta Lake-specific components and scheduling. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java | Bootstraps the connector and creates its instance. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnector.java | Implements the main connector logic including lifecycle and transaction management. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConfig.java | Provides configuration options (e.g., default table type) for the connector. |
| docs/src/main/sphinx/connector/lakehouse.md | Documents connector configuration and examples. |
| docs/src/main/sphinx/connector.md | Adds an entry for the Lakehouse connector. |
| .github/workflows/ci.yml | Updates CI to include the Lakehouse connector in test modules. |
Files not reviewed (2):

* core/trino-server/src/main/provisio/trino.xml: Language not supported
* plugin/trino-lakehouse/pom.xml: Language not supported

Comments suppressed due to low confidence (1):

plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java:56

* [nitpick] Consider renaming the use of '_' as the try-with-resources variable name to a more descriptive identifier (e.g., threadContextClassLoader) to avoid potential conflicts with reserved identifiers in newer Java versions.

```java
try (ThreadContextClassLoader _ = new ThreadContextClassLoader(getClass().getClassLoader())) {
```
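The session-property aggregation that the summary attributes to LakehouseSessionProperties is a common Trino pattern; a minimal hypothetical sketch, assuming each underlying connector contributes its list of PropertyMetadata (the constructor wiring is an assumption, not the PR's exact code):

```java
import com.google.common.collect.ImmutableList;
import io.trino.spi.session.PropertyMetadata;

import java.util.List;

// Hypothetical sketch: concatenates the session properties of the four
// underlying connectors so all of them are exposed through one catalog.
public class LakehouseSessionProperties
{
    private final List<PropertyMetadata<?>> sessionProperties;

    public LakehouseSessionProperties(
            List<PropertyMetadata<?>> hiveProperties,
            List<PropertyMetadata<?>> icebergProperties,
            List<PropertyMetadata<?>> deltaProperties,
            List<PropertyMetadata<?>> hudiProperties)
    {
        this.sessionProperties = ImmutableList.<PropertyMetadata<?>>builder()
                .addAll(hiveProperties)
                .addAll(icebergProperties)
                .addAll(deltaProperties)
                .addAll(hudiProperties)
                .build();
    }

    public List<PropertyMetadata<?>> getSessionProperties()
    {
        return sessionProperties;
    }
}
```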

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.

@github-actions added the stale label on Apr 21, 2025
Labels: cla-signed, docs, hive (Hive connector), iceberg (Iceberg connector), stale
5 participants