Add lakehouse connector #25347


Merged
merged 1 commit into trinodb:master from lakehouse on Jun 25, 2025

Conversation

@electrum (Member) commented Mar 18, 2025

Description

The Lakehouse connector unifies the Hive, Iceberg, Delta Lake and Hudi connectors.

This doesn't yet implement callable procedures or table procedures.

Release notes

(x) Release notes are required, with the following suggested text:

## General
* Add Lakehouse connector. ({issue}`25347`)

@cla-bot cla-bot bot added the cla-signed label Mar 18, 2025
@github-actions github-actions bot added the docs label Mar 18, 2025
@electrum electrum requested review from ebyhr and dain March 19, 2025 03:54
@electrum electrum force-pushed the lakehouse branch 3 times, most recently from eef6588 to 09aae45 Compare March 19, 2025 06:26
@github-actions github-actions bot added iceberg Iceberg connector hive Hive connector labels Mar 19, 2025
@electrum electrum force-pushed the lakehouse branch 3 times, most recently from c0e4fa3 to 29f5c5e Compare March 19, 2025 06:54
@sajjoseph (Contributor)

I've been waiting for this feature for a long time. Thanks @electrum for this contribution.
This will eliminate the need to maintain multiple catalogs based on different table formats.

Does the new connector support different partitioning approaches for different table formats?
In other words, can we supply hidden partitioning for Iceberg tables?

@electrum (Member, Author)

Yes, the partitioning and all other features are native per table format.

@xkrogen (Member) commented Mar 24, 2025

Love to see this.

One question for you -- do you envision this supporting other catalogs besides HMS-compatible catalogs in the future? Currently I see it's only Thrift HMS and Glue. If you are envisioning this only for HMS-compatible catalogs, I might name it something more like hive-lakehouse or hms-lakehouse. Lots of lakehouses don't use HMS these days :)

@electrum (Member, Author) commented Mar 25, 2025

The entire purpose of this is to support multiple table formats that share the same metastore. If you have an Iceberg catalog, then you can use the Iceberg connector directly.

@xkrogen (Member) commented Mar 25, 2025

@electrum I am thinking of other catalogs supporting multiple table formats. I'm not sure if any others exist today, but you could envision Nessie or Unity Catalog adding support for other table formats beyond their native Iceberg/Delta, for example, though perhaps not the Hive format.

@sajjoseph (Contributor)

@daniel-pan-trino - we can definitely discuss the details. Please reach out to me on Slack.
Thanks!

@electrum (Member, Author)

@xkrogen We don't support any other such formats today, and it seems unlikely any format would surpass Iceberg or Delta in popularity. More formats mean more work for everyone and fragment the community, so it's in our interest to discourage that.

@xkrogen (Member) commented Mar 28, 2025

> @xkrogen We don't support any other such formats today, and it seems unlikely any format would surpass Iceberg or Delta in popularity. More formats mean more work for everyone and fragment the community, so it's in our interest to discourage that.

I think you misunderstood my comment -- I'm not talking about new table formats. What I'm saying is that today we have, roughly, a situation that looks like this:

  1. There are 3 relevant table formats (Iceberg/Delta/Hudi), or 4 if you count the Hive "table format"
  2. HMS can be used as a catalog that "supports" all 4 of these formats
  3. Other catalogs (AFAIK) only support their native format, i.e. Nessie supports Iceberg, Unity supports Delta, Hudi is its own catalog, etc.

What you're currently proposing, AFAICT, is a way for users leveraging the Hive Metastore to use all 4 table formats seamlessly from a single connector.

Currently, other catalogs are 1:1 with a table format, so this cross-table-format connector isn't necessary. (Though I think there may be an argument to be made that it still adds value from a semantics perspective -- most users that I work with would expect to think of their connector in terms of the catalog (Nessie/Unity/etc) rather than the table format (Iceberg/Delta/etc), and the 'lakehouse' connector helps move this closer to what users perceive.)

What if Nessie were to add support for Delta, or Unity for Iceberg? Now presumably we would want a way to bring the same cross-table-format logic to these catalogs. Now that I think about it, I think the same is true for Starburst and the Galaxy catalog?

In theory, in a lakehouse, the catalog and the table format should be decoupled choices. Right now we have table-format-specific connectors that have customizable catalogs, and this PR seems to be adding a catalog-specific connector (HMS-only) that supports customizable table formats. I am asking about whether we can move slightly further to a more generic lakehouse connector where the catalog is configurable, and the table formats supported are whichever ones are supported by the configured catalog.

@electrum (Member, Author)

I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim.

Do you have suggestions on a better name? We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc.

From a documentation perspective, users will need to refer to the individual connector page to see how to create tables, what properties apply, etc.

@xkrogen (Member) commented Mar 28, 2025

> Do you have suggestions on a better name? We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc.

Naming depends on whether we plan for this to be HMS-specific long-term. "lakehouse" is a great name if we intend for it to truly become the one-stop shop for all lakehouse catalogs and table formats. But if it's tied to HMS specifically, then that should be reflected in the name. Many people have lakehouses that don't use HMS!

> I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim.

This is what is a little concerning to me. Why should we use the HMS interface as the "standard" interface for what Trino considers lakehouse catalogs? Yes, it is widely used today, but I'm hesitant to say that we should bet on it being the long-term de facto standard for lakehouse catalogs.

That said -- as long as we are conceptually aligned on the lakehouse connector supporting a variety of catalogs, regardless of the underlying implementation (HMS API shims vs a more native approach), then we can of course evolve this over time.

@Override
public Optional<ConnectorTableHandle> applyPartitioning(ConnectorSession session, ConnectorTableHandle tableHandle, Optional<ConnectorPartitioningHandle> partitioningHandle, List<ColumnHandle> columns)
{
    return forHandle(tableHandle).applyPartitioning(session, tableHandle, partitioningHandle, columns);
}

@weijiii (Member) commented Mar 30, 2025

Should we verify that tableHandle and partitioningHandle are for the same format, like in getCommonPartitioningHandle, or is that not expected to happen because this API is applied only during table scanning? (If I understand the new API correctly.)

@electrum (Member, Author)

Looking at getCommonPartitioningHandle again, the verification seems potentially incorrect, as it might be called for, e.g., an Iceberg table and a Hive table, in which case they should be treated as having incompatible partitioning.

For this method, the partitioning always comes from the table, so it shouldn't be possible for them to mismatch.
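
To make the check under discussion concrete, here is a minimal sketch (not the PR's actual code): delegate only when both partitioning handles come from the same underlying connector, and otherwise report incompatible partitioning. The forPartitioningHandle dispatcher and Delegate interface are hypothetical, mirroring the forHandle(tableHandle) pattern in the snippet above.

```java
import java.util.Optional;

import io.trino.spi.connector.ConnectorPartitioningHandle;
import io.trino.spi.connector.ConnectorSession;

class FormatCheckSketch
{
    Optional<ConnectorPartitioningHandle> getCommonPartitioningHandle(
            ConnectorSession session,
            ConnectorPartitioningHandle left,
            ConnectorPartitioningHandle right)
    {
        // Handles from different table formats (e.g., Iceberg vs. Hive)
        // have no common partitioning, per the discussion above.
        if (!left.getClass().equals(right.getClass())) {
            return Optional.empty();
        }
        // forPartitioningHandle is a hypothetical per-format dispatcher.
        return forPartitioningHandle(left).getCommonPartitioningHandle(session, left, right);
    }

    private Delegate forPartitioningHandle(ConnectorPartitioningHandle handle)
    {
        throw new UnsupportedOperationException("illustrative only");
    }

    interface Delegate
    {
        Optional<ConnectorPartitioningHandle> getCommonPartitioningHandle(
                ConnectorSession session,
                ConnectorPartitioningHandle left,
                ConnectorPartitioningHandle right);
    }
}
```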

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a new Lakehouse connector that unifies the functionalities of the Hive, Iceberg, Delta Lake, and Hudi connectors. It includes the implementation of core connector classes, module bindings, session/table properties aggregation, updated documentation, and CI configuration changes.

Reviewed Changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSplitManager.java | Implements split management using the underlying connectors. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSessionProperties.java | Aggregates session properties from the Hive, Iceberg, Delta Lake, and Hudi connectors. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePlugin.java | Registers the Lakehouse connector via the plugin interface. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSourceProviderFactory.java | Provides page source support based on table handle types. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSinkProvider.java | Implements page sink support for different insert, merge, and execute handles. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseNodePartitioningProvider.java | Offers bucket mapping and partitioning support by delegating to underlying providers. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseModule.java | Configures dependency injection for core Lakehouse connector components. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseIcebergModule.java | Sets up bindings specific to Iceberg integration. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHudiModule.java | Sets up bindings specific to Hudi integration. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHiveModule.java | Configures Hive-specific bindings and integrations. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseFileSystemModule.java | Integrates file system support for the connector. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseDeltaModule.java | Configures Delta Lake-specific components and scheduling. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java | Bootstraps the connector and creates its instance. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnector.java | Implements the main connector logic, including lifecycle and transaction management. |
| plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConfig.java | Provides configuration options (e.g., default table type) for the connector. |
| docs/src/main/sphinx/connector/lakehouse.md | Documents connector configuration and examples. |
| docs/src/main/sphinx/connector.md | Adds an entry for the Lakehouse connector. |
| .github/workflows/ci.yml | Updates CI to include the Lakehouse connector in test modules. |
Files not reviewed (2)
  • core/trino-server/src/main/provisio/trino.xml: Language not supported
  • plugin/trino-lakehouse/pom.xml: Language not supported
Comments suppressed due to low confidence (1)

plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java:56

  • [nitpick] Consider renaming the use of '_' as the try-with-resources variable name to a more descriptive identifier (e.g., threadContextClassLoader) to avoid potential conflicts with reserved identifiers in newer Java versions.
try (ThreadContextClassLoader _ = new ThreadContextClassLoader(getClass().getClassLoader())) {
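
For reference, the rename Copilot suggests would look like this; a sketch only, with threadContextClassLoader simply being the descriptive name proposed in the nitpick:

```java
// Same behavior as the original line; the try-with-resources variable
// just gets a descriptive name instead of '_'.
try (ThreadContextClassLoader threadContextClassLoader = new ThreadContextClassLoader(getClass().getClassLoader())) {
    // construct the connector while the plugin's class loader is in scope
}
```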

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Apr 21, 2025
@dain (Member) commented May 5, 2025

> > I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim.
>
> This is what is a little concerning to me. Why should we use the HMS interface as the "standard" interface for what Trino considers lakehouse catalogs? Yes, it is widely used today, but I'm hesitant to say that we should bet on it being the long-term de facto standard for lakehouse catalogs.

This isn't really about HMS per se, but about how table management works in environments like HMS. There are assumptions baked into Hive-like systems that affect how we deal with tables, views, functions, and so on. This connector is based on the existing HMS-centric connectors and should be able to handle anything that works this way. Today, this means HMS and Glue, and I expect this could be extended to Unity using the Unity REST APIs. But it would not work with Iceberg REST, because Iceberg REST catalogs just work a different way and are single table format today.

Anyway, this is a long way of saying "this connector is what people expect when they want to read data from a lake".

@github-actions github-actions bot removed the stale label May 6, 2025
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label May 28, 2025
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jun 18, 2025
@mosabua (Member) commented Jun 19, 2025

We should still add this ... what do you think, @electrum and @dain?

@electrum electrum reopened this Jun 19, 2025
@sajjoseph (Contributor)

I incorporated this connector in my local build and it is working. Thanks @electrum for this special connector. It avoids having to create different catalogs based on table format against the same data source.

Is there a way to filter the table list based on table format (for example, list only Iceberg tables under the Lakehouse catalog)? The use case is to track the adoption rate of new table formats.

Minor changes are needed to get the code working in 476 though.

LakehouseIcebergModule.java
LakehouseMetadata.java
LakehouseSessionProperties.java (optional)

@github-actions github-actions bot removed the stale label Jun 19, 2025
@mosabua mosabua added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label Jun 20, 2025
@mosabua (Member) commented Jun 20, 2025

We want to bring this connector in since it is a great and useful feature to have. I assume @electrum, @dain, and others can continue to drive this, so I added the stale-ignore label to avoid noise.

@electrum electrum force-pushed the lakehouse branch 6 times, most recently from cba2855 to bb3fcde Compare June 25, 2025 00:01
Comment on lines 3 to 8
The Lakehouse connector combines the features of the
[Hive](/connector/hive), [Iceberg](/connector/iceberg),
[Delta Lake](/connector/delta-lake), and [Hudi](/connector/hudi)
connectors into a single connector. It allows you to query or write
to data stored in multiple table types (also known as table formats)
that all share the same file system and metastore service.

A Member commented:

I think we should not define this in terms of the old system and instead we should just state the whole thing. So something like this (GenAI):

The Trino Lakehouse connector provides a unified way to interact with data stored in various table formats across different storage systems and metastore services. This single connector allows you to query and write data seamlessly, regardless of whether it's in Iceberg, Delta Lake, or Hudi table formats, or traditional Hive tables.

This connector offers flexible connectivity to popular metastore services like AWS Glue and Hive Metastore. For data storage, it supports a wide range of options including cloud storage services such as AWS S3, S3-compatible storage, Google Cloud Storage (GCS), and Azure Blob Storage, as well as HDFS installations.

@electrum (Member, Author)

I like this, but it's important for users to understand that this actually does combine the connectors: all of the config properties, session properties, table properties, behavior, etc., are those of the underlying connectors, and users need to read the appropriate documentation for whichever features they are using. Thoughts?
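
For context, a catalog using this connector would presumably be configured with a properties file along these lines. This is a hedged sketch: connector.name=lakehouse follows from the plugin added in this PR, and hive.metastore.uri is the standard Hive Thrift metastore property shared by the underlying connectors; the file name and host are just examples.

```properties
# Example catalog file, e.g. etc/catalog/lakehouse.properties (name is illustrative).
connector.name=lakehouse
# Standard Hive metastore property used by the underlying connectors.
hive.metastore.uri=thrift://example.net:9083
```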

@electrum electrum merged commit 4def327 into trinodb:master Jun 25, 2025
189 of 191 checks passed
@electrum electrum deleted the lakehouse branch June 25, 2025 23:59
@github-actions github-actions bot added this to the 477 milestone Jun 25, 2025
Labels
cla-signed, docs, hive (Hive connector), iceberg (Iceberg connector), stale-ignore (use on PRs the stale bot should ignore so they are not flagged or closed)
6 participants