Add lakehouse connector #25347
base: master
Conversation
Force-pushed from eef6588 to 09aae45
Force-pushed from c0e4fa3 to 29f5c5e
I've been waiting for this feature for a long time. Thanks @electrum for this contribution. Does the new connector support different partitioning approaches for different table formats?

Yes, the partitioning and all other features are native per table format.
* [](/object-storage/file-system-azure)
* [](/object-storage/file-system-gcs)
* [](/object-storage/file-system-s3)
* [](/object-storage/file-system-hdfs)
add local as well
We don't have that for other connectors. Should we?
Love to see this. One question for you -- do you envision this supporting other catalogs besides HMS-compatible catalogs in the future? Currently I see it's only Thrift HMS and Glue. If you are envisioning this for HMS-compatible catalogs, I might name this something more like

The entire purpose of this is to support multiple table formats that share the same metastore. If you have an Iceberg catalog, then you can use the Iceberg connector directly.

@electrum I am thinking of other catalogs supporting multiple table formats. I'm not sure if any others exist today, but you could envision Nessie or Unity Catalog adding support for other table formats beyond their native Iceberg/Delta, for example, though perhaps not the Hive format.

@daniel-pan-trino - we definitely can discuss the details. Please reach out to me on Slack.

@xkrogen We don't support any other such formats today, and it seems unlikely any format would surpass Iceberg or Delta in popularity. More formats mean more work for everyone and fragment the community, so it's in our interest to discourage that.
I think you misunderstood my comment -- I'm not talking about new table formats. What I'm saying is that today we have, roughly, a situation that looks like this:
What you're currently proposing, AFAICT, is a way for users leveraging the Hive Metastore to use all four table formats seamlessly from a single connector. Currently, other catalogs are 1:1 with a table format, so this cross-table-format connector isn't necessary. (Though I think there may be an argument that it still adds value from a semantics perspective -- most users I work with think of their connector in terms of the catalog (Nessie/Unity/etc.) rather than the table format (Iceberg/Delta/etc.), and the 'lakehouse' connector moves this closer to what users perceive.)

What if Nessie were to add support for Delta, or Unity for Iceberg? Now presumably we would want a way to bring the same cross-table-format logic to these catalogs. Now that I think about it, I think the same is true for Starburst and the Galaxy catalog?

In theory, in a lakehouse, the catalog and the table format should be decoupled choices. Right now we have table-format-specific connectors with customizable catalogs, and this PR seems to be adding a catalog-specific connector (HMS-only) with customizable table formats. I am asking whether we can move slightly further to a more generic lakehouse connector where the catalog is configurable, and the supported table formats are whichever ones the configured catalog supports.
I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim. Do you have suggestions for a better name?

We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc. From a documentation perspective, users will need to refer to the individual connector page to see how to create tables, what properties apply, etc.
Naming depends on whether or not we plan for this to be HMS-specific long-term. "lakehouse" is a great name if we intend for it to truly become the one-stop shop for all lakehouse catalogs and table formats. But if it's tied to HMS specifically, then that should be reflected in the name. Many people have lakehouses that don't use HMS!

This is what's a little concerning to me. Why should we use the HMS interface as the "standard" interface for what Trino considers lakehouse catalogs? Yes, it is widely used today, but I'm hesitant to bet on it being the long-term de facto standard for lakehouse catalogs. That said -- as long as we are conceptually aligned on the
@Override
public Optional<ConnectorTableHandle> applyPartitioning(ConnectorSession session, ConnectorTableHandle tableHandle, Optional<ConnectorPartitioningHandle> partitioningHandle, List<ColumnHandle> columns)
{
    // Delegates to the metadata of whichever table format owns this handle.
    return forHandle(tableHandle).applyPartitioning(session, tableHandle, partitioningHandle, columns);
}
Should we verify that `tableHandle` and `partitioningHandle` are for the same format, like in `getCommonPartitioningHandle`, or is that not expected to happen because this API is applied only during table scanning? (If I understand the new API correctly.)
Looking at `getCommonPartitioningHandle` again, the verification seems potentially incorrect, as it might be called for, e.g., an Iceberg and a Hive table, in which case they should be treated as having incompatible partitioning.

For this method, the partitioning always comes from the table, so it shouldn't be possible for them to mismatch.
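For readers following along: below is a standalone sketch of the format check under discussion. The `TableType` enum, the `PartitioningHandle` record, and the simplified equality-based compatibility rule are illustrative assumptions, not code from this PR.

```java
import java.util.Optional;

public class CommonPartitioningSketch
{
    enum TableType { HIVE, ICEBERG, DELTA, HUDI }

    record PartitioningHandle(TableType tableType, String scheme) {}

    static Optional<PartitioningHandle> getCommonPartitioningHandle(PartitioningHandle left, PartitioningHandle right)
    {
        if (left.tableType() != right.tableType()) {
            // An Iceberg and a Hive table have incompatible partitioning,
            // so there is no common handle to report.
            return Optional.empty();
        }
        // Same format: fall back to that format's own compatibility rule
        // (simplified here to exact equality of the partitioning scheme).
        return left.scheme().equals(right.scheme()) ? Optional.of(left) : Optional.empty();
    }
}
```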
Pull Request Overview
This PR introduces a new Lakehouse connector that unifies the functionalities of the Hive, Iceberg, Delta Lake, and Hudi connectors. It includes the implementation of core connector classes, module bindings, session/table properties aggregation, updated documentation, and CI configuration changes.
Reviewed Changes
Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.
File | Description |
---|---|
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSplitManager.java | Implements split management using the underlying connectors. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSessionProperties.java | Aggregates session properties from Hive, Iceberg, Delta Lake, and Hudi connectors. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePlugin.java | Registers the Lakehouse connector via the plugin interface. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSourceProviderFactory.java | Provides page source support based on table handle types. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSinkProvider.java | Implements page sink support for different insert, merge, and execute handles. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseNodePartitioningProvider.java | Offers bucket mapping and partitioning support by delegating to underlying providers. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseModule.java | Configures dependency injection for core Lakehouse connector components. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseIcebergModule.java | Sets up bindings specific to Iceberg integration. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHudiModule.java | Sets up bindings specific to Hudi integration. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHiveModule.java | Configures Hive-specific bindings and integrations. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseFileSystemModule.java | Integrates file system support for the connector. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseDeltaModule.java | Configures Delta Lake-specific components and scheduling. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java | Bootstraps the connector and creates its instance. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnector.java | Implements the main connector logic including lifecycle and transaction management. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConfig.java | Provides configuration options (e.g., default table type) for the connector. |
docs/src/main/sphinx/connector/lakehouse.md | Documents connector configuration and examples. |
docs/src/main/sphinx/connector.md | Adds an entry for the Lakehouse connector. |
.github/workflows/ci.yml | Updates CI to include the Lakehouse connector in test modules. |
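A unifying pattern in the file summary above is delegation keyed on the table handle's concrete type, i.e., the `forHandle` call visible earlier in the diff. The following self-contained sketch illustrates that pattern; all class and method names are invented for illustration and are not the PR's actual classes:

```java
import java.util.Map;

public class DispatchSketch
{
    interface TableHandle {}

    record HiveTableHandle(String table) implements TableHandle {}

    record IcebergTableHandle(String table) implements TableHandle {}

    interface FormatDelegate
    {
        String planSplits(TableHandle handle);
    }

    private final Map<Class<?>, FormatDelegate> delegates;

    DispatchSketch(Map<Class<?>, FormatDelegate> delegates)
    {
        this.delegates = Map.copyOf(delegates);
    }

    // Equivalent of the forHandle(tableHandle) idea: choose the delegate
    // that owns the handle's concrete type, or fail loudly.
    FormatDelegate forHandle(TableHandle handle)
    {
        FormatDelegate delegate = delegates.get(handle.getClass());
        if (delegate == null) {
            throw new IllegalArgumentException("Unsupported table handle: " + handle.getClass().getName());
        }
        return delegate;
    }

    String planSplits(TableHandle handle)
    {
        return forHandle(handle).planSplits(handle);
    }
}
```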
Files not reviewed (2)
- core/trino-server/src/main/provisio/trino.xml: Language not supported
- plugin/trino-lakehouse/pom.xml: Language not supported
Comments suppressed due to low confidence (1)
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java:56
- [nitpick] Consider renaming the use of '_' as the try-with-resources variable name to a more descriptive identifier (e.g., threadContextClassLoader) to avoid potential conflicts with reserved identifiers in newer Java versions.
try (ThreadContextClassLoader _ = new ThreadContextClassLoader(getClass().getClassLoader())) {
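For reference, the rename the nitpick suggests would look roughly like the sketch below, assuming the Trino SPI's `ThreadContextClassLoader` (which restores the previous context class loader on close); the surrounding class is reduced to a minimal compilable example:

```java
import io.trino.spi.classloader.ThreadContextClassLoader;

public class FactorySketch
{
    void createConnector()
    {
        // Descriptive name instead of `_`. Note that `_` is the
        // unnamed-variable syntax in Java 22+, so the original form
        // also compiles on recent JDKs.
        try (ThreadContextClassLoader threadContextClassLoader =
                new ThreadContextClassLoader(getClass().getClassLoader())) {
            // connector bootstrap work runs with this class loader active
        }
    }
}
```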
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
Description
The Lakehouse connector unifies the Hive, Iceberg, Delta Lake and Hudi connectors.
This doesn't yet implement callable procedures or table procedures.
Release notes
(x) Release notes are required, with the following suggested text: