Add lakehouse connector #25347
Conversation
Force-pushed from eef6588 to 09aae45
Force-pushed from c0e4fa3 to 29f5c5e
I was waiting for this feature for a long time. Thanks @electrum for this contribution. Does the new connector support different partitioning approaches for different table formats?
Yes, the partitioning and all other features are native per table format.
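For example, something like the following pair of statements would create differently partitioned tables in one lakehouse catalog (a sketch only; the `type` table property used here to pick the format and the schema/table names are assumptions, while `partitioned_by` and `partitioning` are the existing Hive and Iceberg connector table properties):

```sql
-- Hive format: identity partitioning via partitioned_by (partition columns listed last)
CREATE TABLE lakehouse.sales.orders_hive (
    order_id bigint,
    amount decimal(12, 2),
    order_date date
)
WITH (type = 'HIVE', partitioned_by = ARRAY['order_date']);

-- Iceberg format: hidden partitioning with a transform via partitioning
CREATE TABLE lakehouse.sales.orders_iceberg (
    order_id bigint,
    amount decimal(12, 2),
    order_date date
)
WITH (type = 'ICEBERG', partitioning = ARRAY['month(order_date)']);
```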
Love to see this. One question for you -- do you envision this supporting other catalogs besides HMS-compatible catalogs in the future? Currently I see it's only Thrift HMS and Glue. If you are envisioning this for HMS-compatible catalogs I might name this something more like
The entire purpose of this is to support multiple table formats that share the same metastore. If you have an Iceberg catalog, then you can use the Iceberg connector directly.
@electrum I am thinking of other catalogs supporting multiple table formats. I'm not sure if any others exist today, but you could envision Nessie or Unity Catalog adding support for other table formats beyond their native Iceberg/Delta, for example, though perhaps not the Hive format.
@daniel-pan-trino - we can definitely discuss the details. Please reach out to me on Slack.
@xkrogen We don't support any other such formats today, and it seems unlikely any format would surpass Iceberg or Delta in popularity. More formats mean more work for everyone and fragment the community, so it's in our interest to discourage that.
I think you misunderstood my comment -- I'm not talking about new table formats. What I'm saying is that today we have, roughly, a situation where each table format has its own connector, and each of those connectors supports its own set of catalogs.
What you're currently proposing, AFAICT, is a way for users leveraging the Hive Metastore to use all 4 table formats seamlessly from a single connector. Currently, other catalogs are 1:1 with a table format, so this cross-table-format connector isn't necessary. (Though I think there may be an argument to be made that it still adds value from a semantics perspective -- most users that I work with would expect to think of their connector in terms of the catalog (Nessie/Unity/etc) rather than the table format (Iceberg/Delta/etc), and the 'lakehouse' connector helps move this closer to what users perceive.)

What if Nessie were to add support for Delta, or Unity for Iceberg? Presumably we would then want a way to bring the same cross-table-format logic to those catalogs. Now that I think about it, I think the same is true for Starburst and the Galaxy catalog?

In theory, in a lakehouse, the catalog and the table format should be decoupled choices. Right now we have table-format-specific connectors with customizable catalogs, and this PR seems to be adding a catalog-specific connector (HMS-only) that supports customizable table formats. I am asking whether we can move slightly further toward a more generic lakehouse connector where the catalog is configurable, and the table formats supported are whichever ones the configured catalog supports.
I understand what you're saying now. Unity uses the HMS interface, so we could make it work with this. Depending on how the integration works, we could support other catalogs for multiple connectors by adding an HMS shim. Do you have suggestions on a better name?

We're not changing the fundamental nature of how connectors work in Trino. Each table format works differently and has its own syntax for table creation, etc. From a documentation perspective, users will need to refer to the individual connector page to see how to create tables, what properties apply, etc.
Naming depends on whether or not we plan for this to be HMS-specific long-term. "lakehouse" is a great name if we intend for it to truly become the one-stop shop for all lakehouse catalogs and table formats. But if it's tied to HMS specifically, then that should be reflected in the name. Many people have lakehouses that don't use HMS!
This is what is a little concerning to me. Why should we use the HMS interface as the "standard" interface for what Trino considers lakehouse catalogs? Yes, it is widely used today, but I'm hesitant to say that we should bet on it being the long-term de facto standard for lakehouse catalogs. That said -- as long as we are conceptually aligned on the
@Override
public Optional<ConnectorTableHandle> applyPartitioning(ConnectorSession session, ConnectorTableHandle tableHandle, Optional<ConnectorPartitioningHandle> partitioningHandle, List<ColumnHandle> columns)
{
    return forHandle(tableHandle).applyPartitioning(session, tableHandle, partitioningHandle, columns);
}
Should we verify that `tableHandle` and `partitioningHandle` are for the same format, like in `getCommonPartitioningHandle`, or is that not expected to happen because this API is applied only during table scanning? (If I understand the new API correctly.)
Looking at `getCommonPartitioningHandle` again, the verification seems potentially incorrect, as it might be called for, e.g., an Iceberg and a Hive table, in which case they should be treated as having incompatible partitioning.

For this method, the partitioning always comes from the table, so it shouldn't be possible for them to mismatch.
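A hypothetical sketch of the kind of check being discussed, not the actual PR code (the `forHandle` dispatch helper is assumed here to resolve the delegate for a handle's format, as in the snippet above):

```java
@Override
public Optional<ConnectorPartitioningHandle> getCommonPartitioningHandle(
        ConnectorSession session,
        ConnectorPartitioningHandle left,
        ConnectorPartitioningHandle right)
{
    // Handles from different table formats (for example, Iceberg and Hive) should be
    // treated as having incompatible partitioning instead of being delegated together.
    if (!left.getClass().equals(right.getClass())) {
        return Optional.empty();
    }
    return forHandle(left).getCommonPartitioningHandle(session, left, right);
}
```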
Pull Request Overview
This PR introduces a new Lakehouse connector that unifies the functionalities of the Hive, Iceberg, Delta Lake, and Hudi connectors. It includes the implementation of core connector classes, module bindings, session/table properties aggregation, updated documentation, and CI configuration changes.
Reviewed Changes
Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSplitManager.java | Implements split management using the underlying connectors. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseSessionProperties.java | Aggregates session properties from Hive, Iceberg, Delta Lake, and Hudi connectors. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePlugin.java | Registers the Lakehouse connector via the plugin interface. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSourceProviderFactory.java | Provides page source support based on table handle types. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehousePageSinkProvider.java | Implements page sink support for different insert, merge, and execute handles. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseNodePartitioningProvider.java | Offers bucket mapping and partitioning support by delegating to underlying providers. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseModule.java | Configures dependency injection for core Lakehouse connector components. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseIcebergModule.java | Sets up bindings specific to Iceberg integration. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHudiModule.java | Sets up bindings specific to Hudi integration. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseHiveModule.java | Configures Hive-specific bindings and integrations. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseFileSystemModule.java | Integrates file system support for the connector. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseDeltaModule.java | Configures Delta Lake-specific components and scheduling. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java | Bootstraps the connector and creates its instance. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnector.java | Implements the main connector logic including lifecycle and transaction management. |
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConfig.java | Provides configuration options (e.g., default table type) for the connector. |
docs/src/main/sphinx/connector/lakehouse.md | Documents connector configuration and examples. |
docs/src/main/sphinx/connector.md | Adds an entry for the Lakehouse connector. |
.github/workflows/ci.yml | Updates CI to include the Lakehouse connector in test modules. |
Files not reviewed (2)
- core/trino-server/src/main/provisio/trino.xml: Language not supported
- plugin/trino-lakehouse/pom.xml: Language not supported
Comments suppressed due to low confidence (1)
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseConnectorFactory.java:56
- [nitpick] Consider renaming the use of '_' as the try-with-resources variable name to a more descriptive identifier (e.g., threadContextClassLoader) to avoid potential conflicts with reserved identifiers in newer Java versions.
try (ThreadContextClassLoader _ = new ThreadContextClassLoader(getClass().getClassLoader())) {
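Something like the following would address the nitpick (a sketch only; the body is elided and the variable name is just one option):

```java
try (ThreadContextClassLoader threadContextClassLoader = new ThreadContextClassLoader(getClass().getClassLoader())) {
    // ... construct the connector while the plugin class loader is the thread context class loader ...
}
```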
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
This isn't really about HMS per se, but is really about how table management works in environments like HMS. There are assumptions baked into Hive-like systems that affect how we deal with tables, views, functions, and so on. This connector is based on the existing HMS-centric connectors, and should be able to handle anything that works this way. Today, this means HMS and Glue, and I expect this could be extended to Unity using the Unity REST APIs. But it would not work with Iceberg REST, because Iceberg REST catalogs just work a different way, and are single table format today. Anyway, this is a long way of saying "this connector is what people expect when they want to read data from a lake".
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.
I incorporated this connector in my local build and it is working. Thanks @electrum for this special connector. It helps avoid creating different connectors based on table format against the same data source. Is there a way to filter the table list based on table format (for example, list only Iceberg tables under the Lakehouse catalog)? The use case is to track the adoption rate of new table formats. Minor changes are needed to get the code working in Trino 476 though. LakehouseIcebergModule.java
Force-pushed from cba2855 to bb3fcde
The Lakehouse connector combines the features of the [Hive](/connector/hive), [Iceberg](/connector/iceberg), [Delta Lake](/connector/delta-lake), and [Hudi](/connector/hudi) connectors into a single connector. It allows you to query or write to data stored in multiple table types (also known as table formats) that all share the same file system and metastore service.
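As a rough illustration, a catalog properties file might look like the following (a sketch only; the metastore and file-system properties come from the underlying Hive and native file-system support, and the host name and region are placeholders):

```
connector.name=lakehouse
# Metastore shared by all table formats (Thrift HMS shown; Glue is also supported)
hive.metastore.uri=thrift://example-metastore:9083
# Native S3 file system support
fs.native-s3.enabled=true
s3.region=us-east-1
```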
I think we should not define this in terms of the old system and instead we should just state the whole thing. So something like this (GenAI):
The Trino Lakehouse connector provides a unified way to interact with data stored in various table formats across different storage systems and metastore services. This single connector allows you to query and write data seamlessly, regardless of whether it's in Iceberg, Delta Lake, or Hudi table formats, or traditional Hive tables.
This connector offers flexible connectivity to popular metastore services like AWS Glue and Hive Metastore. For data storage, it supports a wide range of options including cloud storage services such as AWS S3, S3-compatible storage, Google Cloud Storage (GCS), and Azure Blob Storage, as well as HDFS installations.
I like this, but it's important for users to understand that this actually does combine the connectors: all of the config properties, session properties, table properties, behavior, etc., are those of the underlying connectors, and users need to read the appropriate documentation for whichever features they are using. Thoughts?
Description
The Lakehouse connector unifies the Hive, Iceberg, Delta Lake and Hudi connectors.
This doesn't yet implement callable procedures or table procedures.
Release notes
(x) Release notes are required, with the following suggested text: