feat(aws-glue): Add support for AWS Glue Table and Column Statistics #26297
base: master
Reviewer's Guide
This PR adds comprehensive support for AWS Glue table and column statistics by introducing a new columnStatisticsProvider abstraction (enabled or disabled via config), wiring it through the GlueHiveMetastore (for reads and batched writes), updating metastore APIs for multi-partition updates, augmenting Glue API metrics, enhancing partition fetch logic, and adjusting tests to validate the new behavior.
Class diagram for new and updated Glue column statistics provider classes
classDiagram
class GlueColumnStatisticsProvider {
<<interface>>
+Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
+Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
+Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
+void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
+void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}
class DefaultGlueColumnStatisticsProvider {
+Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
+Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
+Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
+void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
+void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}
class DisabledGlueColumnStatisticsProvider {
+Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
+Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
+Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
+void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
+void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}
GlueColumnStatisticsProvider <|.. DefaultGlueColumnStatisticsProvider
GlueColumnStatisticsProvider <|.. DisabledGlueColumnStatisticsProvider
class GlueColumnStatisticsProvider.PartitionStatisticsUpdate {
+Partition getPartition()
+Map<String, HiveColumnStatistics> getColumnStatistics()
}
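The provider split in the diagram above can be sketched in plain Java. This is a minimal illustration, not the PR's code: `ColumnStatsSketch`, `StatsProviderSketch`, `DisabledStatsProviderSketch`, and `InMemoryStatsProviderSketch` are hypothetical stand-ins, tables and partitions are reduced to plain strings, and the disabled provider's behaviour (empty reads, dropped writes) is an assumption about what "disabled" means here.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HiveColumnStatistics; carries only a distinct-value count.
final class ColumnStatsSketch {
    final long distinctValues;

    ColumnStatsSketch(long distinctValues) {
        this.distinctValues = distinctValues;
    }
}

// Simplified mirror of the interface in the diagram; names are plain strings here.
interface StatsProviderSketch {
    Map<String, ColumnStatsSketch> getTableColumnStatistics(String table);

    Map<String, Map<String, ColumnStatsSketch>> getPartitionColumnStatistics(Collection<String> partitions);

    void updateTableColumnStatistics(String table, Map<String, ColumnStatsSketch> columnStatistics);
}

// Assumed behaviour of a disabled provider: reads return nothing, writes are dropped.
final class DisabledStatsProviderSketch implements StatsProviderSketch {
    @Override
    public Map<String, ColumnStatsSketch> getTableColumnStatistics(String table) {
        return Collections.emptyMap();
    }

    @Override
    public Map<String, Map<String, ColumnStatsSketch>> getPartitionColumnStatistics(Collection<String> partitions) {
        Map<String, Map<String, ColumnStatsSketch>> result = new HashMap<>();
        for (String partition : partitions) {
            // Every requested partition is present, but with no column statistics.
            result.put(partition, Collections.<String, ColumnStatsSketch>emptyMap());
        }
        return result;
    }

    @Override
    public void updateTableColumnStatistics(String table, Map<String, ColumnStatsSketch> columnStatistics) {
        // Intentionally a no-op.
    }
}

// In-memory stand-in for an enabled provider; a real one would call the Glue APIs instead.
final class InMemoryStatsProviderSketch implements StatsProviderSketch {
    private final Map<String, Map<String, ColumnStatsSketch>> statsByName = new HashMap<>();

    @Override
    public Map<String, ColumnStatsSketch> getTableColumnStatistics(String table) {
        return statsByName.getOrDefault(table, Collections.<String, ColumnStatsSketch>emptyMap());
    }

    @Override
    public Map<String, Map<String, ColumnStatsSketch>> getPartitionColumnStatistics(Collection<String> partitions) {
        Map<String, Map<String, ColumnStatsSketch>> result = new HashMap<>();
        for (String partition : partitions) {
            result.put(partition, getTableColumnStatistics(partition));
        }
        return result;
    }

    @Override
    public void updateTableColumnStatistics(String table, Map<String, ColumnStatsSketch> columnStatistics) {
        // Copy defensively so later changes to the caller's map do not leak in.
        statsByName.put(table, new HashMap<>(columnStatistics));
    }
}
```

Because callers depend only on the interface, a metastore written against it does not need to branch on whether statistics are enabled; it simply receives empty results from the disabled variant.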
Class diagram for updated GlueHiveMetastore and config wiring
classDiagram
class GlueHiveMetastore {
-GlueColumnStatisticsProvider columnStatisticsProvider
-boolean enableColumnStatistics
-Executor partitionsReadExecutor
+Set<ColumnStatisticType> getSupportedColumnStatistics(MetastoreContext, Type)
+PartitionStatistics getTableStatistics(...)
+Map<String, PartitionStatistics> getPartitionStatistics(...)
+void updateTableStatistics(...)
+void updatePartitionStatistics(...)
}
class GlueHiveMetastoreConfig {
+boolean columnStatisticsEnabled
+int readStatisticsThreads
+int writeStatisticsThreads
+boolean isColumnStatisticsEnabled()
+int getReadStatisticsThreads()
+int getWriteStatisticsThreads()
}
class GlueMetastoreModule {
+@Provides Executor createStatisticsReadExecutor(...)
+@Provides Executor createStatisticsWriteExecutor(...)
}
GlueHiveMetastore --> GlueColumnStatisticsProvider
GlueHiveMetastore --> GlueHiveMetastoreConfig
GlueMetastoreModule --> GlueHiveMetastoreConfig
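The wiring in the diagram above (thread counts on the config feeding statistics executors) might look roughly like the following. This is a sketch under stated assumptions: `StatsConfigSketch` and `StatsExecutorsSketch` are hypothetical names, the default values are invented for illustration, and the fan-out helper `readAll` is one plausible way to use a read executor, not necessarily how the PR fetches partition statistics.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical config holder mirroring the three fields shown in the diagram.
final class StatsConfigSketch {
    private boolean columnStatisticsEnabled = true; // assumed default
    private int readStatisticsThreads = 4;          // assumed default
    private int writeStatisticsThreads = 4;         // assumed default

    boolean isColumnStatisticsEnabled() {
        return columnStatisticsEnabled;
    }

    int getReadStatisticsThreads() {
        return readStatisticsThreads;
    }

    int getWriteStatisticsThreads() {
        return writeStatisticsThreads;
    }

    StatsConfigSketch setColumnStatisticsEnabled(boolean enabled) {
        this.columnStatisticsEnabled = enabled;
        return this;
    }

    StatsConfigSketch setReadStatisticsThreads(int threads) {
        this.readStatisticsThreads = threads;
        return this;
    }

    StatsConfigSketch setWriteStatisticsThreads(int threads) {
        this.writeStatisticsThreads = threads;
        return this;
    }
}

final class StatsExecutorsSketch {
    private StatsExecutorsSketch() {}

    // Builds a bounded pool sized from the config, analogous to a @Provides method.
    static ExecutorService createStatisticsReadExecutor(StatsConfigSketch config) {
        return Executors.newFixedThreadPool(config.getReadStatisticsThreads());
    }

    static ExecutorService createStatisticsWriteExecutor(StatsConfigSketch config) {
        return Executors.newFixedThreadPool(config.getWriteStatisticsThreads());
    }

    // Submits one read per partition to the executor, then collects results in input order.
    static <T> Map<String, T> readAll(List<String> partitions, Function<String, T> reader, ExecutorService executor) {
        Map<String, CompletableFuture<T>> futures = new LinkedHashMap<>();
        for (String partition : partitions) {
            futures.put(partition, CompletableFuture.supplyAsync(() -> reader.apply(partition), executor));
        }
        Map<String, T> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<T>> entry : futures.entrySet()) {
            // join() blocks until the read finishes and rethrows failures unchecked.
            results.put(entry.getKey(), entry.getValue().join());
        }
        return results;
    }
}
```

Keeping separate read and write pools means a burst of statistics writes cannot starve reads issued during query planning; the pools should be shut down when the metastore is closed.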
Co-authored-by: Deepak Majeti
Co-authored-by: George Wang
b04463a to c44ca2d
Description
Add support for AWS Glue Table and Column Statistics
Motivation and Context
Impact
Users will be able to use table and column statistics, and enable the cost-based optimizer (CBO), when using AWS Glue as a metastore.
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.