imjalpreet
Member

Description

Add support for AWS Glue Table and Column Statistics

Motivation and Context

Impact

Users can collect and use table and column statistics, and so benefit from the cost-based optimizer (CBO), when AWS Glue is the metastore.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Add support for AWS Glue Table and Column Statistics

prestodb-ci added the from:IBM (PR from IBM) label Oct 13, 2025
Contributor

sourcery-ai bot commented Oct 13, 2025

Reviewer's Guide

This PR adds support for AWS Glue table and column statistics. It introduces a GlueColumnStatisticsProvider abstraction that is enabled or disabled via config, wires it through GlueHiveMetastore for reads and batched writes, and updates the metastore APIs to support multi-partition updates. It also adds Glue API metrics for the new operations, changes the partition fetch logic, and adjusts tests to validate the new behavior.

Class diagram for new and updated Glue column statistics provider classes

classDiagram
class GlueColumnStatisticsProvider {
  <<interface>>
  +Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
  +Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
  +Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
  +void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
  +void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}

class DefaultGlueColumnStatisticsProvider {
  +Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
  +Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
  +Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
  +void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
  +void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}

class DisabledGlueColumnStatisticsProvider {
  +Set<ColumnStatisticType> getSupportedColumnStatistics(Type type)
  +Map<String, HiveColumnStatistics> getTableColumnStatistics(Table table)
  +Map<Partition, Map<String, HiveColumnStatistics>> getPartitionColumnStatistics(Collection<Partition> partitions)
  +void updateTableColumnStatistics(Table table, Map<String, HiveColumnStatistics> columnStatistics)
  +void updatePartitionStatistics(Set<PartitionStatisticsUpdate> partitionStatisticsUpdates)
}

GlueColumnStatisticsProvider <|.. DefaultGlueColumnStatisticsProvider
GlueColumnStatisticsProvider <|.. DisabledGlueColumnStatisticsProvider

class GlueColumnStatisticsProvider.PartitionStatisticsUpdate {
  +Partition getPartition()
  +Map<String, HiveColumnStatistics> getColumnStatistics()
}
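
The provider contract in the diagram above can be illustrated with a reduced sketch. Everything here is a simplification for illustration: `ProviderSketch`, `ColumnStats`, and `InMemoryProvider` are hypothetical names, tables are identified by plain strings instead of `Table` objects, and only the table-level read/write pair is modeled. How the disabled variant treats writes is an assumption, not something the guide states.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ProviderSketch {
    // Hypothetical stand-in for HiveColumnStatistics; only one field is modeled.
    static final class ColumnStats {
        final long distinctValuesCount;

        ColumnStats(long distinctValuesCount) {
            this.distinctValuesCount = distinctValuesCount;
        }
    }

    // Reduced version of the provider contract: table-level read and write only.
    interface ColumnStatisticsProvider {
        Map<String, ColumnStats> getTableColumnStatistics(String tableName);

        void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics);
    }

    // Stores statistics in memory where a real implementation would call Glue.
    static final class InMemoryProvider implements ColumnStatisticsProvider {
        private final Map<String, Map<String, ColumnStats>> statsByTable = new HashMap<>();

        @Override
        public Map<String, ColumnStats> getTableColumnStatistics(String tableName) {
            return statsByTable.getOrDefault(tableName, Collections.emptyMap());
        }

        @Override
        public void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics) {
            statsByTable.put(tableName, new HashMap<>(columnStatistics));
        }
    }

    // Assumed behavior for the disabled variant: reads are empty, non-empty writes fail.
    static final class DisabledProvider implements ColumnStatisticsProvider {
        @Override
        public Map<String, ColumnStats> getTableColumnStatistics(String tableName) {
            return Collections.emptyMap();
        }

        @Override
        public void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics) {
            if (!columnStatistics.isEmpty()) {
                throw new UnsupportedOperationException("column statistics are disabled");
            }
        }
    }
}
```

Because both variants sit behind one interface, callers never need to check whether statistics are enabled; they call the provider and get either real values or an empty map.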

Class diagram for updated GlueHiveMetastore and config wiring

classDiagram
class GlueHiveMetastore {
  -GlueColumnStatisticsProvider columnStatisticsProvider
  -boolean enableColumnStatistics
  -Executor partitionsReadExecutor
  +Set<ColumnStatisticType> getSupportedColumnStatistics(MetastoreContext, Type)
  +PartitionStatistics getTableStatistics(...)
  +Map<String, PartitionStatistics> getPartitionStatistics(...)
  +void updateTableStatistics(...)
  +void updatePartitionStatistics(...)
}

class GlueHiveMetastoreConfig {
  +boolean columnStatisticsEnabled
  +int readStatisticsThreads
  +int writeStatisticsThreads
  +boolean isColumnStatisticsEnabled()
  +int getReadStatisticsThreads()
  +int getWriteStatisticsThreads()
}

class GlueMetastoreModule {
  +@Provides Executor createStatisticsReadExecutor(...)
  +@Provides Executor createStatisticsWriteExecutor(...)
}

GlueHiveMetastore --> GlueColumnStatisticsProvider
GlueHiveMetastore --> GlueHiveMetastoreConfig
GlueMetastoreModule --> GlueHiveMetastoreConfig
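
As a rough illustration of the wiring in this diagram, the sketch below picks a provider variant from the config flag and sizes two executors from the thread counts. It is a sketch under stated assumptions: `WiringSketch`, `Config`, and `ProviderKind` are hypothetical, no Guice bindings or property names are shown, and fixed-size thread pools are a guess at how the executors might be bounded.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WiringSketch {
    // Mirrors the three getters shown for GlueHiveMetastoreConfig; values come from the caller.
    static final class Config {
        private final boolean columnStatisticsEnabled;
        private final int readStatisticsThreads;
        private final int writeStatisticsThreads;

        Config(boolean columnStatisticsEnabled, int readStatisticsThreads, int writeStatisticsThreads) {
            this.columnStatisticsEnabled = columnStatisticsEnabled;
            this.readStatisticsThreads = readStatisticsThreads;
            this.writeStatisticsThreads = writeStatisticsThreads;
        }

        boolean isColumnStatisticsEnabled() {
            return columnStatisticsEnabled;
        }

        int getReadStatisticsThreads() {
            return readStatisticsThreads;
        }

        int getWriteStatisticsThreads() {
            return writeStatisticsThreads;
        }
    }

    // Hypothetical marker for which provider implementation gets bound.
    enum ProviderKind { DEFAULT, DISABLED }

    static ProviderKind selectProvider(Config config) {
        return config.isColumnStatisticsEnabled() ? ProviderKind.DEFAULT : ProviderKind.DISABLED;
    }

    // Fixed-size pools are an assumption; the PR may bound its executors differently.
    static ExecutorService createStatisticsReadExecutor(Config config) {
        return Executors.newFixedThreadPool(config.getReadStatisticsThreads());
    }

    static ExecutorService createStatisticsWriteExecutor(Config config) {
        return Executors.newFixedThreadPool(config.getWriteStatisticsThreads());
    }
}
```

Keeping separate read and write pools means a burst of statistics writes cannot starve statistics reads issued during query planning, which is one plausible reason for the two thread settings.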

File-Level Changes

Change Details Files
Introduce GlueColumnStatisticsProvider abstraction and implementations
  • Define GlueColumnStatisticsProvider interface
  • Implement DefaultGlueColumnStatisticsProvider and DisabledGlueColumnStatisticsProvider
  • Add GlueStatConverter for converting between Presto and Glue stats
  • Introduce @ForGlueColumnStatisticsRead and @ForGlueColumnStatisticsWrite qualifiers
  • Extend GlueMetastoreStats with new metrics for column-stats operations
  • Add config options (columnStatisticsEnabled, read/write threads) in GlueHiveMetastoreConfig
  • Bind provider and new executors in GlueMetastoreModule
  Files: GlueColumnStatisticsProvider.java, DefaultGlueColumnStatisticsProvider.java, DisabledGlueColumnStatisticsProvider.java, GlueStatConverter.java, ForGlueColumnStatisticsRead.java, ForGlueColumnStatisticsWrite.java, GlueMetastoreStats.java, GlueHiveMetastoreConfig.java, GlueMetastoreModule.java
Wire column statistics into GlueHiveMetastore
  • Inject and initialize columnStatisticsProvider based on config
  • Delegate getSupportedColumnStatistics, getTableStatistics, getPartitionStatistics to provider
  • Update updateTableStatistics and updatePartitionStatistics to propagate basic and column stats in batches
  Files: GlueHiveMetastore.java
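
Propagating statistics "in batches" generally means splitting a list of updates into fixed-size chunks and sending one request per chunk. The helper below is a minimal generic sketch of that chunking step; `BatchSketch` and the batch size passed to it are hypothetical and say nothing about the limits the PR or the Glue API actually use.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSketch {
    // Splits items into consecutive batches of at most batchSize elements, preserving order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Guava's `Lists.partition` provides equivalent chunking and would be a natural choice in a codebase that already depends on Guava.
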
Extend partition-stats API to support multi-partition updates
  • Change updatePartitionStatistics signature to accept a Map of partitionName→update function
  • Update ExtendedHiveMetastore and all implementations (SemiTransactional, File, InMemoryCaching, Bridging, Recording) to use the new API
  Files: ExtendedHiveMetastore.java, SemiTransactionalHiveMetastore.java, FileHiveMetastore.java, InMemoryCachingHiveMetastore.java, BridgingHiveMetastore.java, RecordingHiveMetastore.java
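
The map-based signature described above can be sketched as follows. This is only an illustration of the shape: `MultiPartitionUpdateSketch` and `Stats` are hypothetical stand-ins (the real API works with `PartitionStatistics` and a metastore, not an in-memory map), and validating all partition names before applying any update is a design choice of this sketch.

```java
import java.util.Map;
import java.util.function.Function;

final class MultiPartitionUpdateSketch {
    // Hypothetical stand-in for PartitionStatistics; only a row count is modeled.
    static final class Stats {
        final long rowCount;

        Stats(long rowCount) {
            this.rowCount = rowCount;
        }
    }

    // Applies one update function per partition name in a single call.
    static void updatePartitionStatistics(
            Map<String, Stats> currentByPartition,
            Map<String, Function<Stats, Stats>> updates) {
        // Validate first so a missing partition does not leave a half-applied update.
        for (String partitionName : updates.keySet()) {
            if (!currentByPartition.containsKey(partitionName)) {
                throw new IllegalArgumentException("Unknown partition: " + partitionName);
            }
        }
        for (Map.Entry<String, Function<Stats, Stats>> entry : updates.entrySet()) {
            Stats current = currentByPartition.get(entry.getKey());
            currentByPartition.put(entry.getKey(), entry.getValue().apply(current));
        }
    }
}
```

Passing all partitions in one call is what lets an implementation group the underlying writes, instead of issuing one metastore round trip per partition.
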
Enhance partition fetching and executor use
  • Rename the partition-fetch executor to partitionsReadExecutor and use it for parallel fetches
  • Modify batchGetPartition to loop on unprocessedKeys to ensure progress
  Files: GlueHiveMetastore.java
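
A batch read that can return unprocessed keys is usually driven by a loop that resubmits only the leftovers until none remain. The sketch below shows that pattern with hypothetical types: `Client` and `Result` stand in for the Glue client and its batch response, keys are plain strings, and failing when a call makes no progress is an assumption of this sketch, not confirmed behavior of the PR.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchGetSketch {
    // Hypothetical response: entries that were returned plus keys the service did not process.
    static final class Result {
        final List<String> found;
        final List<String> unprocessedKeys;

        Result(List<String> found, List<String> unprocessedKeys) {
            this.found = found;
            this.unprocessedKeys = unprocessedKeys;
        }
    }

    // Hypothetical client with a single batch call.
    interface Client {
        Result batchGet(List<String> keys);
    }

    // Resubmits unprocessed keys until every key has been handled.
    static List<String> getAll(Client client, List<String> keys) {
        List<String> found = new ArrayList<>();
        List<String> pending = new ArrayList<>(keys);
        while (!pending.isEmpty()) {
            Result result = client.batchGet(pending);
            found.addAll(result.found);
            // Guard against spinning forever if the service processes nothing.
            if (result.unprocessedKeys.size() >= pending.size()) {
                throw new IllegalStateException("batch call made no progress");
            }
            pending = new ArrayList<>(result.unprocessedKeys);
        }
        return found;
    }
}
```

A production loop would likely add backoff between retries; that is omitted here to keep the control flow visible.
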
Update test suites for statistics support
  • Enable columnStatisticsEnabled in TestHiveClientGlueMetastore
  • Skip or adjust tests that assumed no column stats
  • Add new property mappings in TestGlueHiveMetastoreConfig
  • Adjust expected column-stats values in AbstractTestHiveClient
  Files: TestHiveClientGlueMetastore.java, TestGlueHiveMetastoreConfig.java, AbstractTestHiveClient.java

Co-authored-by: Deepak Majeti <[email protected]>
Co-authored-by: George Wang <[email protected]>