Skip to content

Add Parquet decryption support for Hive tables #24517

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

sopel39
Copy link
Member

@sopel39 sopel39 commented Dec 18, 2024

Description

Adds support to read Hive tables with encrypted Parquet files.
PS: This PR is work in progress and we are adding tests to it.

Additional context and related issues

Parquet added support for encryption https://parquet.apache.org/docs/file-format/data-pages/encryption/. Spark also added support to read and write tables with parquet encrypted files. In this PR we are adding support to read Hive tables with encrypted Parquet files with Trino.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Dec 18, 2024
@github-actions github-actions bot added hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Dec 18, 2024
@electrum
Copy link
Member

This is loading classes dynamically based on class names in config. We should use the standard Trino pattern of having explicitly enumerated providers, each with their own strongly typed config classes.

@sopel39
Copy link
Member Author

sopel39 commented Dec 18, 2024

This is loading classes dynamically based on class names in config. We should use the standard Trino pattern of having explicitly enumerated providers, each with their own strongly typed config classes.

Yes. I just rebased for now and resolved conflicts

Copy link

This pull request has gone a while without any activity. Tagging for triage help: @mosabua

@github-actions github-actions bot added the stale label Feb 10, 2025
Copy link

github-actions bot commented Mar 3, 2025

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Mar 3, 2025
@sopel39 sopel39 reopened this Apr 2, 2025
@github-actions github-actions bot added redshift Redshift connector and removed stale labels Apr 2, 2025
@sopel39 sopel39 force-pushed the ks/pme branch 3 times, most recently from 990f6f2 to 07025cc Compare April 17, 2025 15:29
@sopel39 sopel39 changed the title [WIP] Add Parquet decryption support for Hive tables Add Parquet decryption support for Hive tables Apr 18, 2025
EnvironmentDecryptionKeyRetriever is added as a key provider initially.
@sopel39 sopel39 marked this pull request as ready for review April 22, 2025 14:33
@sopel39 sopel39 requested a review from raunaqmorarka April 22, 2025 14:33
@sopel39
Copy link
Member Author

sopel39 commented Apr 22, 2025

ptal @ggershinsky

@sopel39
Copy link
Member Author

sopel39 commented Apr 22, 2025

ptal @dfangs @shangxinli

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request adds support for reading Hive tables with encrypted Parquet files by introducing new cryptography utilities and updating the data page structure to include a page index. Key changes include:

  • New crypto classes and methods for AES GCM/CTR encryption and decryption (e.g. AesGcmEncryptor/Decryptor, AesCtrEncryptor/Decryptor).
  • Introduction of a FileDecryptionContext to manage per-column decryption state and key retrieval.
  • Updates to DataPage, DataPageV1, and DataPageV2 to include a new pageIndex field for tracking page order.

Reviewed Changes

Copilot reviewed 72 out of 73 changed files in this pull request and generated 1 comment.

File Description
lib/trino-parquet/src/main/java/io/trino/parquet/crypto/* Implementation of new encryption/decryption utilities and context management for Parquet file decryption
lib/trino-parquet/src/main/java/io/trino/parquet/* Updated page classes now include pageIndex to support the new decryption features
Files not reviewed (1)
  • lib/trino-parquet/pom.xml: Language not supported
Comments suppressed due to low confidence (1)

lib/trino-parquet/src/main/java/io/trino/parquet/crypto/FileDecryptionContext.java:118

  • Ensure comprehensive unit tests cover both scenarios—when a column is decrypted with the footer key and when a column-specific key is used—including edge cases where keys are missing or invalid.
public Optional<ColumnDecryptionContext> initializeColumnCryptoMetadata(ColumnPath path, boolean encryptedWithFooterKey, Optional<byte[]> columnKeyMetadata, int columnOrdinal)

// AES_GCM_CTR_V1
if (columnKey.isEmpty()) {
// Decryptor with footer key
if (aesCtrDecryptorWithFooterKey == null) {
Copy link
Preview

Copilot AI Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cached decryptor instances (aesCtrDecryptorWithFooterKey and similarly aesGcmDecryptorWithFooterKey) are stored and reused. If these decryptors are used concurrently, they may exhibit thread-safety issues; consider ensuring thread confinement or creating a new instance per decryption operation.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector redshift Redshift connector
Development

Successfully merging this pull request may close these issues.

2 participants