Add Parquet Modular encryption support (write) #7111

Open
wants to merge 49 commits into
base: main
from

Conversation

rok
Member

@rok rok commented Feb 10, 2025

Which issue does this PR close?

This PR is based on branch and an internal patch and /pull/6637, and aims to provide basic modular encryption support. It partially closes #3511. We decided to split the encryption work into a separate PR.

Rationale for this change

See #3511.

What changes are included in this PR?

TBD

Are there any user-facing changes?

Several new classes and method parameters are introduced. If the project is compiled without the encryption feature flag, the changes are not breaking. If the encryption feature flag is on, some methods and constructors (e.g. ParquetMetaData::new) will require new parameters, which would be a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 10, 2025
@rok rok force-pushed the encryption-basics-fork branch 5 times, most recently from 8e6a4be to d1ccf75 Compare February 17, 2025 18:52
@rok rok force-pushed the encryption-basics-fork branch 3 times, most recently from f6d1155 to 05d4d60 Compare February 21, 2025 12:18
@adamreeve adamreeve force-pushed the encryption-basics-fork branch 3 times, most recently from dad9642 to 5d52ef7 Compare March 4, 2025 07:08
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from 36d7c70 to bc75007 Compare March 4, 2025 19:06
@adamreeve adamreeve force-pushed the encryption-basics-fork branch 3 times, most recently from 24e70d9 to cc58a3a Compare March 6, 2025 00:11
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from 65fca4b to 44559b0 Compare March 6, 2025 11:26
@rok rok force-pushed the encryption-basics-fork branch 2 times, most recently from c5a692c to 87224ea Compare March 10, 2025 20:05
@rok rok force-pushed the encryption-basics-fork branch 4 times, most recently from cab7dd4 to de5feef Compare March 11, 2025 12:50
@alamb alamb changed the title Parquet Modular encryption support Add Parquet Modular encryption support (write) Mar 12, 2025
@rok rok force-pushed the encryption-basics-fork branch 3 times, most recently from 7c01f34 to e93a0e5 Compare March 12, 2025 22:43
@rok rok force-pushed the encryption-basics-fork branch from e93a0e5 to c685136 Compare March 12, 2025 22:46
@rok rok marked this pull request as ready for review March 12, 2025 22:52
@rok rok requested review from etseidl, alamb and adamreeve and removed request for etseidl and alamb March 12, 2025 23:10
Contributor

@adamreeve adamreeve left a comment


I haven't done a very thorough review but have left a few comments.

};

let mut page_header = page.to_thrift_header();
page_header.compressed_page_size = data.len() as i32;

Rather than overwriting the compressed page size here, we should probably encrypt the page before creating the header. That might require making the page mutable to allow overwriting its buf member? I'm not sure how best to handle this but it's a little hacky at the moment.
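
A minimal sketch of the ordering suggested above, under stated assumptions: `PageHeaderSketch`, `placeholder_encrypt` and `build_page` are hypothetical names, not types or functions from the parquet crate, and the "encryption" step only pads the buffer so that the size change is visible. The point is that the header is built from the final buffer, so nothing has to be overwritten afterwards.

```rust
/// Hypothetical, simplified stand-in for the Thrift page header.
#[derive(Debug)]
struct PageHeaderSketch {
    uncompressed_page_size: i32,
    compressed_page_size: i32,
}

/// Placeholder for page encryption. It is NOT a real cipher: it only appends
/// a fixed amount of padding so the output is longer than the input.
fn placeholder_encrypt(compressed: &[u8]) -> Vec<u8> {
    let mut out = compressed.to_vec();
    out.extend_from_slice(&[0u8; 8]); // arbitrary overhead, for illustration only
    out
}

/// Encrypt (if requested) first, then derive the header from the final buffer.
fn build_page(
    uncompressed_len: usize,
    compressed: &[u8],
    encrypt: bool,
) -> (PageHeaderSketch, Vec<u8>) {
    let buf = if encrypt {
        placeholder_encrypt(compressed)
    } else {
        compressed.to_vec()
    };
    let header = PageHeaderSketch {
        uncompressed_page_size: uncompressed_len as i32,
        // The size is taken from the bytes that will actually be written.
        compressed_page_size: buf.len() as i32,
    };
    (header, buf)
}
```

With this ordering the header always agrees with the written buffer, whether or not encryption is enabled.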

page_header.compressed_page_size = data.len() as i32;

let mut header = Vec::with_capacity(1024);
match self.page_encryptor.as_ref() {

I wonder if we could tidy this up by having something like a PageModuleWriter trait that PageEncryptor could implement, and also have a non-encrypted implementation, to avoid the need for all the #[cfg(feature = "encryption")] attributes.
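
One possible shape for such a trait, as a rough sketch only. The trait name comes from the comment above; the method signature, `PlainModuleWriter`, `XorModuleWriter` and `write_page` are assumptions made for illustration, and the XOR step is a placeholder rather than real encryption.

```rust
use std::io::{self, Write};

/// Hypothetical trait: turns a serialized page module (header or data)
/// into the bytes that are actually written to the sink.
trait PageModuleWriter {
    fn write_module<W: Write>(&mut self, plaintext: &[u8], sink: &mut W) -> io::Result<usize>;
}

/// Pass-through implementation for files written without encryption.
struct PlainModuleWriter;

impl PageModuleWriter for PlainModuleWriter {
    fn write_module<W: Write>(&mut self, plaintext: &[u8], sink: &mut W) -> io::Result<usize> {
        sink.write_all(plaintext)?;
        Ok(plaintext.len())
    }
}

/// Stand-in for an encrypting implementation such as PageEncryptor.
/// XOR with a single byte is a placeholder only, NOT real encryption.
struct XorModuleWriter {
    key: u8,
}

impl PageModuleWriter for XorModuleWriter {
    fn write_module<W: Write>(&mut self, plaintext: &[u8], sink: &mut W) -> io::Result<usize> {
        let out: Vec<u8> = plaintext.iter().map(|b| b ^ self.key).collect();
        sink.write_all(&out)?;
        Ok(out.len())
    }
}

/// The page-writing path is generic over the trait, so it needs no
/// feature-gated branches of its own.
fn write_page<M: PageModuleWriter, W: Write>(
    module_writer: &mut M,
    header: &[u8],
    data: &[u8],
    sink: &mut W,
) -> io::Result<usize> {
    let mut written = module_writer.write_module(header, sink)?;
    written += module_writer.write_module(data, sink)?;
    Ok(written)
}
```

In a design like this, only the choice of implementation would need to sit behind the feature flag, while the writer itself stays identical in both builds.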


#[cfg(feature = "encryption")]
#[test]
fn test_uniform_encryption_roundtrip() {

We should move these tests out to a test library, similar to #7279.

columns_missing_in_schema.sort();
return Err(ParquetError::General(
format!(
"Column {} not found in schema",

This error could probably provide a bit more context, maybe something like:

Suggested change
- "Column {} not found in schema",
+ "The following columns with encryption keys specified were not found in the schema: {}",

}
}

pub fn with_plaintext_footer(mut self, plaintext_footer: bool) -> Self {

These methods and the ones on FileEncryptionProperties are probably the main ones users will need to understand for encryption, so we should document them all.

&self.properties
}

pub fn file_aad(&self) -> &[u8] {

We should document the difference between file_aad and aad_file_unique.
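
For reference while documenting this: in the Parquet Modular Encryption specification, aad_file_unique is a random value generated for each file, and the file AAD is formed by concatenating the optional user-supplied AAD prefix with it. A minimal sketch of that relationship follows; `build_file_aad` is a hypothetical helper, not a function from this PR, and the PR's own construction should be checked against the spec.

```rust
/// Hypothetical helper illustrating the relationship described in the
/// Parquet Modular Encryption spec: file AAD = AAD prefix (if any)
/// followed by the per-file random aad_file_unique bytes.
fn build_file_aad(aad_prefix: Option<&[u8]>, aad_file_unique: &[u8]) -> Vec<u8> {
    let mut file_aad = Vec::new();
    if let Some(prefix) = aad_prefix {
        file_aad.extend_from_slice(prefix);
    }
    file_aad.extend_from_slice(aad_file_unique);
    file_aad
}
```

Without a prefix the two values are identical, which may be why the names are easy to confuse.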


#[derive(Debug, Clone, PartialEq)]
pub struct FileEncryptionProperties {
encrypt_footer: bool,

I think encrypt_footer is just ignored at the moment. We should raise an error somewhere if we try to write with it set to false.
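
A sketch of the kind of guard this could be, assuming it runs when the writer is constructed. `EncryptionPropsSketch` and `check_footer_mode` are hypothetical names, and a plain `String` stands in for the crate's `ParquetError`.

```rust
/// Hypothetical, reduced stand-in for FileEncryptionProperties.
#[derive(Debug, Clone, PartialEq)]
struct EncryptionPropsSketch {
    encrypt_footer: bool,
}

/// Reject configurations the writer cannot honor yet instead of silently
/// ignoring them. Real code would return a ParquetError, not a String.
fn check_footer_mode(props: &EncryptionPropsSketch) -> Result<(), String> {
    if !props.encrypt_footer {
        return Err(
            "Writing encrypted files with a plaintext footer is not supported".to_string(),
        );
    }
    Ok(())
}
```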


/// Checks if columns that are to be encrypted are present in schema
#[cfg(feature = "encryption")]
pub(crate) fn encrypted_columns_in_schema(

This sounds a bit like it's getting the encrypted columns, maybe something like validate_encrypted_column_names would be a better name?
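
To illustrate how the proposed name reads at a call site, here is a self-contained sketch of such a check. The signature is an assumption (string slices instead of the crate's schema types, `String` instead of `ParquetError`); the error text is the wording suggested earlier in this review.

```rust
use std::collections::HashSet;

/// Hypothetical validation: every column that has an encryption key
/// configured must exist in the schema.
fn validate_encrypted_column_names(
    schema_columns: &[&str],
    encrypted_columns: &[&str],
) -> Result<(), String> {
    let known: HashSet<&str> = schema_columns.iter().copied().collect();
    let mut missing: Vec<&str> = encrypted_columns
        .iter()
        .copied()
        .filter(|c| !known.contains(c))
        .collect();
    if missing.is_empty() {
        return Ok(());
    }
    // Sort so the message is deterministic regardless of input order.
    missing.sort();
    Err(format!(
        "The following columns with encryption keys specified were not found in the schema: {}",
        missing.join(", ")
    ))
}
```

A name starting with `validate_` signals that the function returns a `Result` and no column data, which is the ambiguity raised above.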

Ok(Self {
buf,
schema: schema.clone(),
descr: Arc::new(SchemaDescriptor::new(schema)),
- props: properties,
+ props: properties.clone(),

I don't think this clone is needed

Comment on lines +609 to +614
#[cfg(feature = "encryption")]
let page_writer =
SerializedPageWriter::new(buf).with_page_encryptor(page_encryptor);

#[cfg(not(feature = "encryption"))]
let page_writer = SerializedPageWriter::new(buf);

Suggested change
- #[cfg(feature = "encryption")]
- let page_writer =
- SerializedPageWriter::new(buf).with_page_encryptor(page_encryptor);
- #[cfg(not(feature = "encryption"))]
- let page_writer = SerializedPageWriter::new(buf);
+ let page_writer = SerializedPageWriter::new(buf);
+ #[cfg(feature = "encryption")]
+ let page_writer = page_writer.with_page_encryptor(page_encryptor);

Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Parquet Modular Encryption support
4 participants