-
Notifications
You must be signed in to change notification settings - Fork 892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Parquet Modular encryption support (write) #7111
base: main
Are you sure you want to change the base?
Conversation
8e6a4be
to
d1ccf75
Compare
f6d1155
to
05d4d60
Compare
dad9642
to
5d52ef7
Compare
36d7c70
to
bc75007
Compare
24e70d9
to
cc58a3a
Compare
65fca4b
to
44559b0
Compare
c5a692c
to
87224ea
Compare
cab7dd4
to
de5feef
Compare
7c01f34
to
e93a0e5
Compare
e93a0e5
to
c685136
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done a very thorough review but have left a few comments.
}; | ||
|
||
let mut page_header = page.to_thrift_header(); | ||
page_header.compressed_page_size = data.len() as i32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than overwriting the compressed page size here, we should probably encrypt the page before creating the header. That might require making the page mutable to allow overwriting its buf
member? I'm not sure how best to handle this but it's a little hacky at the moment.
page_header.compressed_page_size = data.len() as i32; | ||
|
||
let mut header = Vec::with_capacity(1024); | ||
match self.page_encryptor.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could tidy this up by having something like a PageModuleWriter
trait that PageEncryptor
could implement, and also have a non-encrypted implementation to avoid the need for all the #[cfg(feature = "encryption")]
|
||
#[cfg(feature = "encryption")] | ||
#[test] | ||
fn test_uniform_encryption_roundtrip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move these tests out to a test library, similar to #7279
columns_missing_in_schema.sort(); | ||
return Err(ParquetError::General( | ||
format!( | ||
"Column {} not found in schema", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error could probably provide a bit more context, maybe something like:
"Column {} not found in schema", | |
"The following columns with encryption keys specified were not found in the schema: {}", |
} | ||
} | ||
|
||
pub fn with_plaintext_footer(mut self, plaintext_footer: bool) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods and the ones on FileEncryptionProperties
are probably the main ones users will need to understand for encryption, so we should document them all.
&self.properties | ||
} | ||
|
||
pub fn file_aad(&self) -> &[u8] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document the difference between file_aad
and aad_file_unique
|
||
#[derive(Debug, Clone, PartialEq)] | ||
pub struct FileEncryptionProperties { | ||
encrypt_footer: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think encrypt_footer
is just ignored at the moment. We should raise an error somewhere if we try to write with it set to false.
|
||
/// Checks if columns that are to be encrypted are present in schema | ||
#[cfg(feature = "encryption")] | ||
pub(crate) fn encrypted_columns_in_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds a bit like it's getting the encrypted columns, maybe something like validate_encrypted_column_names
would be a better name?
Ok(Self { | ||
buf, | ||
schema: schema.clone(), | ||
descr: Arc::new(SchemaDescriptor::new(schema)), | ||
props: properties, | ||
props: properties.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this clone is needed
#[cfg(feature = "encryption")] | ||
let page_writer = | ||
SerializedPageWriter::new(buf).with_page_encryptor(page_encryptor); | ||
|
||
#[cfg(not(feature = "encryption"))] | ||
let page_writer = SerializedPageWriter::new(buf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[cfg(feature = "encryption")] | |
let page_writer = | |
SerializedPageWriter::new(buf).with_page_encryptor(page_encryptor); | |
#[cfg(not(feature = "encryption"))] | |
let page_writer = SerializedPageWriter::new(buf); | |
let page_writer = SerializedPageWriter::new(buf); | |
#[cfg(feature = "encryption")] | |
let page_writer = page_writer.with_page_encryptor(page_encryptor); |
Which issue does this PR close?
This PR is based on branch and an internal patch and /pull/6637 and aims to provide basic modular encryption support. Partially closes #3511. We decided to split encryption work into a separate PR.
Rationale for this change
See #3511.
What changes are included in this PR?
TBD
Are there any user-facing changes?
Several new classes and method parameters are introduced. If project is compiled without
encryption
flag changes are not breaking. Ifencryption
flag is on some methods and constructors (e.g.ParquetMetaData::new
) will require new parameters which would be a breaking change.