@corwinjoy
Contributor

Description

This PR adds encryption support and other advanced file options to delta-rs by implementing a comprehensive framework for file format settings. The changes enable users to configure encryption settings, customize writer properties, and apply file-level formatting options when reading and writing Delta tables.

  • Introduces a FileFormatOptions trait and related infrastructure to handle file-specific configurations
  • Adds support for both simple property-based encryption and KMS-based encryption through new factory patterns
  • Updates all operation builders to accept and propagate file format options throughout the write/read pipeline

In general, we have added a new trait called FileFormatOptions at the root DeltaTable level to unify how files within a delta table are read and written with specific formatting. The idea is that you can apply these settings once, at the top level, and then seamlessly perform any operations with the necessary settings.

This PR leverages the DataFusion TableOptions structure to support format options for multiple underlying file formats. (The idea being that delta-rs may eventually want to support storage formats beyond Parquet, such as Vortex or Lance.) Additionally, it centralizes file format options in a single, consistent location. This avoids the current difficulty of having to set WriterProperties separately and then set reader properties as part of the SessionState. (This is in line with comments from @roeap about how file configuration might be improved: #3300 (comment).) We would also like to eventually extend this work to record these file configurations in the delta table properties. For example, if the files are encrypted, one could add a KMS configuration for where to retrieve encryption keys.

Review Suggestion

This PR turned out to be larger than we hoped, so apologies for that, but I don't know how to split it into smaller pieces.
When reviewing, we suggest starting with the file crates/core/src/table/file_format_options.rs to get an overview of the new file format trait that can be applied to delta tables.

Related Issue(s)

Support Parquet Modular Encryption:
#3300

Documentation

Parquet Modular Encryption: https://docs.google.com/document/d/1MUg1J7u5VdLkgejJ4ybzfZt1OmwhQkq2iGPxsn4gqLI/edit?tab=t.0#heading=h.34wvmhc1zdch

Attribution

This PR was created in collaboration with @adamreeve

@github-actions github-actions bot added the binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) labels Sep 29, 2025
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy
Contributor Author

Note that fully supporting Parquet encryption requires being able to get write and read properties per-file, which is why the existing ability to set WriterProperties isn't sufficient, and why WriterPropertiesFactory::create_writer_properties is called per file and requires a file path. This allows generating new random data encryption keys per file and performing tasks such as specifying a per-file AAD prefix or supporting the external storage of encryption keys that can be looked up using the file path.
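
To make the per-file requirement concrete, here is a minimal standard-library sketch. `WriterProps`, `PropsFactory`, `StaticFactory`, and `PerFileFactory` are hypothetical stand-ins, not the types from this PR or from the parquet crate, and the counter-based key id only models "a fresh key per file" (a real implementation would generate random key material).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for parquet `WriterProperties`; only the
/// encryption-related fields discussed above are modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterProps {
    /// Identifier of the data encryption key used for one file.
    pub key_id: String,
    /// AAD prefix that ties encrypted modules to one specific file.
    pub aad_prefix: String,
}

/// Hypothetical factory trait: properties are requested once per file path.
pub trait PropsFactory: Send + Sync {
    fn create_writer_properties(&self, file_path: &str) -> WriterProps;
}

/// Returns the same properties for every file (what a single shared
/// `WriterProperties` value amounts to).
pub struct StaticFactory(pub WriterProps);

impl PropsFactory for StaticFactory {
    fn create_writer_properties(&self, _file_path: &str) -> WriterProps {
        self.0.clone()
    }
}

/// Hands out a new key id and a path-specific AAD prefix for each file.
pub struct PerFileFactory {
    table_id: String,
    next_key: AtomicU64,
}

impl PerFileFactory {
    pub fn new(table_id: &str) -> Self {
        PerFileFactory {
            table_id: table_id.to_string(),
            next_key: AtomicU64::new(0),
        }
    }
}

impl PropsFactory for PerFileFactory {
    fn create_writer_properties(&self, file_path: &str) -> WriterProps {
        // Assumption: a counter stands in for random key generation.
        let n = self.next_key.fetch_add(1, Ordering::Relaxed);
        WriterProps {
            key_id: format!("{}-dek-{}", self.table_id, n),
            aad_prefix: format!("{}/{}", self.table_id, file_path),
        }
    }
}
```

With the static variant every file shares one key and one AAD prefix; the per-file variant is the only one of the two that can vary them by path, which is the capability the comment above describes.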

@corwinjoy
Contributor Author

@rtyler @roeap @alamb Tagging you here per our previous discussion on adding encryption support to delta-rs.

@corwinjoy corwinjoy changed the title from "feat: Add framework for File Format Options" to "feat: add framework for File Format Options" Sep 29, 2025
@rtyler rtyler self-assigned this Sep 30, 2025
@rtyler rtyler marked this pull request as draft September 30, 2025 13:18
@rtyler
Member

rtyler commented Sep 30, 2025

I have marked this pull request as draft. This does not compile as is, I can come back to it once it is able to compile and pass unit tests

@corwinjoy
Contributor Author

> I have marked this pull request as draft. This does not compile as is, I can come back to it once it is able to compile and pass unit tests

@rtyler OK. It seems that when I auto-merged the main branch it introduced a build error. I have resolved this and the code is once again building and passing unit tests.

@corwinjoy corwinjoy marked this pull request as ready for review October 1, 2025 01:31
Collaborator

@ion-elgreco ion-elgreco left a comment

I can see the benefit, but we really need to reduce the surface area of the changes being introduced.

Ok(DeltaTable::new_with_state(
this.log_store,
commit.snapshot(),
None,
Collaborator

Let's not change the function signature here and in the other builders

Contributor Author

Could you be more specific about what you are looking for? Checking the code base, I see 25 calls to this constructor, and in 10/25 of the cases, I need to pass file_format_options to maintain the needed settings. I guess I could eliminate the 15 cases where I pass None by splitting this into two named constructors...

Collaborator

You need to use DeltaTableConfig for this:

/// Configuration options for delta table
#[derive(Debug, Serialize, Deserialize, Clone, DeltaConfig)]
#[serde(rename_all = "camelCase")]
pub struct DeltaTableConfig {
/// Indicates whether DeltaTable should track files.
/// This defaults to `true`
///
/// Some append-only applications might have no need of tracking any files.
/// Hence, DeltaTable will be loaded with significant memory reduction.
pub require_files: bool,
/// Controls how many files to buffer from the commit log when updating the table.
/// This defaults to 4 * number of cpus
///
/// Setting a value greater than 1 results in concurrent calls to the storage api.
/// This can decrease latency if there are many files in the log since the
/// last checkpoint, but will also increase memory usage. Possible rate limits of the storage backend should
/// also be considered for optimal performance.
pub log_buffer_size: usize,
/// Control the number of records to read / process from the commit / checkpoint files
/// when processing record batches.
pub log_batch_size: usize,
#[serde(skip_serializing, skip_deserializing)]
#[delta(skip)]
/// When a runtime handler is provided, all IO tasks are spawn in that handle
pub io_runtime: Option<IORuntime>,
}

Contributor Author

@corwinjoy corwinjoy Oct 2, 2025

OK. I think you are suggesting that I add file_format_options to the config member of DeltaTable.
In fact, that was the approach I tried initially but I had to back that out and add this as a direct member for the following reasons:

  1. The file format options don't really seem to fit properly into DeltaTableConfig, since the existing options there seem to be more about managing the log file.
  2. We need to preserve the formatting options when going from DeltaTable to DeltaOps and back. Right now, the config gets lost when new_with_state is called. See below where the config gets reset to default:
    pub(crate) fn new_with_state(log_store: LogStoreRef, state: DeltaTableState) -> Self {

    We need to preserve these settings throughout any chained operations. This means we still need an extra parameter in new_with_state to preserve any existing config. Also, I felt it was cleaner to create a new entry and directly set and pass it. Possibly we could move this into DeltaTableConfig, but I think this may be more of a hindrance than a help.

@adamreeve It was a couple of months ago that we moved this, do you remember anything else from our discussion? See commit below:
666f0ba

Collaborator

@ion-elgreco ion-elgreco Oct 3, 2025

  1. Correct, but they are related to loading the table, which is where file format options (encryption) should belong as well
  2. DeltaTableConfig tags along with the snapshot:
    /// Get the table config which is loaded with of the snapshot
    pub fn load_config(&self) -> &DeltaTableConfig {
    self.snapshot.load_config()
    }

so that shouldn't be an issue to allow it to stay there when you do new_with_state
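
A minimal sketch of that argument, using hypothetical stand-in types (`TableConfig`, `Snapshot`, `Table`, and `commit` are illustrative names, not the delta-rs API): if the config is stored inside the snapshot, rebuilding a table from a new state carries the config along without an extra constructor parameter.

```rust
/// Hypothetical stand-in for `DeltaTableConfig`; the format options are
/// reduced to an optional string for illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TableConfig {
    pub require_files: bool,
    pub file_format_options: Option<String>,
}

/// Hypothetical snapshot that owns the config it was loaded with.
#[derive(Debug, Clone)]
pub struct Snapshot {
    version: i64,
    config: TableConfig,
}

impl Snapshot {
    pub fn new(version: i64, config: TableConfig) -> Self {
        Snapshot { version, config }
    }

    /// Mirrors the idea of `load_config()`: the config travels with the snapshot.
    pub fn load_config(&self) -> &TableConfig {
        &self.config
    }
}

/// Hypothetical table handle built purely from a state value.
pub struct Table {
    state: Snapshot,
}

impl Table {
    /// No separate config argument: it is read back from the state.
    pub fn new_with_state(state: Snapshot) -> Self {
        Table { state }
    }

    pub fn config(&self) -> &TableConfig {
        self.state.load_config()
    }

    pub fn version(&self) -> i64 {
        self.state.version
    }
}

/// Hypothetical operation: produces the next version and rebuilds the table.
pub fn commit(table: Table) -> Table {
    let mut next = table.state.clone();
    next.version += 1;
    Table::new_with_state(next)
}
```

Whether the real snapshot types preserve the config across every operation is something the PR itself would need to verify; the sketch only shows why no signature change is needed when they do.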

Contributor Author

OK. I think it makes sense to move this here. I will investigate this week. The main issue is whether I have to support serialization out of the gate to make this work / at what points config gets reconstructed from serialized properties. The reason why is that serialization of the FileFormatOptions will be a bit tricky because:

  1. I'm not sure if TableOptions fully supports serialization.
  2. For direct encryption and decryption properties, we will need to modify the serialization to make sure that passwords don't get serialized.

#[allow(clippy::too_many_arguments)]
async fn execute_non_empty_expr(
snapshot: &DeltaTableState,
file_format_options: Option<FileFormatRef>,
Collaborator

I think it would be better if this were pushed into the LoadConfig rather than passed through each function.

Contributor Author

Can you elaborate a bit here? There are a lot of these different execute functions (one for every option) and I agree that it would be nice if they had a common configuration structure rather than the somewhat long list of arguments they take. That would probably be independent of this PR. For now, we have just added an additional argument to pass the needed settings through to execution.

Collaborator

See ping above, it needs to be DeltaTableConfig

@roeap
Collaborator

roeap commented Oct 1, 2025

@corwinjoy - awesome to see this come to fruition! Will find some time to give this a review hopefully tomorrow.

At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :)

@corwinjoy
Contributor Author

@roeap

> At first glance one quick question. Do we see a way to "bundle" the datafusion specific stuff a bit more? It's a bit hard to keep track of all the individual flags while reviewing :)

What we did to minimize this dependency is define an abstract FileFormatOptions trait. Everything just passes around a FileFormatRef defined as Arc<dyn FileFormatOptions>. Then, only when needed, do we grab final table options or writer properties. Furthermore, we've gated these instances of getting final details behind three function calls in file_format_options.rs:

pub fn build_writer_properties_factory_ffo(
    file_format_options: Option<FileFormatRef>,
) -> Option<Arc<dyn WriterPropertiesFactory>> {...}

pub fn to_table_parquet_options_from_ffo(
    file_format_options: Option<&FileFormatRef>,
) -> Option<TableParquetOptions> {...}

pub fn state_with_file_format_options(
    state: SessionState,
    file_format_options: Option<&FileFormatRef>,
) -> DeltaResult<SessionState> {...}

There might be some ways to refine this further, but in general we've tried to isolate and abstract these file properties where possible and not require datafusion.
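
As a rough illustration of this pattern (an opaque trait object passed around, resolved in one place), here is a standard-library sketch. The trait body, `SimpleOptions`, and `resolve_options` are hypothetical simplifications; the real trait in file_format_options.rs exposes DataFusion and Parquet types rather than a string map.

```rust
use std::collections::BTreeMap;
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical, simplified version of the trait: format settings are
/// modeled as plain key/value pairs.
pub trait FileFormatOptions: Debug + Send + Sync {
    fn table_options(&self) -> BTreeMap<String, String>;
}

/// The opaque handle that callers pass around.
pub type FileFormatRef = Arc<dyn FileFormatOptions>;

/// One concrete implementation holding fixed settings.
#[derive(Debug)]
pub struct SimpleOptions {
    settings: BTreeMap<String, String>,
}

impl SimpleOptions {
    pub fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let settings = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        SimpleOptions { settings }
    }

    /// Erase the concrete type so callers only see the trait object.
    pub fn into_ref(self) -> FileFormatRef {
        Arc::new(self)
    }
}

impl FileFormatOptions for SimpleOptions {
    fn table_options(&self) -> BTreeMap<String, String> {
        self.settings.clone()
    }
}

/// The single place where the opaque handle is turned into concrete settings.
pub fn resolve_options(ffo: Option<&FileFormatRef>) -> Option<BTreeMap<String, String>> {
    ffo.map(|o| o.table_options())
}
```

Code between the entry point and `resolve_options` only ever clones the `Arc`, so it needs no knowledge of which file format or engine sits behind it.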

@corwinjoy
Contributor Author

@roeap From a user point of view, we've tried hard to make the settings as easy as possible. This can be seen in crates/deltalake/examples/basic_operations_encryption.rs, where we demonstrate different kinds of operations on tables. (We have a more formal unit test at crates/core/tests/commands_with_encryption.rs.) These code examples all look like ordinary operations; all we needed was a common function call when creating DeltaOps:

async fn ops_with_crypto(
    uri: &str,
    file_format_options: &FileFormatRef,
) -> Result<DeltaOps, DeltaTableError> {
    let prefix_uri = format!("file://{uri}");
    let url = Url::parse(&prefix_uri).unwrap();
    let ops = DeltaOps::try_from_uri(url).await?;
    Ok(ops.with_file_format_options(file_format_options.clone()))
}

Calling with_file_format_options is sufficient to apply the needed encryption settings for all operations.

# Conflicts:
#	crates/core/src/delta_datafusion/table_provider.rs
#	crates/core/src/operations/delete.rs
#	crates/core/src/operations/drop_constraints.rs
#	crates/core/src/operations/filesystem_check.rs
#	crates/core/src/operations/load.rs
#	crates/core/src/operations/merge/mod.rs
#	crates/core/src/operations/mod.rs
#	crates/core/src/operations/optimize.rs
#	crates/core/src/operations/restore.rs
#	crates/core/src/operations/update.rs
#	crates/core/src/operations/write/mod.rs
#	crates/core/tests/command_optimize.rs
#	crates/core/tests/integration_datafusion.rs
Signed-off-by: Corwin Joy <[email protected]>
# Conflicts:
#	crates/core/src/operations/optimize.rs
Signed-off-by: Corwin Joy <[email protected]>
@rtyler rtyler marked this pull request as draft October 5, 2025 17:50
@corwinjoy
Contributor Author

Still working on this to move the file config to DeltaTableConfig, but doing this in a separate branch to keep things clean. It's fairly tricky, so it will take a bit. I also plan to improve the unit tests to confirm that files are really being encrypted.

@github-actions github-actions bot removed the binding/python (Issues for the Python package) label Oct 13, 2025
# Conflicts:
#	crates/core/src/delta_datafusion/mod.rs
#	crates/core/src/delta_datafusion/table_provider.rs
#	crates/core/src/operations/delete.rs
#	crates/core/src/operations/load.rs
#	crates/core/src/operations/merge/mod.rs
#	crates/core/src/operations/optimize.rs
#	crates/core/src/operations/update.rs
#	crates/core/src/operations/write/execution.rs
#	crates/core/src/operations/write/mod.rs
#	crates/core/tests/command_optimize.rs
@corwinjoy
Contributor Author

corwinjoy commented Oct 14, 2025

@ion-elgreco OK. I have migrated these file options to the config property in DeltaTable. This definitely reduced the changes quite a bit so thanks for the suggestion! I think it looks pretty solid and look forward to your feedback when you are ready.
@adamreeve

@corwinjoy corwinjoy marked this pull request as ready for review October 14, 2025 21:39
}
Ok(self)
}

Contributor Author

@ion-elgreco We could use some feedback on the best way to set the config for an existing table. In this design, we wanted to:

  1. Be able to set the config at runtime, not just at table construction. This is important for fields like passwords, where we will not want them to be serialized, or other options that we may want to change at runtime.
  2. Make the user interface easy to use.
    With this design, setting the config for any operation looks like: (from crates/deltalake/examples/basic_operations_encryption.rs)
    let ops = DeltaOps::try_from_uri(url).await?;
    let ops = ops
        .with_file_format_options(file_format_options.clone())
        .await?;

We also considered a design where you could only set this via DeltaTableBuilder but I wanted to make this feature easy to use. See e.g. the following diff where this function is removed.

adamreeve@ec40881

@adamreeve

Contributor

I've just pushed a fix commit that handles the case where the load fails because the table isn't created yet: adamreeve@0d551a5

This matches the behaviour of DeltaOps::try_from_uri. Maybe there could be a DeltaOps::try_from_table or TryFrom<DeltaTable> for DeltaOps implementation to handle that scenario.
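
A sketch of what such a `TryFrom` conversion could look like, using hypothetical stand-in types (`LogState`, `Table`, `Ops`, `OpsError`, and `ops_for` are illustrative and not part of delta-rs). The point it models is the one described above: a table that does not exist yet is not an error, while other load failures are.

```rust
use std::convert::TryFrom;

/// Hypothetical outcome of looking for the table's log.
#[derive(Debug, Clone, PartialEq)]
pub enum LogState {
    /// No log found: the table has not been created yet.
    Missing,
    /// Log found at this version.
    Present(i64),
    /// Log found but unreadable.
    Corrupt(String),
}

/// Hypothetical stand-in for `DeltaTable`.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub uri: String,
    pub log: LogState,
}

/// Hypothetical stand-in for `DeltaOps`.
#[derive(Debug, PartialEq)]
pub struct Ops {
    pub table: Table,
    /// `None` means the table still has to be created by the first write.
    pub version: Option<i64>,
}

#[derive(Debug, PartialEq)]
pub enum OpsError {
    Load(String),
}

impl TryFrom<Table> for Ops {
    type Error = OpsError;

    fn try_from(table: Table) -> Result<Self, Self::Error> {
        let version = match &table.log {
            LogState::Present(v) => Some(*v),
            // Not created yet: still usable, e.g. for a create or first write.
            LogState::Missing => None,
            // Any other load failure is surfaced to the caller.
            LogState::Corrupt(msg) => return Err(OpsError::Load(msg.clone())),
        };
        Ok(Ops { table, version })
    }
}

/// Convenience wrapper so callers do not need the trait in scope.
pub fn ops_for(table: Table) -> Result<Ops, OpsError> {
    Ops::try_from(table)
}
```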

Collaborator

@ion-elgreco ion-elgreco Oct 20, 2025

I am having a hard time understanding this use case. Why wouldn't it be enough to update the DeltaTableConfig? I would assume that without setting the encryption you couldn't load the table at all.

@ion-elgreco
Collaborator

> @ion-elgreco OK. I have migrated these file options to the config property in DeltaTable. This definitely reduced the changes quite a bit so thanks for the suggestion! I think it looks pretty solid and look forward to your feedback when you are ready. @adamreeve

I'll do another review over the weekend

@codecov

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 85.25253% with 73 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.85%. Comparing base (9f67f9f) to head (c7281de).
⚠️ Report is 1 commits behind head on main.

Files with missing lines                        Patch %   Lines
crates/core/src/table/file_format_options.rs    89.09%    10 Missing and 8 partials ⚠️
crates/core/src/operations/encryption.rs        70.45%    10 Missing and 3 partials ⚠️
crates/core/src/operations/optimize.rs          85.07%    5 Missing and 5 partials ⚠️
crates/core/src/writer/record_batch.rs          65.38%    6 Missing and 3 partials ⚠️
crates/core/src/operations/write/writer.rs      91.07%    2 Missing and 3 partials ⚠️
crates/core/src/table/builder.rs                55.55%    4 Missing ⚠️
crates/core/src/operations/delete.rs            80.00%    2 Missing and 1 partial ⚠️
crates/core/src/operations/merge/mod.rs         71.42%    2 Missing ⚠️
crates/core/src/operations/update.rs            75.00%    2 Missing ⚠️
crates/core/src/operations/write/mod.rs         81.81%    2 Missing ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3794      +/-   ##
==========================================
- Coverage   73.99%   73.85%   -0.15%     
==========================================
  Files         148      153       +5     
  Lines       38904    39490     +586     
  Branches    38904    39490     +586     
==========================================
+ Hits        28788    29165     +377     
- Misses       8850     9028     +178     
- Partials     1266     1297      +31     

snapshot: &EagerSnapshot,
log_store: LogStoreRef,
session: &dyn Session,
file_format_options: Option<&FileFormatRef>,
Collaborator

This is not needed anymore, since you can call snapshot.load_config() to access the file_format_options.

snapshot: &EagerSnapshot,
log_store: LogStoreRef,
session: &dyn Session,
file_format_options: Option<&FileFormatRef>,
Collaborator

Same here: it's already available in the snapshot, so we can defer grabbing these from the load_config until the latest stage.

// Add path column
used_columns.push(logical_schema.index_of(scan_config.file_column_name.as_ref().unwrap())?);

let table_parquet_options = to_table_parquet_options_from_ffo(file_format_options);
Collaborator

@ion-elgreco ion-elgreco Oct 20, 2025

I'd prefer that we implement Into here between these two structs.
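
One way to read this suggestion, sketched with hypothetical stand-in structs (`FormatOptions`, `ParquetOptions`, and `scan_options` are illustrative names, not the PR's types): implement `From` for the target type, which provides `Into` automatically, and let call sites convert with `map`.

```rust
/// Hypothetical stand-in for the table-level file format options.
#[derive(Debug, Clone, PartialEq)]
pub struct FormatOptions {
    pub compression: String,
    pub encrypted: bool,
}

/// Hypothetical stand-in for the engine-facing Parquet options.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ParquetOptions {
    pub compression: String,
    pub encrypted: bool,
    pub pushdown_filters: bool,
}

/// Implementing `From` also gives `Into<ParquetOptions>` for `&FormatOptions`.
impl<'a> From<&'a FormatOptions> for ParquetOptions {
    fn from(o: &'a FormatOptions) -> Self {
        ParquetOptions {
            compression: o.compression.clone(),
            encrypted: o.encrypted,
            // Fields with no counterpart keep their defaults.
            ..Default::default()
        }
    }
}

/// Call sites no longer need a dedicated free function for the conversion.
pub fn scan_options(ffo: Option<&FormatOptions>) -> Option<ParquetOptions> {
    ffo.map(ParquetOptions::from)
}
```

If the real conversion can fail, `TryFrom` would be the closer fit; the thread does not say which applies.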

limit: Option<usize>,
files: Option<&'a [Add]>,
config: Option<DeltaScanConfig>,
parquet_options: Option<TableParquetOptions>,
Collaborator

I suggest we remove this here. We can move all this logic into DeltaScanConfigBuilder.build(); there we can inspect the TableLoadConfig and set the parquet options on the DeltaScanConfig.
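
A sketch of that arrangement with hypothetical stand-in types (`LoadConfig`, `Snapshot`, `ScanConfig`, and `ScanConfigBuilder` are illustrative, and their fields are assumptions): the builder's `build` step reads the options from the snapshot's load config, so nothing extra is threaded through the scan builder.

```rust
/// Hypothetical stand-in for the Parquet options attached to a scan.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ParquetOptions {
    pub encrypted: bool,
}

/// Hypothetical load config carrying the table-level format options.
#[derive(Debug, Clone, Default)]
pub struct LoadConfig {
    pub file_format_options: Option<ParquetOptions>,
}

/// Hypothetical snapshot exposing the config it was loaded with.
pub struct Snapshot {
    load_config: LoadConfig,
}

impl Snapshot {
    pub fn new(load_config: LoadConfig) -> Self {
        Snapshot { load_config }
    }

    pub fn load_config(&self) -> &LoadConfig {
        &self.load_config
    }
}

/// Hypothetical finished scan config.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanConfig {
    pub file_column_name: Option<String>,
    pub parquet_options: Option<ParquetOptions>,
}

/// Hypothetical builder: user-facing knobs only, no format options.
#[derive(Default)]
pub struct ScanConfigBuilder {
    file_column_name: Option<String>,
}

impl ScanConfigBuilder {
    pub fn new() -> Self {
        ScanConfigBuilder::default()
    }

    pub fn with_file_column_name(mut self, name: &str) -> Self {
        self.file_column_name = Some(name.to_string());
        self
    }

    /// The format options are picked up here, from the snapshot.
    pub fn build(&self, snapshot: &Snapshot) -> ScanConfig {
        ScanConfig {
            file_column_name: self.file_column_name.clone(),
            parquet_options: snapshot.load_config().file_format_options.clone(),
        }
    }
}
```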

@ion-elgreco ion-elgreco self-assigned this Oct 20, 2025
Comment on lines +758 to +768
if let Some(format_options) = &self.config.file_format_options {
format_options.update_session(session)?;
}
let filter_expr = conjunction(filters.iter().cloned());

let scan = DeltaScanBuilder::new(self.snapshot()?.snapshot(), self.log_store(), session)
.with_parquet_options(
crate::table::file_format_options::to_table_parquet_options_from_ffo(
self.config.file_format_options.as_ref(),
),
)
Collaborator

This can be removed since this will happen inside the DeltaScanBuilder with my suggestion above

config: DeltaScanConfig,
schema: Arc<Schema>,
files: Option<Vec<Add>>,
file_format_options: Option<FileFormatRef>,
Collaborator

Same here, not needed anymore, since we can pass it through the DeltaScanConfig

Ok((operation, metrics))
}

async fn get_file_decryption_properties(
Collaborator

Why can this return None as well?

}
}

/// Extension trait to obtain a `WriterPropertiesBuilder` from an existing
Collaborator

56.2.0 has landed so this can be removed

}
}

pub fn build_writer_properties_factory_ffo(
Collaborator

Do we really need a function for this 🤷

file_format_options: Option<FileFormatRef>,
) -> WriterPropertiesFactoryRef {
build_writer_properties_factory_ffo(file_format_options)
.unwrap_or_else(|| build_writer_properties_factory_default())
Collaborator

unwrap_or_default is more idiomatic
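
For reference, a small sketch of the idiom on a hypothetical `WriterSettings` type (the type and its default values are made up for illustration). `Option::unwrap_or_default` requires the contained type to implement `Default`; whether that holds for `WriterPropertiesFactoryRef` is not stated in this thread.

```rust
/// Hypothetical settings type with illustrative default values.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterSettings {
    pub compression: String,
    pub max_row_group_size: usize,
}

impl Default for WriterSettings {
    fn default() -> Self {
        WriterSettings {
            compression: "snappy".to_string(),
            max_row_group_size: 1024 * 1024,
        }
    }
}

/// Closure-based fallback, in the shape of the code under review.
pub fn settings_verbose(custom: Option<WriterSettings>) -> WriterSettings {
    custom.unwrap_or_else(|| WriterSettings::default())
}

/// Equivalent and shorter, relying on the `Default` impl.
pub fn settings_idiomatic(custom: Option<WriterSettings>) -> WriterSettings {
    custom.unwrap_or_default()
}
```

If the fallback has to stay a free function rather than a `Default` impl, passing it by name, as in `unwrap_or_else(build_writer_properties_factory_default)`, at least drops the redundant closure.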

}

#[cfg(feature = "datafusion")]
pub fn to_table_parquet_options_from_ffo(
Collaborator

The same applies here; I don't think we need a function for just a map().

@ion-elgreco
Collaborator

@corwinjoy this already looks better, but I think we can reduce the number of changed lines even more!

I still have to take a better look at why we need a WriterPropertiesFactory :s but maybe you can explain it shortly for me?

@adamreeve
Contributor

> I still have to take a better look at why we need a WriterPropertiesFactory :s but maybe you can explain it shortly for me?

Corwin is busy travelling this week so might be slow to reply, but I can help answer this part. For some use cases it might be fine to have a single WriterProperties instance for all files, but there are a few reasons why you could want to generate new encryption properties per file, and therefore need a factory:

  • To generate new random data encryption keys per file, as it's good security practice to limit how widely one key is used
  • To set a different AAD prefix per-file, which prevents attackers from being able to swap out encrypted modules between files and tamper with data
  • To handle schema changes, like adding columns, while specifying per-column encryption keys
  • To enable use of external key material, which is when you write a JSON file alongside each Parquet file containing the key metadata. This allows rotation of master keys without having to rewrite Parquet files; only the JSON files need to be rewritten. We don't implement this in the Rust parquet-key-management crate, but it's supported by the Java and C++/Python Parquet implementations.

This also aligns with the DataFusion EncryptionFactory trait that takes the file path as a parameter when creating encryption and decryption properties.

Labels

binding/rust Issues for the Rust crate

5 participants