feat: Add write_parquet_file to ParquetHandler
#1392
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #1392      +/-   ##
==========================================
+ Coverage   84.65%   84.75%   +0.09%
==========================================
  Files         115      115
  Lines       29557    29858     +301
  Branches    29557    29858     +301
==========================================
+ Hits        25021    25305     +284
  Misses       3329     3329
- Partials     1207     1224      +17
kernel/src/lib.rs
Outdated
/// # Returns
///
/// A [`DeltaResult`] indicating success or failure.
fn write_parquet_file(&self, url: url::Url, data: Box<dyn EngineData>) -> DeltaResult<()>;
Would it make sense for the API to take in write options, e.g. compression, row group size, etc.?
Compression would make sense to me, but row group size is often more complex: some writers take the number of rows, while others take the size in bytes. Alternatively, we could let the engine decide on this.
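For illustration, an options parameter could be a small struct so it can grow later without breaking the signature. This is purely a sketch, and ParquetWriteOptions / Compression are hypothetical names, not existing kernel types:

// Hypothetical sketch only -- not part of this PR. The names ParquetWriteOptions
// and Compression are illustrative, not existing kernel types.
#[derive(Debug, Default, Clone)]
pub struct ParquetWriteOptions {
    /// Compression codec applied to all columns; None means "engine default".
    pub compression: Option<Compression>,
    // Row-group sizing is deliberately omitted: writers disagree on whether it
    // should be expressed in rows or bytes, so it stays an engine decision.
}

#[derive(Debug, Clone, Copy)]
pub enum Compression {
    Uncompressed,
    Snappy,
    Zstd,
}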
kernel/src/lib.rs
Outdated
/// # Returns
///
/// A [`DeltaResult`] indicating success or failure.
fn write_parquet_file(&self, url: url::Url, data: Box<dyn EngineData>) -> DeltaResult<()>;
Shouldn't data be an iterator of FilteredEngineData, since that is what the checkpoint producer produces?
Box<dyn Iterator<Item = DeltaResult<FilteredEngineData>>>
After digging a bit more into the code, I think this makes sense. Having the writer do the filtering was not immediately obvious to me, but it looks like we are also delegating that to the engine.
Yeah, it avoids a copy in cases where the kernel has to filter out some rows. Also consistent with the existing JSON write API.
Yes, I agree. I think it would be nice to have a convenience From implementation to convert EngineData into FilteredEngineData: #1397
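A rough sketch of what such a conversion could look like; this assumes, purely for illustration, that FilteredEngineData has data and selection_vector fields and that an empty selection vector means "keep all rows" (the actual shape and semantics are whatever #1397 lands):

// Illustrative only -- the real change lives in #1397. Field names and the
// "empty selection vector means keep everything" convention are assumptions.
impl From<Box<dyn EngineData>> for FilteredEngineData {
    fn from(data: Box<dyn EngineData>) -> Self {
        FilteredEngineData {
            data,
            // No rows filtered out.
            selection_vector: vec![],
        }
    }
}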
    read_files(files, schema, predicate, try_create_from_parquet)
}

fn write_parquet_file(
A lot of the machinery in this and the default client looks the same; could we pull it out into a pub(crate) fn so they share the same logic? Or is there a reason they are separate?
I think they are different, and it would be good to keep them separate:
- SyncParquetHandler writes directly to a file, which makes sense since it only supports the local FS.
- DefaultParquetHandler first buffers everything in memory, and then pushes it to the object store.
They look pretty similar today, but I think they might diverge more in the future when we start doing more optimizations (a rough sketch of the difference follows below).
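To make the distinction concrete, here is a simplified sketch of the two shapes. It is not the actual kernel code; it assumes an Arrow RecordBatch on the engine side and a recent object_store where put takes a PutPayload:

// Simplified illustration of the two write paths described above -- not the
// actual kernel implementations.
use std::fs::File;
use std::sync::Arc;

use arrow_array::RecordBatch;
use object_store::{path::Path, ObjectStore, PutPayload};
use parquet::arrow::ArrowWriter;

// Sync handler shape: stream straight into a local file.
fn write_local(path: &std::path::Path, batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create(path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?; // writes the Parquet footer
    Ok(())
}

// Default handler shape: encode into an in-memory buffer, then a single put()
// to the object store.
async fn write_object_store(
    store: Arc<dyn ObjectStore>,
    location: Path,
    batch: &RecordBatch,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    store.put(&location, PutPayload::from(buffer)).await?;
    Ok(())
}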
kernel/src/lib.rs
Outdated
fn write_parquet_file(
    &self,
    url: url::Url,
    data: Box<dyn Iterator<Item = DeltaResult<FilteredEngineData>> + Send + '_>,
Why not just data: Box<dyn EngineData>?
- Why Iterator?
- Why FilteredEngineData?
- Why Iterator?
For future proofing: you could write chunks of data that are larger than memory. By having an iterator, you can stream these into the Parquet file. Arrow has a similar concept with ChunkedArray, where each chunk becomes a row group. I think we want to mimic that a bit here (see the sketch after this list).
- Why FilteredEngineData?
This was also not my first thought (see #1392 (comment)). But it nicely aligns with the JSON API. To make the syntax a bit more friendly and reduce the visual noise, I've suggested implementing the From trait to easily convert EngineData into FilteredEngineData: #1397.
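A minimal sketch of that streaming idea, assuming an Arrow-based writer on the engine side (illustrative only, not the kernel's implementation; flush() ends the current row group):

// Illustrative sketch: stream an iterator of batches into one Parquet file,
// one row group per batch.
use parquet::arrow::ArrowWriter;

fn write_streaming(
    file: std::fs::File,
    mut batches: impl Iterator<Item = arrow_array::RecordBatch>,
) -> parquet::errors::Result<()> {
    let first = match batches.next() {
        Some(b) => b,
        None => return Ok(()), // nothing to write
    };
    let mut writer = ArrowWriter::try_new(file, first.schema(), None)?;
    writer.write(&first)?;
    writer.flush()?; // close the current row group before the next batch
    for batch in batches {
        writer.write(&batch)?;
        writer.flush()?;
    }
    writer.close()?; // write the footer
    Ok(())
}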
Maybe I'm missing something, but I think we still need to go through all batches to identify the final schema; we're not guaranteed that it will be the same for all actions. Also, I believe you are still collecting everything into memory here: https://github.com/delta-io/delta-kernel-rs/pull/1392/files#diff-e05e7b3b94c5637bfc367192986135a7a8a3986c34dc1b22cfd4961647ce7664R64, so we still haven't addressed the potential problem of "it might not fit into memory".
With FilteredEngineData, I see we can use "selection_vector" – this is a good feature, I agree. Are we aware of cases where we can use it currently?
With FilteredEngineData, I see we can use "selection_vector" – this is a good feature, I agree. Are we aware of cases where we can use it currently?
We use it in the proposed remove_files PR (#1390)
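For context, applying a selection vector before a write could look roughly like this on the Arrow side (illustrative only; the filter_to_record_batch helper quoted further down is where this PR actually does it):

// Illustrative only: drop unselected rows before the write with Arrow's
// filter kernel. Not the PR's actual helper.
use arrow_array::{BooleanArray, RecordBatch};
use arrow_schema::ArrowError;
use arrow_select::filter::filter_record_batch;

fn apply_selection(batch: &RecordBatch, selection: &[bool]) -> Result<RecordBatch, ArrowError> {
    // true keeps the row, false drops it; assumes one entry per row.
    let mask = BooleanArray::from(selection.to_vec());
    filter_record_batch(batch, &mask)
}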
Thanks for the pointer. For JSON the schema can change per row, but this can't be the case for Parquet. I've updated the code to remove the iterator for now. @anoopj WDYT?
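Presumably the signature then collapses to a single FilteredEngineData per file, something along these lines (a sketch based on the signatures quoted above, not necessarily the final API):

/// Write data to a Parquet file at the specified URL.
fn write_parquet_file(
    &self,
    url: url::Url,
    data: FilteredEngineData,
) -> DeltaResult<()>;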
    .try_collect()
    .unwrap();

assert_eq!(data.len(), 1);
I think we also need to verify that field IDs are populated and that we can project based on field IDs for column indirection? Will this be a follow-up?
I'm not a delta-kernel-rs maintainer, but from my POV this PR looks good.
    predicate: Option<PredicateRef>,
) -> DeltaResult<FileDataReadResultIterator>;

/// Write data to a Parquet file at the specified URL.
We should specify the semantics around what to do if the file already exists.
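For example, the doc comment could state it explicitly. This is only a sketch of the kind of wording; whether the contract is fail-if-exists or overwrite is exactly the open question, and the data parameter follows the single-FilteredEngineData shape discussed above:

/// Write data to a Parquet file at the specified URL.
///
/// Sketch of the kind of contract to document here, e.g.: "If an object already
/// exists at `url`, this call returns an error and leaves the existing file
/// untouched" -- or, alternatively, "any existing object at `url` is overwritten".
fn write_parquet_file(&self, url: url::Url, data: FilteredEngineData) -> DeltaResult<()>;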
// Convert FilteredEngineData to RecordBatch, applying selection filter
let batch = filter_to_record_batch(data)?;

// We buffer it in the application first, and then push everything to the object-store.
We should use the async_writer
As they note on that page: object_store provides its native implementation of AsyncFileWriter via ParquetObjectWriter.
So you could do something like:
let path = Path::from_url_path(location.path())?;
let object_writer = ParquetObjectWriter::new(self.store.clone(), path);
let mut writer = AsyncArrowWriter::try_new(
    object_writer,
    batch.schema(),
    None, // could be some props if needed
)?;
// Block on the async write, then close to flush buffered data and write the footer
self.task_executor.block_on(async move {
    writer.write(&batch).await?;
    writer.close().await
})?;
What changes are proposed in this pull request?
Hey everyone, this is a first PR to start the discussion around writing Parquet files.
Currently, the way to write Parquet is to completely delegate this to the engine, for example here: https://github.com/dl-rs-private/delta-kernel-rs/blob/a096d013f876ed29beef9379cf4cd713e9febd90/kernel/src/checkpoint/mod.rs#L44
Some things to consider:
- In the DefaultParquetHandler there is already write_parquet (delta-kernel-rs/kernel/src/engine/default/parquet.rs, lines 142 to 151 in 29a934a). But that one is very much focused on writing DataFiles, which is not something we really need if we want to write generic Parquet (for example a checkpoint).
- () as a return type, so we can extend that later on. We could also return things like the size, but that would introduce another HEAD request, and we need to consider whether that's something we really need.
- A ParquetWriter that consumes batches of EngineData. For the snapshot, this is not a requirement.

Resolves #1376
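For illustration, a caller such as the checkpoint path could then hand the already filtered actions straight to the handler. The surrounding names here are hypothetical, and the data parameter reflects the single-FilteredEngineData shape discussed in the review:

// Hypothetical usage sketch -- function and variable names are illustrative,
// not taken from this PR.
fn write_checkpoint_file(
    parquet_handler: &dyn ParquetHandler,
    checkpoint_url: url::Url,
    actions: FilteredEngineData, // checkpoint actions with any selection vector already attached
) -> DeltaResult<()> {
    parquet_handler.write_parquet_file(checkpoint_url, actions)
}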
This PR affects the following public APIs
Introduces a new public API, and extends an existing trait.
How was this change tested?