refactor: async writer + multi-part #3255
Conversation
a1787c9 to 2353a21 (compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3255      +/-   ##
==========================================
- Coverage   72.14%   72.13%   -0.02%
==========================================
  Files         143      144       +1
  Lines       45668    45748      +80
  Branches    45668    45748      +80
==========================================
+ Hits        32949    33002      +53
- Misses      10634    10656      +22
- Partials     2085     2090       +5
self.object_store.put(&path, buffer.into()).await?;
let mut multi_part_upload = self.object_store.put_multipart(&path).await?;
let part_size = upload_part_size();
let mut tasks = JoinSet::new();
let max_concurrent_tasks = 10; // TODO: make configurable
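For context, the pattern the new code points at is roughly the one sketched below: slice the buffered bytes into parts, spawn each part upload onto a `JoinSet`, and cap the number of in-flight uploads. This is a minimal sketch against the `object_store` 0.10+ `MultipartUpload` API, not the PR's actual code; `upload_in_parts`, the slicing loop, and the error handling are illustrative.

```rust
use bytes::Bytes;
use object_store::{path::Path, MultipartUpload, ObjectStore, PutPayload};
use tokio::task::JoinSet;

/// Upload `buffer` as a series of parts with a bounded number of in-flight uploads.
/// Hypothetical helper; `part_size` and `max_concurrent_tasks` mirror the values in the diff.
async fn upload_in_parts(
    store: &dyn ObjectStore,
    path: &Path,
    buffer: Bytes,
    part_size: usize,
    max_concurrent_tasks: usize,
) -> object_store::Result<()> {
    let mut upload = store.put_multipart(path).await?;
    let mut tasks: JoinSet<object_store::Result<()>> = JoinSet::new();

    let mut offset = 0;
    while offset < buffer.len() {
        let end = (offset + part_size).min(buffer.len());
        // `Bytes::slice` is reference-counted, so no copy happens here.
        let part = buffer.slice(offset..end);
        offset = end;

        // `put_part` returns a future; spawning it lets several parts upload concurrently.
        tasks.spawn(upload.put_part(PutPayload::from(part)));

        // Apply back-pressure once the in-flight limit is reached.
        while tasks.len() >= max_concurrent_tasks {
            if let Some(res) = tasks.join_next().await {
                res.expect("upload task panicked")?;
            }
        }
    }

    // Drain the remaining uploads, then finalize the multipart upload.
    while let Some(res) = tasks.join_next().await {
        res.expect("upload task panicked")?;
    }
    upload.complete().await?;
    Ok(())
}
```

The key property is that `put_part` hands back a future rather than awaiting in place, so later parts can be sliced while earlier parts are still uploading.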
This is really a major behavior change. I am not terribly familiar with the maturity level of multipart uploads in object_store.
I don't think this is necessarily a bad change, but I am doubtful it addresses the originally linked issue.
As best as I can tell, the buffers are still going to fill up memory until the flush, and then the flush is going to fan out into parallel uploads:
flowchart LR
write --> buffer_batch;
write --> buffer_batch;
write --> buffer_batch;
buffer_batch --> flush;
flush --> p1;
flush --> p2;
flush --> p3;
flush --> p4;
p1 --> close;
p2 --> close;
p3 --> close;
p4 --> close;
Good point. In its current form we indeed still buffer until the flush, but the buffering and writing have more parallelism now, so it should be faster.
We could do two things here, btw:
- release this as is for people to experiment with (maybe after 1.0)
- iterate on this to also flush once the minimum part size is available, and create some benchmark tests (see the sketch below)
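A minimal sketch of that second bullet, assuming the writer keeps a `BytesMut` buffer and a `JoinSet` of in-flight part uploads (the type and field names are illustrative, not the PR's code): whenever the buffer reaches `part_size`, cut a part off and start uploading it immediately instead of holding everything until the final flush.

```rust
use bytes::{Bytes, BytesMut};
use object_store::{MultipartUpload, PutPayload};
use tokio::task::JoinSet;

/// Illustrative writer that starts uploading a part as soon as one is full.
/// Assumes it is driven inside a Tokio runtime (JoinSet::spawn panics otherwise).
struct EagerPartWriter {
    upload: Box<dyn MultipartUpload>,
    buffer: BytesMut,
    part_size: usize,
    tasks: JoinSet<object_store::Result<()>>,
}

impl EagerPartWriter {
    /// Append encoded bytes; spawn an upload for every complete part.
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        while self.buffer.len() >= self.part_size {
            let part: Bytes = self.buffer.split_to(self.part_size).freeze();
            self.tasks.spawn(self.upload.put_part(PutPayload::from(part)));
        }
    }

    /// Upload whatever is left as the final (possibly short) part and complete.
    async fn close(mut self) -> object_store::Result<()> {
        if !self.buffer.is_empty() {
            let part = self.buffer.split().freeze();
            self.tasks.spawn(self.upload.put_part(PutPayload::from(part)));
        }
        while let Some(res) = self.tasks.join_next().await {
            res.expect("upload task panicked")?;
        }
        self.upload.complete().await?;
        Ok(())
    }
}
```

One constraint to keep in mind: most stores require every part except the last to be at least 5 MiB, so `part_size` would need to respect that floor.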
@ion-elgreco In reply to your comment on #3157: yes, I think this would likely speed up the upload phase of z-ordering. Additionally, we've been watching this with interest because it would unblock compaction to target sizes > 5 GB, which is the upper limit for single-part uploads in R2.
@cjolowicz maybe give it a spin; I don't have any environments or datasets to actually test the performance myself.
Signed-off-by: Ion Koutsouris <[email protected]>
d9e79c5 to e096e70 (compare)
I remain skeptical that we will see substantive performance improvements here, but I am willing to try it in the next release 😈
Description
Changed to use Arrow's async writer and multi-part uploads.
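As an illustration of the general wiring the description refers to (not necessarily the exact code in this PR), one way to combine Arrow's async Parquet writer with multipart uploads is to feed `AsyncArrowWriter` an `object_store::buffered::BufWriter`, which switches to a multipart upload once its internal buffer fills. This assumes recent arrow/parquet/object_store releases; `write_parquet` is a hypothetical helper.

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use object_store::{buffered::BufWriter, path::Path, ObjectStore};
use parquet::arrow::AsyncArrowWriter;

/// Write record batches to object storage through Arrow's async Parquet writer.
/// Assumes `batches` is non-empty.
async fn write_parquet(
    store: Arc<dyn ObjectStore>,
    path: Path,
    batches: &[RecordBatch],
) -> Result<(), Box<dyn std::error::Error>> {
    let schema = batches[0].schema();
    // BufWriter implements AsyncWrite and upgrades to a multipart upload for large writes.
    let sink = BufWriter::new(store, path);
    // Signature for recent parquet releases; older versions also took a buffer size.
    let mut writer = AsyncArrowWriter::try_new(sink, schema, None)?;
    for batch in batches {
        writer.write(batch).await?;
    }
    writer.close().await?;
    Ok(())
}
```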
Related Issue(s)