refactor: async writer + multi-part #3255
Conversation
a1787c9 to 2353a21 (compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3255      +/-   ##
==========================================
- Coverage   72.14%   72.13%   -0.02%
==========================================
  Files         143      144       +1
  Lines       45668    45748      +80
  Branches    45668    45748      +80
==========================================
+ Hits        32949    33002      +53
- Misses      10634    10656      +22
- Partials     2085     2090       +5
self.object_store.put(&path, buffer.into()).await?;
let mut multi_part_upload = self.object_store.put_multipart(&path).await?;
let part_size = upload_part_size();
let mut tasks = JoinSet::new();
let max_concurrent_tasks = 10; // TODO: make configurable
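For context, the pattern the new code points at is roughly the one sketched below: slice the buffered bytes into parts, spawn each part upload onto a `JoinSet`, and cap the number of in-flight uploads. This is a minimal sketch against the `object_store` 0.10+ `MultipartUpload` API, not the PR's actual code; `upload_in_parts`, the slicing loop, and the error handling are illustrative.

```rust
use bytes::Bytes;
use object_store::{path::Path, MultipartUpload, ObjectStore, PutPayload};
use tokio::task::JoinSet;

/// Upload `buffer` as a series of parts with a bounded number of in-flight uploads.
/// Hypothetical helper; `part_size` and `max_concurrent_tasks` mirror the values in the diff.
async fn upload_in_parts(
    store: &dyn ObjectStore,
    path: &Path,
    buffer: Bytes,
    part_size: usize,
    max_concurrent_tasks: usize,
) -> object_store::Result<()> {
    let mut upload = store.put_multipart(path).await?;
    let mut tasks: JoinSet<object_store::Result<()>> = JoinSet::new();

    let mut offset = 0;
    while offset < buffer.len() {
        let end = (offset + part_size).min(buffer.len());
        // `Bytes::slice` is reference-counted, so no copy happens here.
        let part = buffer.slice(offset..end);
        offset = end;

        // `put_part` returns a future; spawning it lets several parts upload concurrently.
        tasks.spawn(upload.put_part(PutPayload::from(part)));

        // Apply back-pressure once the in-flight limit is reached.
        while tasks.len() >= max_concurrent_tasks {
            if let Some(res) = tasks.join_next().await {
                res.expect("upload task panicked")?;
            }
        }
    }

    // Drain the remaining uploads, then finalize the multipart upload.
    while let Some(res) = tasks.join_next().await {
        res.expect("upload task panicked")?;
    }
    upload.complete().await?;
    Ok(())
}
```

The key property is that `put_part` hands back a future rather than awaiting in place, so later parts can be sliced while earlier parts are still uploading.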
This is really a major behavior change. I am not terribly familiar with the maturity level of multipart uploads in object_store.
I don't think this is necessarily a bad change, but I am doubtful it addresses the originally linked issue.
As best as I can tell, the buffers are still going to fill up memory until the flush, and then the flush is going to fan out into parallel uploads:
flowchart LR
write --> buffer_batch;
write --> buffer_batch;
write --> buffer_batch;
buffer_batch --> flush;
flush --> p1;
flush --> p2;
flush --> p3;
flush --> p4;
p1 --> close;
p2 --> close;
p3 --> close;
p4 --> close;
Good point. In its current form we indeed still buffer until the flush, but the buffering and writing have more parallelism now, so it should be faster.
We could do two things here, btw:
- release this as is for people to experiment with (maybe after 1.0)
- iterate on this to also flush once the minimum part size is available, and create some benchmark tests (see the sketch below)
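A minimal sketch of that second bullet, assuming the writer keeps a `BytesMut` buffer and a `JoinSet` of in-flight part uploads (the type and field names are illustrative, not the PR's code): whenever the buffer reaches `part_size`, cut a part off and start uploading it immediately instead of holding everything until the final flush.

```rust
use bytes::{Bytes, BytesMut};
use object_store::{MultipartUpload, PutPayload};
use tokio::task::JoinSet;

/// Illustrative writer that starts uploading a part as soon as one is full.
/// Assumes it is driven inside a Tokio runtime (JoinSet::spawn panics otherwise).
struct EagerPartWriter {
    upload: Box<dyn MultipartUpload>,
    buffer: BytesMut,
    part_size: usize,
    tasks: JoinSet<object_store::Result<()>>,
}

impl EagerPartWriter {
    /// Append encoded bytes; spawn an upload for every complete part.
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        while self.buffer.len() >= self.part_size {
            let part: Bytes = self.buffer.split_to(self.part_size).freeze();
            self.tasks.spawn(self.upload.put_part(PutPayload::from(part)));
        }
    }

    /// Upload whatever is left as the final (possibly short) part and complete.
    async fn close(mut self) -> object_store::Result<()> {
        if !self.buffer.is_empty() {
            let part = self.buffer.split().freeze();
            self.tasks.spawn(self.upload.put_part(PutPayload::from(part)));
        }
        while let Some(res) = self.tasks.join_next().await {
            res.expect("upload task panicked")?;
        }
        self.upload.complete().await?;
        Ok(())
    }
}
```

One constraint to keep in mind: most stores require every part except the last to be at least 5 MiB, so `part_size` would need to respect that floor.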
@ion-elgreco In reply to your comment on #3157: yes, I think this would likely speed up the upload phase of z-ordering. Additionally, we've been watching this with interest because it would unblock compaction to target sizes > 5 GB, which is the upper limit for single-part uploads in R2.
@cjolowicz maybe give it a spin; I don't have any environments or datasets to actually test the performance myself.
Signed-off-by: Ion Koutsouris <[email protected]>
d9e79c5 to e096e70 (compare)
I remain skeptical that we will see substantive performance improvements here, but I am willing to try it in the next release 😈
Description
Changed to use Arrow's async writer and multi-part uploads.
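As an illustration of the general wiring the description refers to (not necessarily the exact code in this PR), one way to combine Arrow's async Parquet writer with multipart uploads is to feed `AsyncArrowWriter` an `object_store::buffered::BufWriter`, which switches to a multipart upload once its internal buffer fills. This assumes recent arrow/parquet/object_store releases; `write_parquet` is a hypothetical helper.

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use object_store::{buffered::BufWriter, path::Path, ObjectStore};
use parquet::arrow::AsyncArrowWriter;

/// Write record batches to object storage through Arrow's async Parquet writer.
/// Assumes `batches` is non-empty.
async fn write_parquet(
    store: Arc<dyn ObjectStore>,
    path: Path,
    batches: &[RecordBatch],
) -> Result<(), Box<dyn std::error::Error>> {
    let schema = batches[0].schema();
    // BufWriter implements AsyncWrite and upgrades to a multipart upload for large writes.
    let sink = BufWriter::new(store, path);
    // Signature for recent parquet releases; older versions also took a buffer size.
    let mut writer = AsyncArrowWriter::try_new(sink, schema, None)?;
    for batch in batches {
        writer.write(batch).await?;
    }
    writer.close().await?;
    Ok(())
}
```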
Related Issue(s)