fix: use single writer for all partition streams #3870
Conversation
Force-pushed from aaf88eb to 4e26be2
Codecov Report
❌ Patch coverage is …

@@ Coverage Diff @@
##             main    #3870      +/-   ##
==========================================
- Coverage   73.81%   73.76%   -0.05%
==========================================
  Files         151      151
  Lines       39154    39168     +14
  Branches    39154    39168     +14
==========================================
- Hits        28903    28894       -9
- Misses       8972     8983      +11
- Partials     1279     1291      +12
Force-pushed from 4e26be2 to 7263751
Force-pushed from 7263751 to ba89671
I like the concept of using a channel, and it makes sense to me conceptually. Can you bench it?
@abhiaagarwal yeah, I am running the bench you created in the background :)
@abhiaagarwal no performance difference, so that's good :)
It seems to be a consistent minor improvement though, which is nice! I guess there should be a benchmark that measures write performance on actual partitioned tables to get a real idea of the impact.
I quickly ran it the way I created the issue repro and didn't see much difference. So that is good; it only solves the small-file problem.
This small-file fix will be very nice for large compactions across multiple partitions, as I've noticed they create a huge number of small files due to DataFusion over-partitioning.
Description
Collecting the streams concurrently and sending them over a sync channel lets us keep a single writer instance that we write through. This prevents the edge case of writing many small files when DataFusion decides to create lots of relatively small partitioned streams. Since we now maintain one writer, we simply keep writing from all partition streams into it.
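As a rough illustration of the idea (a minimal sketch, not the actual delta-rs code: `Batch` is a hypothetical stand-in for an Arrow `RecordBatch`, and tokio's `mpsc` channel stands in for the channel used in the PR), each partition stream forwards its batches into one shared channel while a single writer task drains it:

```rust
// Sketch only: assumes `tokio` (features = ["full"]) and `futures` as dependencies.
use futures::{stream, StreamExt};
use tokio::sync::mpsc;

// Hypothetical stand-in for an Arrow RecordBatch.
struct Batch {
    partition: usize,
    rows: usize,
}

#[tokio::main]
async fn main() {
    // Pretend DataFusion handed us several small partition streams.
    let partition_streams: Vec<_> = (0..3usize)
        .map(|p| stream::iter((0..2).map(move |_| Batch { partition: p, rows: 3 })))
        .collect();

    let (tx, mut rx) = mpsc::channel::<Batch>(16);

    // Collect the streams concurrently: each producer forwards its batches
    // into the shared channel.
    let mut producers = Vec::new();
    for mut s in partition_streams {
        let tx = tx.clone();
        producers.push(tokio::spawn(async move {
            while let Some(batch) = s.next().await {
                if tx.send(batch).await.is_err() {
                    break; // the writer hung up
                }
            }
        }));
    }
    drop(tx); // close the channel once every producer is done

    // Single writer: batches from all partition streams land here, so we
    // produce one file instead of one small file per stream.
    let writer = tokio::spawn(async move {
        let mut total_rows = 0;
        while let Some(batch) = rx.recv().await {
            total_rows += batch.rows;
            println!("writing batch from partition {}", batch.partition);
        }
        println!("writer closed after {total_rows} rows -> one file");
    });

    for p in producers {
        p.await.unwrap();
    }
    writer.await.unwrap();
}
```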
I guess this stacks well with your PR @abhiaagarwal? Wdyt?
With the same code from the issue above, we now create only one file instead of 3 small files:
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}} {"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}} {"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}} {"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}}