fix: use single writer for all partition streams #3870
Conversation
Force-pushed from aaf88eb to 4e26be2
Codecov Report
❌ Patch coverage is …

@@ Coverage Diff @@
##             main    #3870      +/-   ##
==========================================
- Coverage   73.81%   73.76%   -0.05%
==========================================
  Files         151      151
  Lines       39154    39168     +14
  Branches    39154    39168     +14
==========================================
- Hits        28903    28894       -9
- Misses       8972     8983      +11
- Partials     1279     1291      +12
Force-pushed from 4e26be2 to 7263751
Force-pushed from 7263751 to ba89671
I like the concept of using a channel, and it makes sense to me conceptually. Can you bench it?
@abhiaagarwal yeah, I am running the bench you created in the background :)
@abhiaagarwal no performance difference, so that's good :)
It seems to be a consistent minor improvement though, which is nice! I guess there should be a benchmark that measures write performance on actual partitioned tables to get a real idea of the impact.
I quickly ran it the way I created the issue repro and didn't see much difference. So that is good; it only solves the small-file problem.
This small-file fix will be very nice for large compactions across multiple partitions, as I've noticed they create a huge number of small files due to DataFusion over-partitioning.
Description
Collecting the streams concurrently and sending them over a sync channel lets us keep a single writer instance that we write through. This prevents the edge case of writing many small files when DataFusion decides to create lots of relatively small partitioned streams. Since we now maintain one writer, we simply keep writing from all partition streams into it.
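As a rough illustration of the idea (a minimal sketch, not the actual delta-rs code: `Batch` is a hypothetical stand-in for an Arrow `RecordBatch`, and tokio's `mpsc` channel stands in for the channel used in the PR), each partition stream forwards its batches into one shared channel while a single writer task drains it:

```rust
// Sketch only: assumes `tokio` (features = ["full"]) and `futures` as dependencies.
use futures::{stream, StreamExt};
use tokio::sync::mpsc;

// Hypothetical stand-in for an Arrow RecordBatch.
struct Batch {
    partition: usize,
    rows: usize,
}

#[tokio::main]
async fn main() {
    // Pretend DataFusion handed us several small partition streams.
    let partition_streams: Vec<_> = (0..3usize)
        .map(|p| stream::iter((0..2).map(move |_| Batch { partition: p, rows: 3 })))
        .collect();

    let (tx, mut rx) = mpsc::channel::<Batch>(16);

    // Collect the streams concurrently: each producer forwards its batches
    // into the shared channel.
    let mut producers = Vec::new();
    for mut s in partition_streams {
        let tx = tx.clone();
        producers.push(tokio::spawn(async move {
            while let Some(batch) = s.next().await {
                if tx.send(batch).await.is_err() {
                    break; // the writer hung up
                }
            }
        }));
    }
    drop(tx); // close the channel once every producer is done

    // Single writer: batches from all partition streams land here, so we
    // produce one file instead of one small file per stream.
    let writer = tokio::spawn(async move {
        let mut total_rows = 0;
        while let Some(batch) = rx.recv().await {
            total_rows += batch.rows;
            println!("writing batch from partition {}", batch.partition);
        }
        println!("writer closed after {total_rows} rows -> one file");
    });

    for p in producers {
        p.await.unwrap();
    }
    writer.await.unwrap();
}
```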
I guess this stacks well with your PR @abhiaagarwal? Wdyt?
With the same code from the issue above, we now create only one file instead of 3 small files:
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}} {"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}} {"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}} {"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}}