Skip to content

Conversation

@ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Oct 19, 2025

Description

Collecting the streams concurrently and sending them over a sync channel provides the benefit of us keeping a single writer instance where we write through. This prevents the edge case of potentially writing many small files, when datafusion decided to lots of partitioned streams that are relatively small. Since we now maintain one writer we just keep writing from all partition streams into that.

I guess this stacks well with your PR @abhiaagarwal? Wdyt?

We now only create one file with the same code of the above issue instead of 3 small files:

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}}
{"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}}
{"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}}

@github-actions github-actions bot added the binding/python Issues for the Python package label Oct 19, 2025
@ion-elgreco ion-elgreco force-pushed the refactor/global-execution-writer branch from aaf88eb to 4e26be2 Compare October 19, 2025 14:50
@github-actions github-actions bot added binding/rust Issues for the Rust crate and removed binding/python Issues for the Python package labels Oct 19, 2025
@codecov
Copy link

codecov bot commented Oct 19, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.76%. Comparing base (d0ae849) to head (03d860d).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/operations/write/execution.rs 80.00% 10 Missing and 25 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3870      +/-   ##
==========================================
- Coverage   73.81%   73.76%   -0.05%     
==========================================
  Files         151      151              
  Lines       39154    39168      +14     
  Branches    39154    39168      +14     
==========================================
- Hits        28903    28894       -9     
- Misses       8972     8983      +11     
- Partials     1279     1291      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ion-elgreco ion-elgreco force-pushed the refactor/global-execution-writer branch from 4e26be2 to 7263751 Compare October 19, 2025 14:54
Signed-off-by: Ion Koutsouris
<[email protected]> Signed-off-by: Ion
Koutsouris <[email protected]>
Signed-off-by: Ion Koutsouris
<[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
@ion-elgreco ion-elgreco force-pushed the refactor/global-execution-writer branch from 7263751 to ba89671 Compare October 19, 2025 14:54
@abhiaagarwal
Copy link
Contributor

I like the concept of using a channel and it makes sense to me conceptually, can you bench it?

@ion-elgreco
Copy link
Collaborator Author

@abhiaagarwal yeah I am running the bench you've created in the background :)

@ion-elgreco
Copy link
Collaborator Author

@abhiaagarwal no performance difference, so that's good :)

------------------------------------------------------------------------------------------------- benchmark 'optimize': 8 tests -------------------------------------------------------------------------------------------------
Name (time in ms)                                          Min                   Max                  Mean             StdDev                Median                 IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_optimize[5] (0002_channel)             719.7418 (1.0)        751.5428 (1.02)       733.8279 (1.00)     12.1159 (2.13)       733.1993 (1.00)      16.8272 (2.14)          2;0  1.3627 (1.00)          5           1
test_benchmark_optimize[5] (0001_main)                721.7502 (1.00)       736.3818 (1.0)        730.2095 (1.0)       5.7009 (1.0)        732.3502 (1.0)        7.8672 (1.0)           2;0  1.3695 (1.0)           5           1
test_benchmark_optimize[1] (0001_main)              2,206.5672 (3.07)     2,355.7690 (3.20)     2,258.7503 (3.09)     64.7090 (11.35)    2,221.4170 (3.03)      97.2196 (12.36)         1;0  0.4427 (0.32)          5           1
test_benchmark_optimize[1] (0002_channel)           2,263.9967 (3.15)     2,340.0274 (3.18)     2,306.7549 (3.16)     28.8090 (5.05)     2,313.2672 (3.16)      38.4981 (4.89)          2;0  0.4335 (0.32)          5           1
test_benchmark_optimize_minio[5] (0002_channel)     3,780.0758 (5.25)     3,960.5010 (5.38)     3,861.8949 (5.29)     72.8736 (12.78)    3,833.5507 (5.23)     111.8059 (14.21)         2;0  0.2589 (0.19)          5           1
test_benchmark_optimize_minio[5] (0001_main)        3,807.3613 (5.29)     4,045.7922 (5.49)     3,942.8511 (5.40)     99.9441 (17.53)    3,968.4321 (5.42)     167.3693 (21.27)         2;0  0.2536 (0.19)          5           1
test_benchmark_optimize_minio[1] (0002_channel)     5,955.1159 (8.27)     6,052.6467 (8.22)     6,004.6043 (8.22)     35.6177 (6.25)     5,998.3244 (8.19)      41.0871 (5.22)          2;0  0.1665 (0.12)          5           1
test_benchmark_optimize_minio[1] (0001_main)        6,066.0387 (8.43)     6,115.9411 (8.31)     6,090.9043 (8.34)     20.6738 (3.63)     6,093.9989 (8.32)      34.9651 (4.44)          2;0  0.1642 (0.12)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------- benchmark 'write': 4 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                    Min                   Max                  Mean             StdDev                Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_write (0001_main)                250.2680 (1.0)        279.8725 (1.05)       265.9785 (1.03)     12.3082 (2.15)       265.6214 (1.04)     20.9681 (4.45)          2;0  3.7597 (0.97)          5           1
test_benchmark_write (0002_channel)             253.7139 (1.01)       267.1352 (1.0)        257.0089 (1.0)       5.7138 (1.0)        254.7807 (1.0)       4.7100 (1.0)           1;1  3.8909 (1.0)           5           1
test_benchmark_write_minio (0002_channel)     1,073.5712 (4.29)     1,129.8137 (4.23)     1,093.5961 (4.26)     22.8959 (4.01)     1,089.3257 (4.28)     32.2742 (6.85)          1;0  0.9144 (0.24)          5           1
test_benchmark_write_minio (0001_main)        1,087.7237 (4.35)     1,138.7201 (4.26)     1,107.5863 (4.31)     18.8371 (3.30)     1,103.6610 (4.33)     16.5907 (3.52)          2;0  0.9029 (0.23)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Signed-off-by: Ion Koutsouris <[email protected]>
@github-actions github-actions bot added the binding/python Issues for the Python package label Oct 19, 2025
@ion-elgreco ion-elgreco changed the title refactor: use sync channel for write_execution_plan fix: use sync channel for write_execution_plan Oct 19, 2025
@ion-elgreco ion-elgreco changed the title fix: use sync channel for write_execution_plan fix: use single writer in write_execution_plan Oct 19, 2025
@ion-elgreco ion-elgreco changed the title fix: use single writer in write_execution_plan fix: use single writer for all partition streams Oct 19, 2025
@ion-elgreco ion-elgreco enabled auto-merge (squash) October 19, 2025 15:59
@abhiaagarwal
Copy link
Contributor

It seems to be a consistent minor improvement though, which is nice! I guess there should be a benchmark itself that measures the performance of writing on actual partitioned tables to get a real idea of performance.

@ion-elgreco
Copy link
Collaborator Author

It seems to be a consistent minor improvement though, which is nice! I guess there should be a benchmark itself that measures the performance of writing on actual partitioned tables to get a real idea of performance.

I quickly ran it with the way I created the issue repro, I didn't see much difference. So that is good, it only solves the small file size problem

@abhiaagarwal
Copy link
Contributor

The same file problem will be very nice for large compacts across multiple partitions, as I've noticed it creates a huge amount of small files due to Datafusion over-partitioning.

@ion-elgreco ion-elgreco merged commit f7688c4 into delta-io:main Oct 19, 2025
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Chunked dataframe creates small files

3 participants