fix: miss output ordering during projection #15683

xudong963 · 2025-04-11T10:32:25Z

Which issue does this PR close?

Related my use case

Rationale for this change

Currently, the base_config.output_ordering is specified by the external mechanism; the users usually decide on the output_ordering. See here: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/table.rs#L273-L277

That is, users should ensure that the output ordering is correct.

Most users may not enable the collect_statistic config, so they will miss output ordering here. Even if the config is enabled, the statistic may be inaccurate, or for some file formats, they don't have statistics. If users write the output_ordering to config, I guess they think/believe their data is sorted, but the check in get_projected_output_ordering will drop the output ordering for them.

What changes are included in this PR?

Still do the check, but only output warning.

Are these changes tested?

Are there any user-facing changes?

xudong963 · 2025-04-11T11:06:36Z

I don't understand the failing sqllogictest:

# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
# due to there being more files than partitions:
query TT
EXPLAIN SELECT int_col, string_col
FROM test_table
ORDER BY string_col, int_col;
----
logical_plan
01)Sort: test_table.string_col ASC NULLS LAST, test_table.int_col ASC NULLS LAST
02)--TableScan: test_table projection=[int_col, string_col]
physical_plan
01)SortPreservingMergeExec: [string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST]
02)--SortExec: expr=[string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST], preserve_partitioning=[true]
03)----DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}, projection=[int_col, string_col], file_type=parquet

Based on current code and doc, my understanding is that the output_ordering, aka. file_sort_order refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

So why expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔

suremarc · 2025-04-14T05:37:45Z

That is, users should ensure that the output ordering is correct.

One of the users as of now is ListingTable, which I don't believe makes any such guarantees, so we would have to fix ListingTable to ensure the output_ordering is correct. But even if we did, I would worry about breaking third-party users of FileScanConfig that rely on DataFusion to perform this check.

Based on current code and doc, my understanding is that the output_ordering, aka. file_sort_order refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

So why expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔

In the past we only added an output_ordering to the plan if each file group had at most 1 file. Otherwise concatenating files isn't guaranteed to preserve order. Later we relaxed this by looking at statistics to see if the files are nonoverlapping and ordered with respect to the sort keys.

I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.

xudong963 · 2025-04-14T09:25:31Z

I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.

How about adding a new config, such as check_file_group_ordering, and make it true as default? In this way, we won't change the current behavior. But I'm concerned that if users only wanna one of the tables to disable the check, they need to implement their own table configuration or something else to avoid the config influencing the global. Also cc @alamb

xudong963 · 2025-04-16T03:26:00Z

Also, I noticed the output_ordering here: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/file_scan_config.rs#L807 isn't checked.

Maybe we can extract the check logic to a method and add a check for the output_ordering too.

github-actions · 2025-06-16T02:13:10Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

fix: miss output ordering during projection

42c2e79

github-actions bot added the datasource Changes to the datasource crate label Apr 11, 2025

xudong963 marked this pull request as draft April 11, 2025 10:41

github-actions bot added the Stale PR has not had any activity for some time label Jun 16, 2025

xudong963 closed this Jun 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: miss output ordering during projection #15683

fix: miss output ordering during projection #15683

Uh oh!

xudong963 commented Apr 11, 2025 •

edited

Loading

Uh oh!

xudong963 commented Apr 11, 2025

Uh oh!

suremarc commented Apr 14, 2025 •

edited

Loading

Uh oh!

xudong963 commented Apr 14, 2025

Uh oh!

xudong963 commented Apr 16, 2025

Uh oh!

github-actions bot commented Jun 16, 2025

Uh oh!

Uh oh!

fix: miss output ordering during projection #15683

fix: miss output ordering during projection #15683

Uh oh!

Conversation

xudong963 commented Apr 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

xudong963 commented Apr 11, 2025

Uh oh!

suremarc commented Apr 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xudong963 commented Apr 14, 2025

Uh oh!

xudong963 commented Apr 16, 2025

Uh oh!

github-actions bot commented Jun 16, 2025

Uh oh!

Uh oh!

xudong963 commented Apr 11, 2025 •

edited

Loading

suremarc commented Apr 14, 2025 •

edited

Loading