Skip to content

fix: miss output ordering during projection #15683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

xudong963
Copy link
Member

@xudong963 xudong963 commented Apr 11, 2025

Which issue does this PR close?

  • Related my use case

Rationale for this change

Currently, the base_config.output_ordering is specified by the external mechanism; the users usually decide on the output_ordering. See here: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/table.rs#L273-L277

That is, users should ensure that the output ordering is correct.

Most users may not enable the collect_statistic config, so they will miss output ordering here. Even if the config is enabled, the statistic may be inaccurate, or for some file formats, they don't have statistics. If users write the output_ordering to config, I guess they think/believe their data is sorted, but the check in get_projected_output_ordering will drop the output ordering for them.

What changes are included in this PR?

Still do the check, but only output warning.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the datasource Changes to the datasource crate label Apr 11, 2025
@xudong963 xudong963 marked this pull request as draft April 11, 2025 10:41
@xudong963
Copy link
Member Author

I don't understand the failing sqllogictest:

# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
# due to there being more files than partitions:
query TT
EXPLAIN SELECT int_col, string_col
FROM test_table
ORDER BY string_col, int_col;
----
logical_plan
01)Sort: test_table.string_col ASC NULLS LAST, test_table.int_col ASC NULLS LAST
02)--TableScan: test_table projection=[int_col, string_col]
physical_plan
01)SortPreservingMergeExec: [string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST]
02)--SortExec: expr=[string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST], preserve_partitioning=[true]
03)----DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}, projection=[int_col, string_col], file_type=parquet

Based on current code and doc, my understanding is that the output_ordering, aka. file_sort_order refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

So why expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔

@suremarc
Copy link
Contributor

suremarc commented Apr 14, 2025

That is, users should ensure that the output ordering is correct.

One of the users as of now is ListingTable, which I don't believe makes any such guarantees, so we would have to fix ListingTable to ensure the output_ordering is correct. But even if we did, I would worry about breaking third-party users of FileScanConfig that rely on DataFusion to perform this check.

Based on current code and doc, my understanding is that the output_ordering, aka. file_sort_order refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

So why expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔

In the past we only added an output_ordering to the plan if each file group had at most 1 file. Otherwise concatenating files isn't guaranteed to preserve order. Later we relaxed this by looking at statistics to see if the files are nonoverlapping and ordered with respect to the sort keys.

I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.

@xudong963
Copy link
Member Author

I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.

How about adding a new config, such as check_file_group_ordering, and make it true as default? In this way, we won't change the current behavior. But I'm concerned that if users only wanna one of the tables to disable the check, they need to implement their own table configuration or something else to avoid the config influencing the global. Also cc @alamb

@xudong963
Copy link
Member Author

Also, I noticed the output_ordering here: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/file_scan_config.rs#L807 isn't checked.

Maybe we can extract the check logic to a method and add a check for the output_ordering too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants