refactor(query): Optimize FLATTEN function with filter condition #17892
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
This PR enhances the performance of the `FLATTEN` function in Databend, specifically addressing the issue of potential Out-of-Memory (OOM) errors caused by data replication.

Background:
The `FLATTEN` function is a set-returning function that transforms nested JSON data into a tabular format. It generates six new columns: `seq`, `key`, `path`, `index`, `value`, and `this`. The `this` column contains the original JSON data. Because a set-returning function can emit many output rows for each input row, the number of rows can increase significantly, leading to substantial data duplication in the `this` column and potential OOM issues.

PR #13935 introduced column pruning for `FLATTEN`, which avoids generating data for columns not explicitly selected by the user. However, this optimization was bypassed when the SQL query included filter conditions.

This PR's Contribution:
This PR introduces filter condition matching to the `FLATTEN` function. Now, even when the SQL query contains filter conditions, the column pruning optimization is still applied. This means that the `this` column (and other unneeded columns) is not populated with redundant data, significantly reducing memory usage and improving query performance.

Benefits:
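The memory effect can be illustrated with a minimal Python sketch. This is a toy model, not Databend's implementation: the helper names (`required_columns`, `flatten`, `this_bytes`) are hypothetical, the path format is simplified, and the rule "materialize the columns that are selected plus the columns read by the filter" is an assumption about how pruning and filter matching combine.

```python
import json
from typing import Any

# The six output columns named in the PR description.
ALL_COLUMNS = ("seq", "key", "path", "index", "value", "this")


def required_columns(projected: set[str], filtered: set[str]) -> set[str]:
    """Columns to materialize: selected ones plus those read by the filter (assumed rule)."""
    return (projected | filtered) & set(ALL_COLUMNS)


def flatten(doc: Any, seq: int, needed: set[str]) -> list[dict[str, Any]]:
    """Expand one JSON object or array into rows, building only the needed columns."""
    if isinstance(doc, dict):
        entries = [(k, None, k, v) for k, v in doc.items()]
    elif isinstance(doc, list):
        entries = [(None, i, f"[{i}]", v) for i, v in enumerate(doc)]
    else:
        entries = []
    rows = []
    for key, index, path, value in entries:
        row: dict[str, Any] = {}
        if "seq" in needed:
            row["seq"] = seq
        if "key" in needed:
            row["key"] = key
        if "path" in needed:
            row["path"] = path
        if "index" in needed:
            row["index"] = index
        if "value" in needed:
            row["value"] = value
        if "this" in needed:
            # One serialized copy of the whole input per output row:
            # this is the duplication that pruning avoids.
            row["this"] = json.dumps(doc)
        rows.append(row)
    return rows


def this_bytes(rows: list[dict[str, Any]]) -> int:
    """Total size of the materialized `this` values across all rows."""
    return sum(len(r["this"]) for r in rows if "this" in r)


doc = {"a": 1, "b": 2, "c": 3}

# Query shape being modeled: select `value` where `key` = 'b'.
needed = required_columns(projected={"value"}, filtered={"key"})
pruned = flatten(doc, seq=1, needed=needed)
unpruned = flatten(doc, seq=1, needed=set(ALL_COLUMNS))

result = [r["value"] for r in pruned if r["key"] == "b"]
print(result)
print(this_bytes(unpruned), this_bytes(pruned))
```

In this model the filter still evaluates correctly because `key` is kept, while `this` is never built, so its cost no longer grows with the number of output rows.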
Verification:
We can verify the effectiveness of this optimization by using the `EXPLAIN` command. In the output, the `flatten` function has params that identify the columns for which data is generated. Params and columns have the following mapping relation.

Example:
Tests
Type of change