refactor(query): Optimize `FLATTEN` function with filter condition #17892

b41sh · 2025-05-07T07:50:38Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR enhances the performance of the FLATTEN function in Databend, specifically addressing the issue of potential Out-of-Memory (OOM) errors caused by data replication.

Background:

The FLATTEN function is a set-returning function that transforms nested JSON data into a tabular format. It generates six new columns: seq, key, path, index, value, and this. The this column contains the original JSON data. Due to the nature of set-returning functions, the number of rows can significantly increase, leading to substantial data duplication in the this column and potential OOM issues.

PR #13935 introduced column pruning for FLATTEN, which avoids generating data for columns not explicitly selected by the user. However, this optimization was bypassed when the SQL query included filter conditions.

This PR's Contribution:

This PR introduces filter condition matching to the FLATTEN function. Now, even when the SQL query contains filter statements, the column pruning optimization will still be applied. This means that the this column (and other unneeded columns) will not be populated with redundant data, even when filters are present, significantly reducing memory usage and improving query performance.

Benefits:

Reduced Memory Consumption: By applying column pruning even with filters, this PR minimizes data replication and reduces the risk of OOM errors.
Improved Query Performance: Less data to process translates to faster query execution times.

Verification:

We can verify the effectiveness of this optimization by using the EXPLAIN command. The output of flatten function have params to identify the columns that generated data. Params and columns have the following mapping relation.

param	column
1	seq
2	key
3	path
4	index
5	value
6	this

Example:

CREATE TABLE persons(id int, c variant);

INSERT INTO persons (id, c) VALUES
    (12712555, '{"name":{"first":"John","last":"Smith"},"contact":[{"business":[{"type":"phone","content":"555-1234"},{"type":"email","content":"[email protected]"}]}]}'),
    (98127771, '{"name":{"first":"Jane","last":"Doe"},"contact":[{"business":[{"type":"phone","content":"555-1236"},{"type":"email","content":"[email protected]"}]}]}');

SELECT p.id, f.key, f.path, f.value FROM persons p, LATERAL FLATTEN(input => p.c) f WHERE key = 'name';
╭─────────────────────────────────────────────────────────────────────────────────────────╮
│        id       │        key       │       path       │              value              │
│ Nullable(Int32) │ Nullable(String) │ Nullable(String) │        Nullable(Variant)        │
├─────────────────┼──────────────────┼──────────────────┼─────────────────────────────────┤
│        12712555 │ name             │ name             │ {"first":"John","last":"Smith"} │
│        98127771 │ name             │ name             │ {"first":"Jane","last":"Doe"}   │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

EXPLAIN SELECT p.id, f.key, f.path, f.value FROM persons p, LATERAL FLATTEN(input => p.c) f WHERE key = 'name';
-[ EXPLAIN ]-----------------------------------
EvalScalar
├── output columns: [p.id (#0), key (#10), path (#11), value (#13)]
├── expressions: [get(2)(flatten(p.c (#1)) (#8)), get(3)(flatten(p.c (#1)) (#8)), get(5)(flatten(p.c (#1)) (#8))]
├── estimated rows: 1.20
└── Filter
    ├── output columns: [p.id (#0), flatten(p.c (#1)) (#8)]
    ├── filters: [is_true(get(2)(flatten(p.c (#1)) (#8)) = 'name')]
    ├── estimated rows: 1.20
    └── ProjectSet
        ├── output columns: [p.id (#0), flatten(p.c (#1)) (#8)]
        ├── estimated rows: 6.00
        ├── set returning functions: flatten(2, 3, 5)(p.c (#1))
        └── TableScan
            ├── table: default.default.persons
            ├── output columns: [id (#0), c (#1)]
            ├── read rows: 2
            ├── read size: < 1 KiB
            ├── partitions total: 1
            ├── partitions scanned: 1
            ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 1 to 1>]
            ├── push downs: [filters: [], limit: NONE]
            └── estimated rows: 2.00

fixes: #[Link the issue here]

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

This change is

refactor(query): Optimize FLATTEN function with filter condition

9da2e8a

github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label May 7, 2025

b41sh requested review from zhang2014 and sundy-li May 7, 2025 07:51

b41sh added 2 commits May 7, 2025 15:54

fix

0f5c3dc

Merge branch 'main' into feat-flatten-filter

e0c27b8

zhang2014 approved these changes May 7, 2025

View reviewed changes

sundy-li approved these changes May 7, 2025

View reviewed changes

zhang2014 merged commit 7308158 into databendlabs:main May 7, 2025
76 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(query): Optimize `FLATTEN` function with filter condition #17892

refactor(query): Optimize `FLATTEN` function with filter condition #17892

b41sh commented May 7, 2025 •

edited by drmingdrmer

Loading

refactor(query): Optimize FLATTEN function with filter condition #17892

refactor(query): Optimize FLATTEN function with filter condition #17892

Conversation

b41sh commented May 7, 2025 • edited by drmingdrmer Loading

Summary

Tests

Type of change

refactor(query): Optimize `FLATTEN` function with filter condition #17892

refactor(query): Optimize `FLATTEN` function with filter condition #17892

b41sh commented May 7, 2025 •

edited by drmingdrmer

Loading