Skip to content

refactor(query): Optimize FLATTEN function with filter condition #17892

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 7, 2025

Conversation

b41sh
Copy link
Member

@b41sh b41sh commented May 7, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR enhances the performance of the FLATTEN function in Databend, specifically addressing the issue of potential Out-of-Memory (OOM) errors caused by data replication.

Background:

The FLATTEN function is a set-returning function that transforms nested JSON data into a tabular format. It generates six new columns: seq, key, path, index, value, and this. The this column contains the original JSON data. Due to the nature of set-returning functions, the number of rows can significantly increase, leading to substantial data duplication in the this column and potential OOM issues.

PR #13935 introduced column pruning for FLATTEN, which avoids generating data for columns not explicitly selected by the user. However, this optimization was bypassed when the SQL query included filter conditions.

This PR's Contribution:

This PR introduces filter condition matching to the FLATTEN function. Now, even when the SQL query contains filter statements, the column pruning optimization will still be applied. This means that the this column (and other unneeded columns) will not be populated with redundant data, even when filters are present, significantly reducing memory usage and improving query performance.

Benefits:

  • Reduced Memory Consumption: By applying column pruning even with filters, this PR minimizes data replication and reduces the risk of OOM errors.
  • Improved Query Performance: Less data to process translates to faster query execution times.

Verification:

We can verify the effectiveness of this optimization by using the EXPLAIN command. The output of flatten function have params to identify the columns that generated data. Params and columns have the following mapping relation.

param column
1 seq
2 key
3 path
4 index
5 value
6 this

Example:

CREATE TABLE persons(id int, c variant);

INSERT INTO persons (id, c) VALUES
    (12712555, '{"name":{"first":"John","last":"Smith"},"contact":[{"business":[{"type":"phone","content":"555-1234"},{"type":"email","content":"[email protected]"}]}]}'),
    (98127771, '{"name":{"first":"Jane","last":"Doe"},"contact":[{"business":[{"type":"phone","content":"555-1236"},{"type":"email","content":"[email protected]"}]}]}');

SELECT p.id, f.key, f.path, f.value FROM persons p, LATERAL FLATTEN(input => p.c) f WHERE key = 'name';
╭─────────────────────────────────────────────────────────────────────────────────────────╮
│        id       │        key       │       path       │              value              │
│ Nullable(Int32) │ Nullable(String) │ Nullable(String) │        Nullable(Variant)        │
├─────────────────┼──────────────────┼──────────────────┼─────────────────────────────────┤
│        12712555 │ name             │ name             │ {"first":"John","last":"Smith"} │
│        98127771 │ name             │ name             │ {"first":"Jane","last":"Doe"}   │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

EXPLAIN SELECT p.id, f.key, f.path, f.value FROM persons p, LATERAL FLATTEN(input => p.c) f WHERE key = 'name';
-[ EXPLAIN ]-----------------------------------
EvalScalar
├── output columns: [p.id (#0), key (#10), path (#11), value (#13)]
├── expressions: [get(2)(flatten(p.c (#1)) (#8)), get(3)(flatten(p.c (#1)) (#8)), get(5)(flatten(p.c (#1)) (#8))]
├── estimated rows: 1.20
└── Filter
    ├── output columns: [p.id (#0), flatten(p.c (#1)) (#8)]
    ├── filters: [is_true(get(2)(flatten(p.c (#1)) (#8)) = 'name')]
    ├── estimated rows: 1.20
    └── ProjectSet
        ├── output columns: [p.id (#0), flatten(p.c (#1)) (#8)]
        ├── estimated rows: 6.00
        ├── set returning functions: flatten(2, 3, 5)(p.c (#1))
        └── TableScan
            ├── table: default.default.persons
            ├── output columns: [id (#0), c (#1)]
            ├── read rows: 2
            ├── read size: < 1 KiB
            ├── partitions total: 1
            ├── partitions scanned: 1
            ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 1 to 1>]
            ├── push downs: [filters: [], limit: NONE]
            └── estimated rows: 2.00
  • fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label May 7, 2025
@b41sh b41sh requested review from zhang2014 and sundy-li May 7, 2025 07:51
@zhang2014 zhang2014 merged commit 7308158 into databendlabs:main May 7, 2025
76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants