@J-Meyers J-Meyers commented Sep 25, 2025

This adds support for pushing down relatively complex filters via ComplexFilterPushdown.
Follow-up to #471.

Motivation:

As mentioned in my previous PR, even relatively simple queries such as the following currently can't be pushed down without this:

FROM partitioned_table
WHERE (partition_col_0 = 0 AND partition_col_1 = 1) OR partition_col_1 = 2

The manifesto says:

Hidden Partitioning and Pruning: DuckLake is aware of data partitioning and table- and file-level statistics, allowing for early pruning of scans for maximum efficiency.

But in a variety of cases this pruning doesn't currently apply, to the point that performance is (I think) actually worse than with Hive-structured files in some cases, because base DuckDB will push down complex filters with Hive partitioning.

With optimizations like this, I don't think any fair or realistic benchmark can be made, since the result is just 1/N for however big an N you want.

Design:

The general design is to first try passing each expression through the FilterCombiner, since it can do a lot of the work. If there are leftovers, recurse into the ORs and ANDs; if the combiner was able to extract useful groups of functions, parse them as we did before. Propagate unsatisfiability when a piece is unsatisfiable, but in general keep as many restrictions as possible. Since we're interacting through SQL, we can use it to handle arbitrary expression trees rather than being restricted to single-column filters as in TableFilterSet.
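As a rough illustration of the recursion described above (this is not the PR's C++ code), the Python sketch below walks an AND/OR tree of equality filters and keeps a file unless its min/max statistics prove that no row can match. The `Eq`, `And`, `Or` classes, the `may_match` function, and the file names are all hypothetical; the handling of missing statistics mirrors the `min_value IS NULL OR max_value IS NULL OR ...` predicates visible in the profiling output further down.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

# Per-file statistics: column name -> (min_value, max_value); None means unknown.
Stats = Dict[str, Tuple[Optional[int], Optional[int]]]


@dataclass(frozen=True)
class Eq:
    column: str
    value: int


@dataclass(frozen=True)
class And:
    children: Tuple["Expr", ...]


@dataclass(frozen=True)
class Or:
    children: Tuple["Expr", ...]


Expr = Union[Eq, And, Or]


def may_match(expr: Expr, stats: Stats) -> bool:
    """Return False only when the stats prove no row in the file can match."""
    if isinstance(expr, Eq):
        lo, hi = stats.get(expr.column, (None, None))
        if lo is None or hi is None:
            return True  # missing stats: keep the file to stay correct
        return lo <= expr.value <= hi
    if isinstance(expr, And):
        # One unsatisfiable child makes the whole conjunction unsatisfiable.
        return all(may_match(child, stats) for child in expr.children)
    if isinstance(expr, Or):
        # A disjunction is unsatisfiable only if every child is.
        return any(may_match(child, stats) for child in expr.children)
    raise TypeError(f"unsupported expression: {expr!r}")


# (partition_col_0 = 0 AND partition_col_1 = 1) OR partition_col_1 = 2
FILTER = Or((
    And((Eq("partition_col_0", 0), Eq("partition_col_1", 1))),
    Eq("partition_col_1", 2),
))

# Hypothetical files with made-up statistics.
FILES = {
    "a.parquet": {"partition_col_0": (0, 0), "partition_col_1": (1, 1)},
    "b.parquet": {"partition_col_0": (5, 5), "partition_col_1": (2, 2)},
    "c.parquet": {"partition_col_0": (5, 5), "partition_col_1": (1, 1)},
    "d.parquet": {"partition_col_0": (0, 0), "partition_col_1": (None, None)},
}


def files_to_scan(expr: Expr, files: Dict[str, Stats]) -> list:
    return sorted(name for name, stats in files.items() if may_match(expr, stats))
```

With these made-up statistics only `c.parquet` is pruned: it fails both branches of the OR, while `d.parquet` is kept because one of its columns has no usable statistics.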

The guiding principle is to have the catalog do as much work as possible, given the expected balance between the size of the DuckLake (massive) and the size of the data touched by common queries (small enough for a single machine).
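To make "the catalog does the work" concrete, here is a minimal Python sketch that turns each leaf comparison into a subquery over the stats table and combines the subqueries with AND/OR, roughly the shape visible in the plan below (`SUBQUERY OR (SUBQUERY AND SUBQUERY)`). It is an assumption-laden stand-in, not the SQL the PR actually generates: SQLite plays the role of the metadata store, `CAST` replaces DuckDB's `TRY_CAST`, the column ids 1 and 2 are invented, and `leaf_sql`, `combine`, and `pruning_query` are hypothetical helper names. Only the table and column names are taken from the profiling output.

```python
import sqlite3

STATS_TABLE = "ducklake_file_column_stats"  # name taken from the plan output


def leaf_sql(table_id: int, column_id: int, value: int) -> str:
    """SQL for `column = value`: files whose [min, max] range may contain value."""
    return (
        f"data_file_id IN (SELECT data_file_id FROM {STATS_TABLE} "
        f"WHERE table_id = {table_id} AND column_id = {column_id} "
        f"AND (min_value IS NULL OR max_value IS NULL "
        f"OR ({value} >= CAST(min_value AS INTEGER) "
        f"AND {value} <= CAST(max_value AS INTEGER))))"
    )


def combine(op: str, parts: list) -> str:
    """Join already-built predicates with AND or OR, keeping them parenthesized."""
    return "(" + f" {op} ".join(parts) + ")"


def pruning_query(table_id: int) -> str:
    # (col 1 = 0 AND col 2 = 1) OR col 2 = 2, with hypothetical column ids 1 and 2
    where = combine("OR", [
        combine("AND", [leaf_sql(table_id, 1, 0), leaf_sql(table_id, 2, 1)]),
        leaf_sql(table_id, 2, 2),
    ])
    return (
        "SELECT path FROM ducklake_data_file "
        f"WHERE table_id = {table_id} AND {where} ORDER BY path"
    )


def demo() -> list:
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ducklake_data_file "
        "(data_file_id INTEGER, table_id INTEGER, path TEXT)"
    )
    con.execute(
        f"CREATE TABLE {STATS_TABLE} (data_file_id INTEGER, table_id INTEGER, "
        "column_id INTEGER, min_value TEXT, max_value TEXT)"
    )
    con.executemany(
        "INSERT INTO ducklake_data_file VALUES (?, 155, ?)",
        [(1, "a.parquet"), (2, "b.parquet"), (3, "c.parquet")],
    )
    con.executemany(
        f"INSERT INTO {STATS_TABLE} VALUES (?, 155, ?, ?, ?)",
        [
            (1, 1, "0", "0"), (1, 2, "1", "1"),  # satisfies the AND branch
            (2, 1, "5", "5"), (2, 2, "2", "2"),  # satisfies the second OR branch
            (3, 1, "5", "5"), (3, 2, "1", "1"),  # satisfies neither: pruned
        ],
    )
    rows = con.execute(pruning_query(155)).fetchall()
    con.close()
    return [row[0] for row in rows]
```

Because the whole tree is expressed as one SQL predicate, the metadata database returns only the files that may match, and nothing has to be filtered on the client side.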

This will likely add a tiny overhead to queries whose filters are complicated but don't actually eliminate any files, particularly when the data in DuckLake is of comparable size to the metadata (which seems incredibly rare).

Annoyingly, the actual creation of the queries has to be delayed a bit because we may later receive a dynamic pushdown. We want that to still keep all our older, more complex filters and not add new filters unless necessary.

Performance:

When running the generated query via the CLI on realistic datasets, it takes < 0.05 seconds.
However, when logging the time within DuckLakeTransaction::Query while running the relevant user query, the generated query takes ~0.25 seconds. I'm not sure why there is this disconnect, but it is pretty significant.

These numbers are with DuckDB as the metadata database. Performance may differ on other metadata stores, but it should still be faster than reading hundreds of Parquet files.

We reference the same column's stats multiple times, so there may be slight performance improvements from creating a materialized CTE that gets the relevant stats for each column, but I'm not sure how significant that would be versus the added complexity. duckdb/duckdb#19080 should begin the process of doing this automatically, at least within DuckDB.
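For reference, the materialized-CTE idea could look roughly like the Python sketch below, which reads the stats for one column once and then probes that CTE from two subqueries. This is only an assumed query shape, not something the PR implements: SQLite stands in for the metadata database, `CAST` replaces `TRY_CAST`, the `col_stats` name and the sample rows are invented, and the `MATERIALIZED` hint is only added when the bundled SQLite is new enough (3.35 or later) to accept it.

```python
import sqlite3

# SQLite accepts the MATERIALIZED hint from 3.35 on; otherwise use a plain CTE.
HINT = "MATERIALIZED " if sqlite3.sqlite_version_info >= (3, 35, 0) else ""

# Read the stats for column 19 once, then probe the CTE for each constant.
QUERY = f"""
WITH col_stats AS {HINT}(
    SELECT data_file_id,
           CAST(min_value AS INTEGER) AS min_v,
           CAST(max_value AS INTEGER) AS max_v
    FROM ducklake_file_column_stats
    WHERE table_id = 155 AND column_id = 19
)
SELECT f.data_file_id
FROM ducklake_data_file AS f
WHERE f.table_id = 155
  AND (f.data_file_id IN (SELECT data_file_id FROM col_stats
                          WHERE min_v IS NULL OR max_v IS NULL
                             OR 0 BETWEEN min_v AND max_v)
    OR f.data_file_id IN (SELECT data_file_id FROM col_stats
                          WHERE min_v IS NULL OR max_v IS NULL
                             OR 1 BETWEEN min_v AND max_v))
ORDER BY f.data_file_id
"""


def run_demo() -> list:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ducklake_data_file (data_file_id INTEGER, table_id INTEGER)")
    con.execute(
        "CREATE TABLE ducklake_file_column_stats (data_file_id INTEGER, "
        "table_id INTEGER, column_id INTEGER, min_value TEXT, max_value TEXT)"
    )
    con.executemany("INSERT INTO ducklake_data_file VALUES (?, 155)", [(1,), (2,), (3,)])
    con.executemany(
        "INSERT INTO ducklake_file_column_stats VALUES (?, 155, ?, ?, ?)",
        [
            (1, 19, "0", "0"),  # may contain 0
            (2, 19, "1", "1"),  # may contain 1
            (3, 19, "7", "9"),  # contains neither: pruned
            (1, 21, "5", "5"),  # other column, ignored by the CTE
        ],
    )
    rows = con.execute(QUERY).fetchall()
    con.close()
    return [row[0] for row in rows]
```

Whether this actually beats repeated scans of the stats table would need to be measured on a real metadata store.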

Flame graph running in debug

flamegraph_debug

Flame graph running in release

flamegraph

Detailed profiling output when running within ducklake (weird disconnect between the Total Time and the time of any actual step)

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.245s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│               Optimizer: 0.0010s               │
│┌──────────────────────────────────────────────┐│
││        Build Side Probe Side: 0.0000s        ││
││           Column Lifetime: 0.0001s           ││
││           Common Aggregate: 0.0000s          ││
││        Common Subexpressions: 0.0000s        ││
││      Compressed Materialization: 0.0000s     ││
││          Cte Filter Pusher: 0.0000s          ││
││             Cte Inlining: 0.0000s            ││
││             Deliminator: 0.0000s             ││
││           Duplicate Groups: 0.0000s          ││
││         Empty Result Pullup: 0.0000s         ││
││         Expression Rewriter: 0.0004s         ││
││              Extension: 0.0000s              ││
││            Filter Pullup: 0.0000s            ││
││           Filter Pushdown: 0.0002s           ││
││              In Clause: 0.0000s              ││
││         Join Filter Pushdown: 0.0000s        ││
││              Join Order: 0.0002s             ││
││         Late Materialization: 0.0000s        ││
││            Limit Pushdown: 0.0000s           ││
││           Materialized Cte: 0.0000s          ││
││             Regex Range: 0.0000s             ││
││            Reorder Filter: 0.0000s           ││
││          Sampling Pushdown: 0.0000s          ││
││        Statistics Propagation: 0.0001s       ││
││             Sum Rewriter: 0.0000s            ││
││                Top N: 0.0000s                ││
││           Unnest Rewriter: 0.0000s           ││
││            Unused Columns: 0.0000s           ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│            Physical planner: 0.0001s           │
│┌──────────────────────────────────────────────┐│
││            Column Binding: 0.0000s           ││
││             Create Plan: 0.0001s             ││
││            Resolve Types: 0.0000s            ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│                Planner: 0.0005s                │
│┌──────────────────────────────────────────────┐│
││               Binding: 0.0005s               ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            path           │
│      path_is_relative     │
│      file_size_bytes      │
│        footer_size        │
│        row_id_start       │
│       begin_snapshot      │
│     partial_file_info     │
│         mapping_id        │
│            path           │
│      path_is_relative     │
│      file_size_bytes      │
│        footer_size        │
│                           │
│           5 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #0            │
│             #1            │
│             #2            │
│             #3            │
│             #4            │
│             #5            │
│             #6            │
│             #7            │
│             #8            │
│             #9            │
│            #10            │
│            #11            │
│                           │
│           5 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (SUBQUERY OR (SUBQUERY AND│
│   SUBQUERY) OR (SUBQUERY  │
│       AND SUBQUERY))      │
│                           │
│           5 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: MARK      │
│                           │
│        Conditions:        ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│     data_file_id = #0     │                                                                                                                                                               │
│                           │                                                                                                                                                               │
│        65,740 rows        │                                                                                                                                                               │
│          (0.00s)          │                                                                                                                                                               │
└─────────────┬─────────────┘                                                                                                                                                               │
┌─────────────┴─────────────┐                                                                                                                                                 ┌─────────────┴─────────────┐
│         HASH_JOIN         │                                                                                                                                                 │         PROJECTION        │
│    ────────────────────   │                                                                                                                                                 │    ────────────────────   │
│      Join Type: MARK      │                                                                                                                                                 │             #2            │
│                           │                                                                                                                                                 │                           │
│        Conditions:        ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐              │                           │
│     data_file_id = #0     │                                                                                                                                  │              │                           │
│                           │                                                                                                                                  │              │                           │
│        65,740 rows        │                                                                                                                                  │              │           4 rows          │
│          (0.00s)          │                                                                                                                                  │              │          (0.00s)          │
└─────────────┬─────────────┘                                                                                                                                  │              └─────────────┬─────────────┘
┌─────────────┴─────────────┐                                                                                                                    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                                                                                                                    │         PROJECTION        ││           FILTER          │
│    ────────────────────   │                                                                                                                    │    ────────────────────   ││    ────────────────────   │
│      Join Type: MARK      │                                                                                                                    │             #2            ││  ((max_value IS NULL) OR  │
│                           │                                                                                                                    │                           ││ (min_value IS NULL) OR (( │
│        Conditions:        │                                                                                                                    │                           ││ '8900_10000' >= min_value)│
│     data_file_id = #0     ├─────────────────────────────────────────────────────────────────────────────────────────────────────┐              │                           ││    AND ('8900_10000' <=   │
│                           │                                                                                                     │              │                           ││        max_value)))       │
│                           │                                                                                                     │              │                           ││                           │
│        65,740 rows        │                                                                                                     │              │           1 row           ││           4 rows          │
│          (0.00s)          │                                                                                                     │              │          (0.00s)          ││          (0.00s)          │
└─────────────┬─────────────┘                                                                                                     │              └─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐                                                                                       ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                                                                                       │         PROJECTION        ││           FILTER          ││         TABLE_SCAN        │
│    ────────────────────   │                                                                                       │    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│      Join Type: MARK      │                                                                                       │             #2            ││  ((max_value IS NULL) OR  ││           Table:          │
│                           │                                                                                       │                           ││ (min_value IS NULL) OR (( ││ ducklake_file_column_stats│
│        Conditions:        │                                                                                       │                           ││ '9400_12700' >= min_value)││                           │
│     data_file_id = #0     │                                                                                       │                           ││    AND ('9400_12700' <=   ││   Type: Sequential Scan   │
│                           │                                                                                       │                           ││        max_value)))       ││                           │
│                           │                                                                                       │                           ││                           ││        Projections:       │
│                           │                                                                                       │                           ││                           ││         max_value         │
│                           ├────────────────────────────────────────────────────────────────────────┐              │                           ││                           ││         min_value         │
│                           │                                                                        │              │                           ││                           ││        data_file_id       │
│                           │                                                                        │              │                           ││                           ││                           │
│                           │                                                                        │              │                           ││                           ││          Filters:         │
│                           │                                                                        │              │                           ││                           ││        table_id=155       │
│                           │                                                                        │              │                           ││                           ││        column_id=19       │
│                           │                                                                        │              │                           ││                           ││                           │
│        65,740 rows        │                                                                        │              │        30,984 rows        ││           1 row           ││        65,740 rows        │
│          (0.00s)          │                                                                        │              │          (0.00s)          ││          (0.00s)          ││          (0.06s)          │
└─────────────┬─────────────┘                                                                        │              └─────────────┬─────────────┘└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                                                          │         PROJECTION        ││           FILTER          ││         TABLE_SCAN        │
│    ────────────────────   │                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│      Join Type: MARK      │                                                          │             #2            ││  ((max_value IS NULL) OR  ││           Table:          │
│                           │                                                          │                           ││ (min_value IS NULL) OR ((1││ ducklake_file_column_stats│
│        Conditions:        │                                                          │                           ││  >= TRY_CAST(min_value AS ││                           │
│     data_file_id = #0     │                                                          │                           ││     BIGINT)) AND (1 <=    ││   Type: Sequential Scan   │
│                           │                                                          │                           ││    TRY_CAST(max_value AS  ││                           │
│                           │                                                          │                           ││         BIGINT))))        ││        Projections:       │
│                           │                                                          │                           ││                           ││         max_value         │
│                           ├───────────────────────────────────────────┐              │                           ││                           ││         min_value         │
│                           │                                           │              │                           ││                           ││        data_file_id       │
│                           │                                           │              │                           ││                           ││                           │
│                           │                                           │              │                           ││                           ││          Filters:         │
│                           │                                           │              │                           ││                           ││        table_id=155       │
│                           │                                           │              │                           ││                           ││        column_id=19       │
│                           │                                           │              │                           ││                           ││                           │
│        65,740 rows        │                                           │              │           1 row           ││        30,984 rows        ││        65,740 rows        │
│          (0.00s)          │                                           │              │          (0.00s)          ││          (0.00s)          ││          (0.04s)          │
└─────────────┬─────────────┘                                           │              └─────────────┬─────────────┘└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         PROJECTION        ││           FILTER          ││         TABLE_SCAN        │
│    ────────────────────   │                             │    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│      Join Type: LEFT      │                             │             #2            ││  ((min_value IS NULL) OR  ││           Table:          │
│                           │                             │                           ││  (max_value IS NULL) OR ( ││ ducklake_file_column_stats│
│        Conditions:        │                             │                           ││ (max_value > '9400_12699')││                           │
│data_file_id = data_file_id│                             │                           ││      AND (min_value <     ││   Type: Sequential Scan   │
│                           │                             │                           ││      '9400_12701')))      ││                           │
│                           │                             │                           ││                           ││        Projections:       │
│                           │                             │                           ││                           ││         max_value         │
│                           ├──────────────┐              │                           ││                           ││         min_value         │
│                           │              │              │                           ││                           ││        data_file_id       │
│                           │              │              │                           ││                           ││                           │
│                           │              │              │                           ││                           ││          Filters:         │
│                           │              │              │                           ││                           ││        table_id=155       │
│                           │              │              │                           ││                           ││        column_id=21       │
│                           │              │              │                           ││                           ││                           │
│        65,740 rows        │              │              │        33,910 rows        ││           1 row           ││        65,740 rows        │
│          (0.00s)          │              │              │          (0.00s)          ││          (0.00s)          ││          (0.04s)          │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        ││           FILTER          ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│           Table:          ││           Table:          ││  ((max_value IS NULL) OR  ││           Table:          │
│     ducklake_data_file    ││    ducklake_delete_file   ││ (min_value IS NULL) OR ((0││ ducklake_file_column_stats│
│                           ││                           ││  >= TRY_CAST(min_value AS ││                           │
│   Type: Sequential Scan   ││   Type: Sequential Scan   ││     BIGINT)) AND (0 <=    ││   Type: Sequential Scan   │
│                           ││                           ││    TRY_CAST(max_value AS  ││                           │
│        Projections:       ││        Projections:       ││         BIGINT))))        ││        Projections:       │
│        data_file_id       ││        data_file_id       ││                           ││         min_value         │
│       begin_snapshot      ││            path           ││                           ││         max_value         │
│            path           ││      path_is_relative     ││                           ││        data_file_id       │
│      path_is_relative     ││      file_size_bytes      ││                           ││                           │
│      file_size_bytes      ││        footer_size        ││                           ││          Filters:         │
│        footer_size        ││                           ││                           ││        table_id=155       │
│        row_id_start       ││          Filters:         ││                           ││        column_id=19       │
│     partial_file_info     ││        table_id=155       ││                           ││                           │
│         mapping_id        ││    begin_snapshot<=310    ││                           ││                           │
│                           ││    (310 < end_snapshot)   ││                           ││                           │
│          Filters:         ││                           ││                           ││                           │
│        table_id=155       ││                           ││                           ││                           │
│ ((310 < end_snapshot) OR  ││                           ││                           ││                           │
│  (end_snapshot IS NULL))  ││                           ││                           ││                           │
│                           ││                           ││                           ││                           │
│        65,740 rows        ││           0 rows          ││        33,910 rows        ││        65,740 rows        │
│          (0.01s)          ││          (0.00s)          ││          (0.00s)          ││          (0.04s)          │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└───────────────────────────┘
                                                          ┌─────────────┴─────────────┐
                                                          │         TABLE_SCAN        │
                                                          │    ────────────────────   │
                                                          │           Table:          │
                                                          │ ducklake_file_column_stats│
                                                          │                           │
                                                          │   Type: Sequential Scan   │
                                                          │                           │
                                                          │        Projections:       │
                                                          │         max_value         │
                                                          │         min_value         │
                                                          │        data_file_id       │
                                                          │                           │
                                                          │          Filters:         │
                                                          │        table_id=155       │
                                                          │        column_id=21       │
                                                          │                           │
                                                          │        65,740 rows        │
                                                          │          (0.04s)          │
                                                          └───────────────────────────┘



JSON output (does not show the extra delay)

{
    "all_optimizers": 0.0,
    "cumulative_optimizer_timing": 0.0,
    "planner": 0.0,
    "planner_binding": 0.0,
    "physical_planner": 0.0,
    "physical_planner_column_binding": 0.0,
    "physical_planner_resolve_types": 0.0,
    "physical_planner_create_plan": 0.0,
    "optimizer_expression_rewriter": 0.0,
    "optimizer_filter_pullup": 0.0,
    "optimizer_filter_pushdown": 0.0,
    "optimizer_empty_result_pullup": 0.0,
    "optimizer_cte_filter_pusher": 0.0,
    "total_bytes_written": 0,
    "total_bytes_read": 0,
    "rows_returned": 0,
    "optimizer_cte_inlining": 0.0,
    "latency": 0.0,
    "optimizer_late_materialization": 0.0,
    "result_set_size": 0,
    "optimizer_sum_rewriter": 0.0,
    "optimizer_materialized_cte": 0.0,
    "optimizer_extension": 0.0,
    "cumulative_rows_scanned": 0,
    "optimizer_join_filter_pushdown": 0.0,
    "optimizer_sampling_pushdown": 0.0,
    "optimizer_reorder_filter": 0.0,
    "cumulative_cardinality": 0,
    "optimizer_duplicate_groups": 0.0,
    "extra_info": {},
    "optimizer_compressed_materialization": 0.0,
    "cpu_time": 0.0,
    "optimizer_top_n": 0.0,
    "system_peak_temp_dir_size": 0,
    "system_peak_buffer_memory": 0,
    "blocked_thread_time": 0.0,
    "optimizer_limit_pushdown": 0.0,
    "query_name": "",
    "optimizer_build_side_probe_side": 0.0,
    "optimizer_column_lifetime": 0.0,
    "optimizer_common_aggregate": 0.0,
    "optimizer_common_subexpressions": 0.0,
    "optimizer_statistics_propagation": 0.0,
    "optimizer_unused_columns": 0.0,
    "optimizer_unnest_rewriter": 0.0,
    "optimizer_deliminator": 0.0,
    "optimizer_join_order": 0.0,
    "optimizer_in_clause": 0.0,
    "optimizer_regex_range": 0.0,
    "children": [
        {
            "total_bytes_written": 0,
            "total_bytes_read": 0,
            "result_set_size": 0,
            "operator_timing": 9e-8,
            "operator_rows_scanned": 0,
            "cumulative_rows_scanned": 0,
            "operator_cardinality": 0,
            "operator_type": "EXPLAIN_ANALYZE",
            "operator_name": "EXPLAIN_ANALYZE",
            "cumulative_cardinality": 0,
            "extra_info": {},
            "cpu_time": 0.0,
            "children": [
                {
                    "total_bytes_written": 0,
                    "total_bytes_read": 0,
                    "result_set_size": 530,
                    "operator_timing": 2.81e-7,
                    "operator_rows_scanned": 0,
                    "cumulative_rows_scanned": 0,
                    "operator_cardinality": 5,
                    "operator_type": "PROJECTION",
                    "operator_name": "PROJECTION",
                    "cumulative_cardinality": 0,
                    "extra_info": {
                        "Projections": [
                            "path",
                            "path_is_relative",
                            "file_size_bytes",
                            "footer_size",
                            "row_id_start",
                            "begin_snapshot",
                            "partial_file_info",
                            "mapping_id",
                            "path",
                            "path_is_relative",
                            "file_size_bytes",
                            "footer_size"
                        ],
                        "Estimated Cardinality": "89"
                    },
                    "cpu_time": 0.0,
                    "children": [
                        {
                            "total_bytes_written": 0,
                            "total_bytes_read": 0,
                            "result_set_size": 530,
                            "operator_timing": 3.81e-7,
                            "operator_rows_scanned": 0,
                            "cumulative_rows_scanned": 0,
                            "operator_cardinality": 5,
                            "operator_type": "PROJECTION",
                            "operator_name": "PROJECTION",
                            "cumulative_cardinality": 0,
                            "extra_info": {
                                "Projections": [
                                    "#0",
                                    "#1",
                                    "#2",
                                    "#3",
                                    "#4",
                                    "#5",
                                    "#6",
                                    "#7",
                                    "#8",
                                    "#9",
                                    "#10",
                                    "#11"
                                ],
                                "Estimated Cardinality": "89"
                            },
                            "cpu_time": 0.0,
                            "children": [
                                {
                                    "total_bytes_written": 0,
                                    "total_bytes_read": 0,
                                    "result_set_size": 555,
                                    "operator_timing": 0.00032106999999999985,
                                    "operator_rows_scanned": 0,
                                    "cumulative_rows_scanned": 0,
                                    "operator_cardinality": 5,
                                    "operator_type": "FILTER",
                                    "operator_name": "FILTER",
                                    "cumulative_cardinality": 0,
                                    "extra_info": {
                                        "Expression": "(SUBQUERY OR (SUBQUERY AND SUBQUERY) OR (SUBQUERY AND SUBQUERY))",
                                        "Estimated Cardinality": "89"
                                    },
                                    "cpu_time": 0.0,
                                    "children": [
                                        {
                                            "total_bytes_written": 0,
                                            "total_bytes_read": 0,
                                            "result_set_size": 7297140,
                                            "operator_timing": 0.0002434799999999999,
                                            "operator_rows_scanned": 0,
                                            "cumulative_rows_scanned": 0,
                                            "operator_cardinality": 65740,
                                            "operator_type": "HASH_JOIN",
                                            "operator_name": "HASH_JOIN",
                                            "cumulative_cardinality": 0,
                                            "extra_info": {
                                                "Join Type": "MARK",
                                                "Conditions": "data_file_id = #0",
                                                "Estimated Cardinality": "449"
                                            },
                                            "cpu_time": 0.0,
                                            "children": [
                                                {
                                                    "total_bytes_written": 0,
                                                    "total_bytes_read": 0,
                                                    "result_set_size": 7757320,
                                                    "operator_timing": 0.00018701699999999996,
                                                    "operator_rows_scanned": 0,
                                                    "cumulative_rows_scanned": 0,
                                                    "operator_cardinality": 65740,
                                                    "operator_type": "HASH_JOIN",
                                                    "operator_name": "HASH_JOIN",
                                                    "cumulative_cardinality": 0,
                                                    "extra_info": {
                                                        "Join Type": "MARK",
                                                        "Conditions": "data_file_id = #0",
                                                        "Estimated Cardinality": "449"
                                                    },
                                                    "cpu_time": 0.0,
                                                    "children": [
                                                        {
                                                            "total_bytes_written": 0,
                                                            "total_bytes_read": 0,
                                                            "result_set_size": 7691580,
                                                            "operator_timing": 0.002068704,
                                                            "operator_rows_scanned": 0,
                                                            "cumulative_rows_scanned": 0,
                                                            "operator_cardinality": 65740,
                                                            "operator_type": "HASH_JOIN",
                                                            "operator_name": "HASH_JOIN",
                                                            "cumulative_cardinality": 0,
                                                            "extra_info": {
                                                                "Join Type": "MARK",
                                                                "Conditions": "data_file_id = #0",
                                                                "Estimated Cardinality": "449"
                                                            },
                                                            "cpu_time": 0.0,
                                                            "children": [
                                                                {
                                                                    "total_bytes_written": 0,
                                                                    "total_bytes_read": 0,
                                                                    "result_set_size": 7625840,
                                                                    "operator_timing": 0.00018682599999999994,
                                                                    "operator_rows_scanned": 0,
                                                                    "cumulative_rows_scanned": 0,
                                                                    "operator_cardinality": 65740,
                                                                    "operator_type": "HASH_JOIN",
                                                                    "operator_name": "HASH_JOIN",
                                                                    "cumulative_cardinality": 0,
                                                                    "extra_info": {
                                                                        "Join Type": "MARK",
                                                                        "Conditions": "data_file_id = #0",
                                                                        "Estimated Cardinality": "449"
                                                                    },
                                                                    "cpu_time": 0.0,
                                                                    "children": [
                                                                        {
                                                                            "total_bytes_written": 0,
                                                                            "total_bytes_read": 0,
                                                                            "result_set_size": 7560100,
                                                                            "operator_timing": 0.0023192239999999995,
                                                                            "operator_rows_scanned": 0,
                                                                            "cumulative_rows_scanned": 0,
                                                                            "operator_cardinality": 65740,
                                                                            "operator_type": "HASH_JOIN",
                                                                            "operator_name": "HASH_JOIN",
                                                                            "cumulative_cardinality": 0,
                                                                            "extra_info": {
                                                                                "Join Type": "MARK",
                                                                                "Conditions": "data_file_id = #0",
                                                                                "Estimated Cardinality": "449"
                                                                            },
                                                                            "cpu_time": 0.0,
                                                                            "children": [
                                                                                {
                                                                                    "total_bytes_written": 0,
                                                                                    "total_bytes_read": 0,
                                                                                    "result_set_size": 7494360,
                                                                                    "operator_timing": 0.000019537999999999998,
                                                                                    "operator_rows_scanned": 0,
                                                                                    "cumulative_rows_scanned": 0,
                                                                                    "operator_cardinality": 65740,
                                                                                    "operator_type": "HASH_JOIN",
                                                                                    "operator_name": "HASH_JOIN",
                                                                                    "cumulative_cardinality": 0,
                                                                                    "extra_info": {
                                                                                        "Join Type": "LEFT",
                                                                                        "Conditions": "data_file_id = data_file_id",
                                                                                        "Estimated Cardinality": "449"
                                                                                    },
                                                                                    "cpu_time": 0.0,
                                                                                    "children": [
                                                                                        {
                                                                                            "total_bytes_written": 0,
                                                                                            "total_bytes_read": 0,
                                                                                            "result_set_size": 5324940,
                                                                                            "operator_timing": 0.0036477790000000003,
                                                                                            "operator_rows_scanned": 71267,
                                                                                            "cumulative_rows_scanned": 0,
                                                                                            "operator_cardinality": 65740,
                                                                                            "operator_type": "TABLE_SCAN",
                                                                                            "operator_name": "SEQ_SCAN ",
                                                                                            "cumulative_cardinality": 0,
                                                                                            "extra_info": {
                                                                                                "Table": "ducklake_data_file",
                                                                                                "Type": "Sequential Scan",
                                                                                                "Projections": [
                                                                                                    "data_file_id",
                                                                                                    "begin_snapshot",
                                                                                                    "path",
                                                                                                    "path_is_relative",
                                                                                                    "file_size_bytes",
                                                                                                    "footer_size",
                                                                                                    "row_id_start",
                                                                                                    "partial_file_info",
                                                                                                    "mapping_id"
                                                                                                ],
                                                                                                "Filters": [
                                                                                                    "table_id=155",
                                                                                                    "((310 < end_snapshot) OR (end_snapshot IS NULL))"
                                                                                                ],
                                                                                                "Estimated Cardinality": "449"
                                                                                            },
                                                                                            "cpu_time": 0.0,
                                                                                            "children": []
                                                                                        },
                                                                                        {
                                                                                            "total_bytes_written": 0,
                                                                                            "total_bytes_read": 0,
                                                                                            "result_set_size": 0,
                                                                                            "operator_timing": 3.11e-7,
                                                                                            "operator_rows_scanned": 0,
                                                                                            "cumulative_rows_scanned": 0,
                                                                                            "operator_cardinality": 0,
                                                                                            "operator_type": "TABLE_SCAN",
                                                                                            "operator_name": "SEQ_SCAN ",
                                                                                            "cumulative_cardinality": 0,
                                                                                            "extra_info": {
                                                                                                "Table": "ducklake_delete_file",
                                                                                                "Type": "Sequential Scan",
                                                                                                "Projections": [
                                                                                                    "data_file_id",
                                                                                                    "path",
                                                                                                    "path_is_relative",
                                                                                                    "file_size_bytes",
                                                                                                    "footer_size"
                                                                                                ],
                                                                                                "Filters": [
                                                                                                    "table_id=155",
                                                                                                    "begin_snapshot<=310",
                                                                                                    "(310 < end_snapshot)"
                                                                                                ],
                                                                                                "Estimated Cardinality": "0"
                                                                                            },
                                                                                            "cpu_time": 0.0,
                                                                                            "children": []
                                                                                        }
                                                                                    ]
                                                                                },
                                                                                {
                                                                                    "total_bytes_written": 0,
                                                                                    "total_bytes_read": 0,
                                                                                    "result_set_size": 271280,
                                                                                    "operator_timing": 0.000023602000000000053,
                                                                                    "operator_rows_scanned": 0,
                                                                                    "cumulative_rows_scanned": 0,
                                                                                    "operator_cardinality": 33910,
                                                                                    "operator_type": "PROJECTION",
                                                                                    "operator_name": "PROJECTION",
                                                                                    "cumulative_cardinality": 0,
                                                                                    "extra_info": {
                                                                                        "Projections": "#2",
                                                                                        "Estimated Cardinality": "1887"
                                                                                    },
                                                                                    "cpu_time": 0.0,
                                                                                    "children": [
                                                                                        {
                                                                                            "total_bytes_written": 0,
                                                                                            "total_bytes_read": 0,
                                                                                            "result_set_size": 1356400,
                                                                                            "operator_timing": 0.0019298339999999985,
                                                                                            "operator_rows_scanned": 0,
                                                                                            "cumulative_rows_scanned": 0,
                                                                                            "operator_cardinality": 33910,
                                                                                            "operator_type": "FILTER",
                                                                                            "operator_name": "FILTER",
                                                                                            "cumulative_cardinality": 0,
                                                                                            "extra_info": {
                                                                                                "Expression": "((max_value IS NULL) OR (min_value IS NULL) OR ((0 >= TRY_CAST(min_value AS BIGINT)) AND (0 <= TRY_CAST(max_value AS BIGINT))))",
                                                                                                "Estimated Cardinality": "1887"
                                                                                            },
                                                                                            "cpu_time": 0.0,
                                                                                            "children": [
                                                                                                {
                                                                                                    "total_bytes_written": 0,
                                                                                                    "total_bytes_read": 0,
                                                                                                    "result_set_size": 2629600,
                                                                                                    "operator_timing": 0.033753488,
                                                                                                    "operator_rows_scanned": 19507982,
                                                                                                    "cumulative_rows_scanned": 0,
                                                                                                    "operator_cardinality": 65740,
                                                                                                    "operator_type": "TABLE_SCAN",
                                                                                                    "operator_name": "SEQ_SCAN ",
                                                                                                    "cumulative_cardinality": 0,
                                                                                                    "extra_info": {
                                                                                                        "Table": "ducklake_file_column_stats",
                                                                                                        "Type": "Sequential Scan",
                                                                                                        "Projections": [
                                                                                                            "max_value",
                                                                                                            "min_value",
                                                                                                            "data_file_id"
                                                                                                        ],
                                                                                                        "Filters": [
                                                                                                            "table_id=155",
                                                                                                            "column_id=21"
                                                                                                        ],
                                                                                                        "Estimated Cardinality": "9438"
                                                                                                    },
                                                                                                    "cpu_time": 0.0,
                                                                                                    "children": []
                                                                                                }
                                                                                            ]
                                                                                        }
                                                                                    ]
                                                                                }
                                                                            ]
                                                                        },
                                                                        {
                                                                            "total_bytes_written": 0,
                                                                            "total_bytes_read": 0,
                                                                            "result_set_size": 8,
                                                                            "operator_timing": 1.8e-7,
                                                                            "operator_rows_scanned": 0,
                                                                            "cumulative_rows_scanned": 0,
                                                                            "operator_cardinality": 1,
                                                                            "operator_type": "PROJECTION",
                                                                            "operator_name": "PROJECTION",
                                                                            "cumulative_cardinality": 0,
                                                                            "extra_info": {
                                                                                "Projections": "#2",
                                                                                "Estimated Cardinality": "1887"
                                                                            },
                                                                            "cpu_time": 0.0,
                                                                            "children": [
                                                                                {
                                                                                    "total_bytes_written": 0,
                                                                                    "total_bytes_read": 0,
                                                                                    "result_set_size": 40,
                                                                                    "operator_timing": 0.001393194000000002,
                                                                                    "operator_rows_scanned": 0,
                                                                                    "cumulative_rows_scanned": 0,
                                                                                    "operator_cardinality": 1,
                                                                                    "operator_type": "FILTER",
                                                                                    "operator_name": "FILTER",
                                                                                    "cumulative_cardinality": 0,
                                                                                    "extra_info": {
                                                                                        "Expression": "((min_value IS NULL) OR (max_value IS NULL) OR ((max_value > '9400_12699') AND (min_value < '9400_12701')))",
                                                                                        "Estimated Cardinality": "1887"
                                                                                    },
                                                                                    "cpu_time": 0.0,
                                                                                    "children": [
                                                                                        {
                                                                                            "total_bytes_written": 0,
                                                                                            "total_bytes_read": 0,
                                                                                            "result_set_size": 2629600,
                                                                                            "operator_timing": 0.034515498,
                                                                                            "operator_rows_scanned": 19507982,
                                                                                            "cumulative_rows_scanned": 0,
                                                                                            "operator_cardinality": 65740,
                                                                                            "operator_type": "TABLE_SCAN",
                                                                                            "operator_name": "SEQ_SCAN ",
                                                                                            "cumulative_cardinality": 0,
                                                                                            "extra_info": {
                                                                                                "Table": "ducklake_file_column_stats",
                                                                                                "Type": "Sequential Scan",
                                                                                                "Projections": [
                                                                                                    "min_value",
                                                                                                    "max_value",
                                                                                                    "data_file_id"
                                                                                                ],
                                                                                                "Filters": [
                                                                                                    "table_id=155",
                                                                                                    "column_id=19"
                                                                                                ],
                                                                                                "Estimated Cardinality": "9438"
                                                                                            },
                                                                                            "cpu_time": 0.0,
                                                                                            "children": []
                                                                                        }
                                                                                    ]
                                                                                }
                                                                            ]
                                                                        }
                                                                    ]
                                                                },
                                                                {
                                                                    "total_bytes_written": 0,
                                                                    "total_bytes_read": 0,
                                                                    "result_set_size": 247872,
                                                                    "operator_timing": 0.000012607999999999983,
                                                                    "operator_rows_scanned": 0,
                                                                    "cumulative_rows_scanned": 0,
                                                                    "operator_cardinality": 30984,
                                                                    "operator_type": "PROJECTION",
                                                                    "operator_name": "PROJECTION",
                                                                    "cumulative_cardinality": 0,
                                                                    "extra_info": {
                                                                        "Projections": "#2",
                                                                        "Estimated Cardinality": "1887"
                                                                    },
                                                                    "cpu_time": 0.0,
                                                                    "children": [
                                                                        {
                                                                            "total_bytes_written": 0,
                                                                            "total_bytes_read": 0,
                                                                            "result_set_size": 1239360,
                                                                            "operator_timing": 0.002247152,
                                                                            "operator_rows_scanned": 0,
                                                                            "cumulative_rows_scanned": 0,
                                                                            "operator_cardinality": 30984,
                                                                            "operator_type": "FILTER",
                                                                            "operator_name": "FILTER",
                                                                            "cumulative_cardinality": 0,
                                                                            "extra_info": {
                                                                                "Expression": "((max_value IS NULL) OR (min_value IS NULL) OR ((1 >= TRY_CAST(min_value AS BIGINT)) AND (1 <= TRY_CAST(max_value AS BIGINT))))",
                                                                                "Estimated Cardinality": "1887"
                                                                            },
                                                                            "cpu_time": 0.0,
                                                                            "children": [
                                                                                {
                                                                                    "total_bytes_written": 0,
                                                                                    "total_bytes_read": 0,
                                                                                    "result_set_size": 2629600,
                                                                                    "operator_timing": 0.03399253999999999,
                                                                                    "operator_rows_scanned": 19507982,
                                                                                    "cumulative_rows_scanned": 0,
                                                                                    "operator_cardinality": 65740,
                                                                                    "operator_type": "TABLE_SCAN",
                                                                                    "operator_name": "SEQ_SCAN ",
                                                                                    "cumulative_cardinality": 0,
                                                                                    "extra_info": {
                                                                                        "Table": "ducklake_file_column_stats",
                                                                                        "Type": "Sequential Scan",
                                                                                        "Projections": [
                                                                                            "max_value",
                                                                                            "min_value",
                                                                                            "data_file_id"
                                                                                        ],
                                                                                        "Filters": [
                                                                                            "table_id=155",
                                                                                            "column_id=21"
                                                                                        ],
                                                                                        "Estimated Cardinality": "9438"
                                                                                    },
                                                                                    "cpu_time": 0.0,
                                                                                    "children": []
                                                                                }
                                                                            ]
                                                                        }
                                                                    ]
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "total_bytes_written": 0,
                                                            "total_bytes_read": 0,
                                                            "result_set_size": 8,
                                                            "operator_timing": 2.31e-7,
                                                            "operator_rows_scanned": 0,
                                                            "cumulative_rows_scanned": 0,
                                                            "operator_cardinality": 1,
                                                            "operator_type": "PROJECTION",
                                                            "operator_name": "PROJECTION",
                                                            "cumulative_cardinality": 0,
                                                            "extra_info": {
                                                                "Projections": "#2",
                                                                "Estimated Cardinality": "1887"
                                                            },
                                                            "cpu_time": 0.0,
                                                            "children": [
                                                                {
                                                                    "total_bytes_written": 0,
                                                                    "total_bytes_read": 0,
                                                                    "result_set_size": 40,
                                                                    "operator_timing": 0.0013645559999999983,
                                                                    "operator_rows_scanned": 0,
                                                                    "cumulative_rows_scanned": 0,
                                                                    "operator_cardinality": 1,
                                                                    "operator_type": "FILTER",
                                                                    "operator_name": "FILTER",
                                                                    "cumulative_cardinality": 0,
                                                                    "extra_info": {
                                                                        "Expression": "((max_value IS NULL) OR (min_value IS NULL) OR (('9400_12700' >= min_value) AND ('9400_12700' <= max_value)))",
                                                                        "Estimated Cardinality": "1887"
                                                                    },
                                                                    "cpu_time": 0.0,
                                                                    "children": [
                                                                        {
                                                                            "total_bytes_written": 0,
                                                                            "total_bytes_read": 0,
                                                                            "result_set_size": 2629600,
                                                                            "operator_timing": 0.034872303999999965,
                                                                            "operator_rows_scanned": 19507982,
                                                                            "cumulative_rows_scanned": 0,
                                                                            "operator_cardinality": 65740,
                                                                            "operator_type": "TABLE_SCAN",
                                                                            "operator_name": "SEQ_SCAN ",
                                                                            "cumulative_cardinality": 0,
                                                                            "extra_info": {
                                                                                "Table": "ducklake_file_column_stats",
                                                                                "Type": "Sequential Scan",
                                                                                "Projections": [
                                                                                    "max_value",
                                                                                    "min_value",
                                                                                    "data_file_id"
                                                                                ],
                                                                                "Filters": [
                                                                                    "table_id=155",
                                                                                    "column_id=19"
                                                                                ],
                                                                                "Estimated Cardinality": "9438"
                                                                            },
                                                                            "cpu_time": 0.0,
                                                                            "children": []
                                                                        }
                                                                    ]
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                },
                                                {
                                                    "total_bytes_written": 0,
                                                    "total_bytes_read": 0,
                                                    "result_set_size": 32,
                                                    "operator_timing": 3.5e-7,
                                                    "operator_rows_scanned": 0,
                                                    "cumulative_rows_scanned": 0,
                                                    "operator_cardinality": 4,
                                                    "operator_type": "PROJECTION",
                                                    "operator_name": "PROJECTION",
                                                    "cumulative_cardinality": 0,
                                                    "extra_info": {
                                                        "Projections": "#2",
                                                        "Estimated Cardinality": "1887"
                                                    },
                                                    "cpu_time": 0.0,
                                                    "children": [
                                                        {
                                                            "total_bytes_written": 0,
                                                            "total_bytes_read": 0,
                                                            "result_set_size": 160,
                                                            "operator_timing": 0.0014502560000000015,
                                                            "operator_rows_scanned": 0,
                                                            "cumulative_rows_scanned": 0,
                                                            "operator_cardinality": 4,
                                                            "operator_type": "FILTER",
                                                            "operator_name": "FILTER",
                                                            "cumulative_cardinality": 0,
                                                            "extra_info": {
                                                                "Expression": "((max_value IS NULL) OR (min_value IS NULL) OR (('8900_10000' >= min_value) AND ('8900_10000' <= max_value)))",
                                                                "Estimated Cardinality": "1887"
                                                            },
                                                            "cpu_time": 0.0,
                                                            "children": [
                                                                {
                                                                    "total_bytes_written": 0,
                                                                    "total_bytes_read": 0,
                                                                    "result_set_size": 2629600,
                                                                    "operator_timing": 0.03502197799999997,
                                                                    "operator_rows_scanned": 19507982,
                                                                    "cumulative_rows_scanned": 0,
                                                                    "operator_cardinality": 65740,
                                                                    "operator_type": "TABLE_SCAN",
                                                                    "operator_name": "SEQ_SCAN ",
                                                                    "cumulative_cardinality": 0,
                                                                    "extra_info": {
                                                                        "Table": "ducklake_file_column_stats",
                                                                        "Type": "Sequential Scan",
                                                                        "Projections": [
                                                                            "max_value",
                                                                            "min_value",
                                                                            "data_file_id"
                                                                        ],
                                                                        "Filters": [
                                                                            "table_id=155",
                                                                            "column_id=19"
                                                                        ],
                                                                        "Estimated Cardinality": "9438"
                                                                    },
                                                                    "cpu_time": 0.0,
                                                                    "children": []
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Future work:

  • Currently, it does not try to apply any logic to explicitly rearrange or reorder the queries to be simpler; it delegates all of that to the FilterCombiner. There is likely further room for simplification between branches of the expression tree that the FilterCombiner does not currently handle
  • On equality constraints like X=Y AND X<400 we only add the additional constraint Y<400, but not the additional constraints that max X >= min Y AND min X <= max Y
  • Maybe limit when these filters are applied or to what extent we recurse
  • Virtual column handling and virtual partition handling
  • There's no real way to analyze the time spent on these queries with EXPLAIN ANALYZE; instead I was just logging them and running them separately
  • Better deduplication of adding filters
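
To make the second bullet concrete: for a predicate X=Y, a file can only contain a matching row if the value ranges of the two columns overlap. The snippet below is a minimal sketch of that check, not code from this PR; the function name and the plain Python values standing in for the stored min/max stats are assumptions for illustration.

```python
from typing import Optional


def ranges_may_overlap(
    min_x: Optional[int],
    max_x: Optional[int],
    min_y: Optional[int],
    max_y: Optional[int],
) -> bool:
    """Return True if a file could contain a row where X = Y.

    Hypothetical helper: the arguments stand in for the per-file
    min/max statistics of columns X and Y.
    """
    # Missing statistics mean we cannot prove anything, so keep the file.
    if any(v is None for v in (min_x, max_x, min_y, max_y)):
        return True
    # X = Y is only possible when the two closed ranges intersect:
    # max X >= min Y AND min X <= max Y.
    return max_x >= min_y and min_x <= max_y


# A file with X in [0, 10] and Y in [11, 20] can never satisfy X = Y.
print(ranges_may_overlap(0, 10, 11, 20))
```

A file is dropped only when the check returns False, which matches the "keep the file when stats are NULL" behaviour of the generated filters.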

Example:

Incoming Query:

FROM test_table
WHERE ((partition_col = 0 AND partition_col_2 > '9400_12699' AND partition_col_2 < '9400_12701')
OR (partition_col = 1 AND partition_col_2 = '9400_12700'))
OR (partition_col_2 = '8900_10000');

Previously, the query I actually ran (with the bounding comparisons replaced by an equality) queried all 65740 files; now it queries just the 5 relevant ones.
Generated (after formatting with sqlglot):

SELECT
  data.path,
  data.path_is_relative,
  data.file_size_bytes,
  data.footer_size,
  data.row_id_start,
  data.begin_snapshot,
  data.partial_file_info,
  data.mapping_id,
  del.path,
  del.path_is_relative,
  del.file_size_bytes,
  del.footer_size
FROM "__ducklake_metadata_ducklake"."main".ducklake_data_file AS data
LEFT JOIN (
  SELECT
    *
  FROM "__ducklake_metadata_ducklake"."main".ducklake_delete_file
  WHERE
    table_id = 155
    AND 310 >= begin_snapshot
    AND (
      310 < end_snapshot OR end_snapshot IS NULL
    )
) AS del
  USING (data_file_id)
WHERE
  data.table_id = 155
  AND 310 >= data.begin_snapshot
  AND (
    310 < data.end_snapshot OR data.end_snapshot IS NULL
  )
  AND (
    (
      (
        data_file_id IN (
          SELECT
            data_file_id
          FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
          WHERE
            table_id = 155
            AND column_id = 21
            AND (
              max_value IS NULL
              OR min_value IS NULL
              OR 0 BETWEEN TRY_CAST(min_value AS BIGINT) AND TRY_CAST(max_value AS BIGINT)
            )
        )
        AND data_file_id IN (
          SELECT
            data_file_id
          FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
          WHERE
            table_id = 155
            AND column_id = 19
            AND (
              min_value IS NULL
              OR max_value IS NULL
              OR max_value > '9400_12699'
              AND min_value < '9400_12701'
            )
        )
      )
    )
    OR (
      (
        data_file_id IN (
          SELECT
            data_file_id
          FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
          WHERE
            table_id = 155
            AND column_id = 21
            AND (
              max_value IS NULL
              OR min_value IS NULL
              OR 1 BETWEEN TRY_CAST(min_value AS BIGINT) AND TRY_CAST(max_value AS BIGINT)
            )
        )
        AND data_file_id IN (
          SELECT
            data_file_id
          FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
          WHERE
            table_id = 155
            AND column_id = 19
            AND (
              max_value IS NULL
              OR min_value IS NULL
              OR '9400_12700' BETWEEN min_value AND max_value
            )
        )
      )
    )
    OR (
      data_file_id IN (
        SELECT
          data_file_id
        FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
        WHERE
          table_id = 155
          AND column_id = 19
          AND (
            max_value IS NULL
            OR min_value IS NULL
            OR '8900_10000' BETWEEN min_value AND max_value
          )
      )
    )
  )
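
The per-file decision that the generated SQL encodes can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the extension's implementation: the helper names and the `Stats` mapping are hypothetical, and the stats are assumed to already hold typed values (the real query compares strings and uses TRY_CAST for the BIGINT column).

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical per-file stats: column name -> (min_value, max_value).
Stats = Dict[str, Tuple[Optional[Any], Optional[Any]]]


def may_equal(stats: Stats, col: str, value: Any) -> bool:
    """File may contain col = value (value BETWEEN min AND max)."""
    lo, hi = stats.get(col, (None, None))
    if lo is None or hi is None:
        return True  # no stats: the file cannot be pruned
    return lo <= value <= hi


def may_be_between(stats: Stats, col: str, low: Any, high: Any) -> bool:
    """File may contain low < col < high (max > low AND min < high)."""
    lo, hi = stats.get(col, (None, None))
    if lo is None or hi is None:
        return True
    return hi > low and lo < high


def file_may_match(stats: Stats) -> bool:
    """Mirror the OR-of-ANDs shape of the example WHERE clause."""
    return (
        (may_equal(stats, "partition_col", 0)
         and may_be_between(stats, "partition_col_2", "9400_12699", "9400_12701"))
        or (may_equal(stats, "partition_col", 1)
            and may_equal(stats, "partition_col_2", "9400_12700"))
        or may_equal(stats, "partition_col_2", "8900_10000")
    )


kept = file_may_match({"partition_col": (0, 0),
                       "partition_col_2": ("9400_12700", "9400_12700")})
pruned = file_may_match({"partition_col": (2, 2),
                         "partition_col_2": ("1000_1", "1000_1")})
print(kept, pruned)
```

Each OR branch keeps its own set of candidate files and each AND intersects them, which is what the nested `data_file_id IN (...)` subqueries express.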

Disclaimer: I used AI to write the tests provided here and then manually modified them, because of the need to cover a lot of cases. I also annotated them with where our performance could be improved a bit.

@J-Meyers
Contributor Author

The test failure is unrelated, I think:

FAILED: [code=1] CMakeFiles/duckdb_local_extension_repo /home/runner/work/ducklake/ducklake/build/reldebug/CMakeFiles/duckdb_local_extension_repo 
cd /home/runner/work/ducklake/ducklake/duckdb && /usr/bin/python3.12 scripts/create_local_extension_repo.py v1.4.0 /home/runner/work/ducklake/ducklake/build/reldebug/duckdb_platform_out /home/runner/work/ducklake/ducklake/build/reldebug /home/runner/work/ducklake/ducklake/build/reldebug/repository duckdb_extension
Traceback (most recent call last):
  File "/home/runner/work/ducklake/ducklake/duckdb/scripts/create_local_extension_repo.py", line 43, in <module>
    shutil.copy(file, dest_file)
  File "/usr/lib/python3.12/shutil.py", line 435, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.12/shutil.py", line 273, in copyfile
    _fastcopy_sendfile(fsrc, fdst)
  File "/usr/lib/python3.12/shutil.py", line 164, in _fastcopy_sendfile
    raise err from None
  File "/usr/lib/python3.12/shutil.py", line 150, in _fastcopy_sendfile
Error: No space left on device : '/home/runner/actions-runner/cached/_diag/pages/f7239b5f-7c77-48a0-b23b-cc692e015a69_020c01b5-1a7b-5dad-a8c4-bcc03a5a03be_1.log'

@J-Meyers J-Meyers force-pushed the complex_pushdown_filter branch from d94bdab to 45fb399 Compare September 26, 2025 17:55
@J-Meyers
Contributor Author

J-Meyers commented Sep 26, 2025

Update: I implemented the CTEs for column stats; this changes the generated SQL to the one below.

I add a MATERIALIZED hint whenever a CTE is referenced multiple times. DuckDB defaults to materializing, but I wasn't sure whether other databases used as the metastore would.

Switching to CTEs reduced the observed slowdown: it is now ~0.12 seconds instead of ~0.25 seconds, but it is still there. Running the query alone via the CLI takes ~0.037 seconds.
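
As a rough illustration of the hint rule described above (not the actual code in this PR), the sketch below builds one CTE per stats column and adds MATERIALIZED only when the CTE is referenced more than once. The function name, the reference-count input, and the exact column list are assumptions; the `col_<id>_stats` naming follows the CTE names visible in the profiling output.

```python
from typing import Dict


def build_stats_ctes(column_ref_counts: Dict[int, int], table_id: int) -> str:
    """Build a WITH clause with one stats CTE per column id.

    Hypothetical helper: column_ref_counts maps column_id to how many
    times the generated filter references that column's stats.
    """
    ctes = []
    for column_id, ref_count in sorted(column_ref_counts.items()):
        # Only ask for materialization when the CTE is reused.
        hint = "MATERIALIZED " if ref_count > 1 else ""
        ctes.append(
            f"col_{column_id}_stats AS {hint}(\n"
            f"  SELECT data_file_id, min_value, max_value\n"
            f"  FROM ducklake_file_column_stats\n"
            f"  WHERE table_id = {table_id} AND column_id = {column_id}\n"
            f")"
        )
    if not ctes:
        return ""
    return "WITH " + ",\n".join(ctes)


# In the example query, column 19 is referenced three times and column 21 twice.
print(build_stats_ctes({19: 3, 21: 2}, table_id=155))
```

The `AS MATERIALIZED` form is accepted by both DuckDB and PostgreSQL, so spelling it out keeps the behaviour the same across those metastores.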

Updated detailed profiling output

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.126s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│               Optimizer: 0.0004s               │
│┌──────────────────────────────────────────────┐│
││        Build Side Probe Side: 0.0000s        ││
││           Column Lifetime: 0.0000s           ││
││           Common Aggregate: 0.0000s          ││
││        Common Subexpressions: 0.0000s        ││
││      Compressed Materialization: 0.0000s     ││
││          Cte Filter Pusher: 0.0000s          ││
││             Cte Inlining: 0.0000s            ││
││             Deliminator: 0.0000s             ││
││           Duplicate Groups: 0.0000s          ││
││         Empty Result Pullup: 0.0000s         ││
││         Expression Rewriter: 0.0001s         ││
││              Extension: 0.0000s              ││
││            Filter Pullup: 0.0000s            ││
││           Filter Pushdown: 0.0001s           ││
││              In Clause: 0.0000s              ││
││         Join Filter Pushdown: 0.0000s        ││
││              Join Order: 0.0001s             ││
││         Late Materialization: 0.0000s        ││
││            Limit Pushdown: 0.0000s           ││
││           Materialized Cte: 0.0000s          ││
││             Regex Range: 0.0000s             ││
││            Reorder Filter: 0.0000s           ││
││          Sampling Pushdown: 0.0000s          ││
││        Statistics Propagation: 0.0000s       ││
││             Sum Rewriter: 0.0000s            ││
││                Top N: 0.0000s                ││
││           Unnest Rewriter: 0.0000s           ││
││            Unused Columns: 0.0000s           ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│            Physical planner: 0.0000s           │
│┌──────────────────────────────────────────────┐│
││            Column Binding: 0.0000s           ││
││             Create Plan: 0.0000s             ││
││            Resolve Types: 0.0000s            ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│                Planner: 0.0002s                │
│┌──────────────────────────────────────────────┐│
││               Binding: 0.0002s               ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────┐
│         QUERY         │
└───────────┬───────────┘
┌───────────┴───────────┐
│          CTE          │
│    ────────────────   │
│       CTE Name:       │
│      col_21_stats     │
│                       ├────────────┐
│     Table Index: 0    │            │
│                       │            │
│         0 rows        │            │
│        (0.00s)        │            │
└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐
│         FILTER        ││          CTE          │
│    ────────────────   ││    ────────────────   │
│ (((max_value IS NULL) ││       CTE Name:       │
│  OR (min_value IS NULL││      col_19_stats     │
│  ) OR ((0 >= TRY_CAST ││                       │
│ (min_value AS BIGINT))││     Table Index: 8    │
│   AND (0 <= TRY_CAST  ││                       │
│ (max_value AS BIGINT))││                       │
│  )) OR ((max_value IS ││                       │
│   NULL) OR (min_value ││                       ├────────────┐
│   IS NULL) OR ((1 >=  ││                       │            │
│  TRY_CAST(min_value AS││                       │            │
│   BIGINT)) AND (1 <=  ││                       │            │
│  TRY_CAST(max_value AS││                       │            │
│       BIGINT)))))     ││                       │            │
│                       ││                       │            │
│      64,894 rows      ││         0 rows        │            │
│        (0.00s)        ││        (0.00s)        │            │
└───────────┬───────────┘└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
│       TABLE_SCAN      ││         FILTER        ││       PROJECTION      │
│    ────────────────   ││    ────────────────   ││    ────────────────   │
│         Table:        ││ (((max_value IS NULL) ││          path         │
│ducklake_file_column_st││  OR (min_value IS NULL││    path_is_relative   │
│          ats          ││ ) OR (('8900_10000' >=││    file_size_bytes    │
│                       ││    min_value) AND (   ││      footer_size      │
│         Type:         ││    '8900_10000' <=    ││      row_id_start     │
│    Sequential Scan    ││   max_value))) OR ((  ││     begin_snapshot    │
│                       ││ (min_value IS NULL) OR││   partial_file_info   │
│      Projections:     ││   (max_value IS NULL) ││       mapping_id      │
│      data_file_id     ││    OR ((max_value >   ││          path         │
│       max_value       ││   '9400_12699') AND   ││    path_is_relative   │
│       min_value       ││     (min_value <      ││    file_size_bytes    │
│                       ││  '9400_12701'))) OR ( ││      footer_size      │
│        Filters:       ││ (max_value IS NULL) OR││                       │
│      column_id=21     ││   (min_value IS NULL) ││                       │
│      table_id=155     ││  OR (('9400_12700' >= ││                       │
│                       ││    min_value) AND (   ││                       │
│                       ││    '9400_12700' <=    ││                       │
│                       ││     max_value)))))    ││                       │
│                       ││                       ││                       │
│      65,740 rows      ││         5 rows        ││         5 rows        │
│        (0.06s)        ││        (0.00s)        ││        (0.00s)        │
└───────────────────────┘└───────────┬───────────┘└───────────┬───────────┘
                         ┌───────────┴───────────┐┌───────────┴───────────┐
                         │       TABLE_SCAN      ││       PROJECTION      │
                         │    ────────────────   ││    ────────────────   │
                         │         Table:        ││           #0          │
                         │ducklake_file_column_st││           #1          │
                         │          ats          ││           #2          │
                         │                       ││           #3          │
                         │         Type:         ││           #4          │
                         │    Sequential Scan    ││           #5          │
                         │                       ││           #6          │
                         │      Projections:     ││           #7          │
                         │      data_file_id     ││           #8          │
                         │       min_value       ││           #9          │
                         │       max_value       ││          #10          │
                         │                       ││          #11          │
                         │        Filters:       ││                       │
                         │      column_id=19     ││                       │
                         │      table_id=155     ││                       │
                         │                       ││                       │
                         │      65,740 rows      ││         5 rows        │
                         │        (0.04s)        ││        (0.00s)        │
                         └───────────────────────┘└───────────┬───────────┘
                                                  ┌───────────┴───────────┐
                                                  │         FILTER        │
                                                  │    ────────────────   │
                                                  │ (SUBQUERY OR (SUBQUERY│
                                                  │    AND SUBQUERY) OR   │
                                                  │ (SUBQUERY AND SUBQUERY│
                                                  │           ))          │
                                                  │                       │
                                                  │         5 rows        │
                                                  │        (0.00s)        │
                                                  └───────────┬───────────┘
                                                  ┌───────────┴───────────┐
                                                  │       HASH_JOIN       │
                                                  │    ────────────────   │
                                                  │    Join Type: MARK    │
                                                  │                       │
                                                  │      Conditions:      ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
                                                  │   data_file_id = #0   │                                                                                                                                         │
                                                  │                       │                                                                                                                                         │
                                                  │      65,740 rows      │                                                                                                                                         │
                                                  │        (0.00s)        │                                                                                                                                         │
                                                  └───────────┬───────────┘                                                                                                                                         │
                                                  ┌───────────┴───────────┐                                                                                                                             ┌───────────┴───────────┐
                                                  │       HASH_JOIN       │                                                                                                                             │       PROJECTION      │
                                                  │    ────────────────   │                                                                                                                             │    ────────────────   │
                                                  │    Join Type: MARK    │                                                                                                                             │           #0          │
                                                  │                       │                                                                                                                             │                       │
                                                  │      Conditions:      ├────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐            │                       │
                                                  │   data_file_id = #0   │                                                                                                                │            │                       │
                                                  │                       │                                                                                                                │            │                       │
                                                  │      65,740 rows      │                                                                                                                │            │         4 rows        │
                                                  │        (0.00s)        │                                                                                                                │            │        (0.00s)        │
                                                  └───────────┬───────────┘                                                                                                                │            └───────────┬───────────┘
                                                  ┌───────────┴───────────┐                                                                                                    ┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │       HASH_JOIN       │                                                                                                    │       PROJECTION      ││         FILTER        │
                                                  │    ────────────────   │                                                                                                    │    ────────────────   ││    ────────────────   │
                                                  │    Join Type: MARK    │                                                                                                    │           #0          ││  ((max_value IS NULL) │
                                                  │                       │                                                                                                    │                       ││  OR (min_value IS NULL│
                                                  │      Conditions:      │                                                                                                    │                       ││ ) OR (('8900_10000' >=│
                                                  │   data_file_id = #0   ├───────────────────────────────────────────────────────────────────────────────────────┐            │                       ││    min_value) AND (   │
                                                  │                       │                                                                                       │            │                       ││    '8900_10000' <=    │
                                                  │                       │                                                                                       │            │                       ││      max_value)))     │
                                                  │                       │                                                                                       │            │                       ││                       │
                                                  │      65,740 rows      │                                                                                       │            │      30,984 rows      ││         4 rows        │
                                                  │        (0.00s)        │                                                                                       │            │        (0.00s)        ││        (0.00s)        │
                                                  └───────────┬───────────┘                                                                                       │            └───────────┬───────────┘└───────────┬───────────┘
                                                  ┌───────────┴───────────┐                                                                           ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │       HASH_JOIN       │                                                                           │       PROJECTION      ││         FILTER        ││        CTE_SCAN       │
                                                  │    ────────────────   │                                                                           │    ────────────────   ││    ────────────────   ││    ────────────────   │
                                                  │    Join Type: MARK    │                                                                           │           #0          ││  ((max_value IS NULL) ││      CTE Index: 8     │
                                                  │                       │                                                                           │                       ││  OR (min_value IS NULL││                       │
                                                  │      Conditions:      │                                                                           │                       ││  ) OR ((1 >= TRY_CAST ││                       │
                                                  │   data_file_id = #0   │                                                                           │                       ││ (min_value AS BIGINT))││                       │
                                                  │                       ├──────────────────────────────────────────────────────────────┐            │                       ││   AND (1 <= TRY_CAST  ││                       │
                                                  │                       │                                                              │            │                       ││ (max_value AS BIGINT))││                       │
                                                  │                       │                                                              │            │                       ││           ))          ││                       │
                                                  │                       │                                                              │            │                       ││                       ││                       │
                                                  │      65,740 rows      │                                                              │            │         1 row         ││      30,984 rows      ││         5 rows        │
                                                  │        (0.00s)        │                                                              │            │        (0.00s)        ││        (0.00s)        ││        (0.00s)        │
                                                  └───────────┬───────────┘                                                              │            └───────────┬───────────┘└───────────┬───────────┘└───────────────────────┘
                                                  ┌───────────┴───────────┐                                                  ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │       HASH_JOIN       │                                                  │       PROJECTION      ││         FILTER        ││        CTE_SCAN       │
                                                  │    ────────────────   │                                                  │    ────────────────   ││    ────────────────   ││    ────────────────   │
                                                  │    Join Type: MARK    │                                                  │           #0          ││  ((max_value IS NULL) ││      CTE Index: 0     │
                                                  │                       │                                                  │                       ││  OR (min_value IS NULL││                       │
                                                  │      Conditions:      │                                                  │                       ││ ) OR (('9400_12700' >=││                       │
                                                  │   data_file_id = #0   ├─────────────────────────────────────┐            │                       ││    min_value) AND (   ││                       │
                                                  │                       │                                     │            │                       ││    '9400_12700' <=    ││                       │
                                                  │                       │                                     │            │                       ││      max_value)))     ││                       │
                                                  │                       │                                     │            │                       ││                       ││                       │
                                                  │      65,740 rows      │                                     │            │      33,910 rows      ││         1 row         ││      64,894 rows      │
                                                  │        (0.00s)        │                                     │            │        (0.00s)        ││        (0.00s)        ││        (0.00s)        │
                                                  └───────────┬───────────┘                                     │            └───────────┬───────────┘└───────────┬───────────┘└───────────────────────┘
                                                  ┌───────────┴───────────┐                         ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │       HASH_JOIN       │                         │       PROJECTION      ││         FILTER        ││        CTE_SCAN       │
                                                  │    ────────────────   │                         │    ────────────────   ││    ────────────────   ││    ────────────────   │
                                                  │    Join Type: LEFT    │                         │           #0          ││  ((max_value IS NULL) ││      CTE Index: 8     │
                                                  │                       │                         │                       ││  OR (min_value IS NULL││                       │
                                                  │      Conditions:      │                         │                       ││  ) OR ((0 >= TRY_CAST ││                       │
                                                  │     data_file_id =    │                         │                       ││ (min_value AS BIGINT))││                       │
                                                  │      data_file_id     ├────────────┐            │                       ││   AND (0 <= TRY_CAST  ││                       │
                                                  │                       │            │            │                       ││ (max_value AS BIGINT))││                       │
                                                  │                       │            │            │                       ││           ))          ││                       │
                                                  │                       │            │            │                       ││                       ││                       │
                                                  │      65,740 rows      │            │            │         1 row         ││      33,910 rows      ││         5 rows        │
                                                  │        (0.00s)        │            │            │        (0.00s)        ││        (0.00s)        ││        (0.00s)        │
                                                  └───────────┬───────────┘            │            └───────────┬───────────┘└───────────┬───────────┘└───────────────────────┘
                                                  ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │       TABLE_SCAN      ││       TABLE_SCAN      ││         FILTER        ││        CTE_SCAN       │
                                                  │    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   │
                                                  │         Table:        ││         Table:        ││  ((min_value IS NULL) ││      CTE Index: 0     │
                                                  │   ducklake_data_file  ││  ducklake_delete_file ││  OR (max_value IS NULL││                       │
                                                  │                       ││                       ││  ) OR ((max_value >   ││                       │
                                                  │         Type:         ││         Type:         ││   '9400_12699') AND   ││                       │
                                                  │    Sequential Scan    ││    Sequential Scan    ││     (min_value <      ││                       │
                                                  │                       ││                       ││    '9400_12701')))    ││                       │
                                                  │      Projections:     ││      Projections:     ││                       ││                       │
                                                  │      data_file_id     ││      data_file_id     ││                       ││                       │
                                                  │     begin_snapshot    ││          path         ││                       ││                       │
                                                  │          path         ││    path_is_relative   ││                       ││                       │
                                                  │    path_is_relative   ││    file_size_bytes    ││                       ││                       │
                                                  │    file_size_bytes    ││      footer_size      ││                       ││                       │
                                                  │      footer_size      ││                       ││                       ││                       │
                                                  │      row_id_start     ││        Filters:       ││                       ││                       │
                                                  │   partial_file_info   ││      table_id=155     ││                       ││                       │
                                                  │       mapping_id      ││  begin_snapshot<=310  ││                       ││                       │
                                                  │                       ││  (310 < end_snapshot) ││                       ││                       │
                                                  │        Filters:       ││                       ││                       ││                       │
                                                  │      table_id=155     ││                       ││                       ││                       │
                                                  │ ((310 < end_snapshot) ││                       ││                       ││                       │
                                                  │   OR (end_snapshot IS ││                       ││                       ││                       │
                                                  │         NULL))        ││                       ││                       ││                       │
                                                  │                       ││                       ││                       ││                       │
                                                  │      65,740 rows      ││         0 rows        ││         1 row         ││      64,894 rows      │
                                                  │        (0.01s)        ││        (0.00s)        ││        (0.00s)        ││        (0.00s)        │
                                                  └───────────────────────┘└───────────────────────┘└───────────┬───────────┘└───────────────────────┘
                                                                                                    ┌───────────┴───────────┐
                                                                                                    │        CTE_SCAN       │
                                                                                                    │    ────────────────   │
                                                                                                    │      CTE Index: 8     │
                                                                                                    │                       │
                                                                                                    │         5 rows        │
                                                                                                    │        (0.00s)        │
                                                                                                    └───────────────────────┘

WITH col_21_stats AS MATERIALIZED (
  SELECT
    data_file_id,
    max_value,
    min_value
  FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
  WHERE
    column_id = 21 AND table_id = 155
), col_19_stats AS MATERIALIZED (
  SELECT
    data_file_id,
    min_value,
    max_value
  FROM "__ducklake_metadata_ducklake"."main".ducklake_file_column_stats
  WHERE
    column_id = 19 AND table_id = 155
)
SELECT
  data.path,
  data.path_is_relative,
  data.file_size_bytes,
  data.footer_size,
  data.row_id_start,
  data.begin_snapshot,
  data.partial_file_info,
  data.mapping_id,
  del.path,
  del.path_is_relative,
  del.file_size_bytes,
  del.footer_size
FROM "__ducklake_metadata_ducklake"."main".ducklake_data_file AS data
LEFT JOIN (
  SELECT
    *
  FROM "__ducklake_metadata_ducklake"."main".ducklake_delete_file
  WHERE
    table_id = 155
    AND 310 >= begin_snapshot
    AND (
      310 < end_snapshot OR end_snapshot IS NULL
    )
) AS del
  USING (data_file_id)
WHERE
  data.table_id = 155
  AND 310 >= data.begin_snapshot
  AND (
    310 < data.end_snapshot OR data.end_snapshot IS NULL
  )
  AND (
    (
      (
        data_file_id IN (
          SELECT
            data_file_id
          FROM col_19_stats
          WHERE
            min_value IS NULL
            OR max_value IS NULL
            OR max_value > '9400_12699'
            AND min_value < '9400_12701'
        )
        AND data_file_id IN (
          SELECT
            data_file_id
          FROM col_21_stats
          WHERE
            max_value IS NULL
            OR min_value IS NULL
            OR 0 BETWEEN TRY_CAST(min_value AS BIGINT) AND TRY_CAST(max_value AS BIGINT)
        )
      )
    )
    OR (
      (
        data_file_id IN (
          SELECT
            data_file_id
          FROM col_19_stats
          WHERE
            max_value IS NULL
            OR min_value IS NULL
            OR '9400_12700' BETWEEN min_value AND max_value
        )
        AND data_file_id IN (
          SELECT
            data_file_id
          FROM col_21_stats
          WHERE
            max_value IS NULL
            OR min_value IS NULL
            OR 1 BETWEEN TRY_CAST(min_value AS BIGINT) AND TRY_CAST(max_value AS BIGINT)
        )
      )
    )
    OR (
      data_file_id IN (
        SELECT
          data_file_id
        FROM col_19_stats
        WHERE
          max_value IS NULL
          OR min_value IS NULL
          OR '8900_10000' BETWEEN min_value AND max_value
      )
    )
  )
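
For readers tracing the generated query above, the per-file check that each stats subquery applies can be modelled in a few lines of Python. This is only an illustrative sketch, not DuckLake code: the helper names and the five sample files are made up, and it assumes every file has a stats row for both columns (in the SQL, a file with no stats row would simply not match the `IN` subquery).

```python
from typing import Dict, Optional, Tuple

# (min_value, max_value) as stored in the stats table, i.e. as text or NULL.
Stats = Tuple[Optional[str], Optional[str]]


def try_cast_int(text: str) -> Optional[int]:
    """Rough stand-in for TRY_CAST(text AS BIGINT): None instead of an error."""
    try:
        return int(text)
    except ValueError:
        return None


def may_contain_str(stats: Stats, probe: str) -> bool:
    """probe BETWEEN min_value AND max_value, keeping files without stats."""
    lo, hi = stats
    if lo is None or hi is None:
        return True
    return lo <= probe <= hi


def may_contain_int(stats: Stats, probe: int) -> bool:
    """Same check for a BIGINT column whose stats are stored as text."""
    lo, hi = stats
    if lo is None or hi is None:
        return True
    lo_int, hi_int = try_cast_int(lo), try_cast_int(hi)
    if lo_int is None or hi_int is None:
        # In SQL the comparison is NULL here, so the row is not selected.
        return False
    return lo_int <= probe <= hi_int


def may_overlap_open_range(stats: Stats, lower: str, upper: str) -> bool:
    """max_value > lower AND min_value < upper, keeping files without stats."""
    lo, hi = stats
    if lo is None or hi is None:
        return True
    return hi > lower and lo < upper


def keep_file(col_19: Stats, col_21: Stats) -> bool:
    """The three OR branches of the generated WHERE clause."""
    return (
        (may_overlap_open_range(col_19, "9400_12699", "9400_12701")
         and may_contain_int(col_21, 0))
        or (may_contain_str(col_19, "9400_12700") and may_contain_int(col_21, 1))
        or may_contain_str(col_19, "8900_10000")
    )


# Hypothetical per-file statistics, keyed by data_file_id.
files: Dict[int, Dict[str, Stats]] = {
    1: {"col_19": ("9400_12700", "9400_12700"), "col_21": ("0", "0")},
    2: {"col_19": ("9400_12700", "9400_12700"), "col_21": ("5", "9")},
    3: {"col_19": ("8900_10000", "8900_10000"), "col_21": ("5", "9")},
    4: {"col_19": (None, None), "col_21": ("1", "1")},
    5: {"col_19": ("1000_1", "1000_2"), "col_21": ("0", "1")},
}

kept = [fid for fid, s in files.items() if keep_file(s["col_19"], s["col_21"])]
print(kept)  # [1, 3, 4]
```

Files 2 and 5 are pruned because no branch can match their ranges, while file 4 survives because missing statistics can never rule a file out.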

@pdet
Collaborator

pdet commented Sep 29, 2025

Hi @J-Meyers, thanks for the PR. I skimmed over the code, and it seems to go in the right direction. I'll have a deeper look later this week, probably tomorrow.

The CI failure is definitely unrelated and should now be fixed. Could you merge again?

Regarding the tests, unfortunately, AI-generated code/tests go against our guidelines, since these add a high burden on the team to review the code and to maintain it in the future.

Could you replace the test with isolated portions (each with the minimal number of columns and expressions needed to express the desired test)? Also, please break it down into multiple test files to make it more modular.

Thanks again!

@J-Meyers
Contributor Author

@pdet

Happy to write the tests if this seems like something that could get merged.

Are there any particular things you want to see tested?

The general things I was looking for:

  1. Basic non-complex pushdown
  2. Basic examples of unsatisfiable filters resulting in EMPTY_RESULT
  3. Unfilterable cases (I used LIKE '%pattern' as a stand-in for an expression that we can't handle): filtering should still happen under ANDs, but not under ORs
  4. ORs with multiple columns
  5. ORs where a branch contains an unsatisfiable condition: that branch is pruned, even when it contains other checks as well
  6. ANDs with an unsatisfiable condition, which are pruned completely
  7. Misc. mixing of multiple columns within OR and AND tree structures with = and !=
  8. Inequality comparisons within the misc. mixing
  9. Filters that are not unsatisfiable by their logic alone, but that match 0 files because of the file ranges
  10. Propagating X = 1, Y = X to Y = 1, nested and with multiple columns
  11. Same as 10, but with inequalities
  12. !=, >=, <=, <, >, IN, IS NULL, IS NOT NULL, nested and with multiple columns
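
To make items 2 and 10 concrete, here is a minimal Python sketch of propagating constants through column equalities inside a single AND and detecting when the result is unsatisfiable. It is an illustration only: `propagate_equalities` is a hypothetical helper, it handles nothing but `=` between columns and integer constants, and it says nothing about how the FilterCombiner is actually implemented.

```python
from typing import Dict, List, Optional, Tuple, Union

Operand = Union[str, int]  # str = column name, int = constant


def propagate_equalities(
    conjuncts: List[Tuple[Operand, Operand]]
) -> Optional[Dict[str, int]]:
    """Return {column: constant} implied by an AND of equalities.

    Returns None when the conjunction is unsatisfiable.
    """
    parent: Dict[str, str] = {}

    def find(col: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    # Pass 1: merge columns that are equated with each other (Y = X).
    for left, right in conjuncts:
        if isinstance(left, str) and isinstance(right, str):
            parent[find(left)] = find(right)

    # Pass 2: attach constants to each group of equal columns (X = 1).
    constants: Dict[str, int] = {}
    for left, right in conjuncts:
        if isinstance(left, int) and isinstance(right, int):
            if left != right:
                return None  # e.g. 1 = 2
            continue
        if isinstance(left, int):
            left, right = right, left  # normalise to column = constant
        if isinstance(right, str):
            continue  # column = column, handled in pass 1
        root = find(left)
        if constants.setdefault(root, right) != right:
            return None  # the same group is pinned to two different constants

    return {col: constants[find(col)] for col in list(parent) if find(col) in constants}


print(propagate_equalities([("x", 1), ("y", "x")]))            # {'y': 1, 'x': 1}
print(propagate_equalities([("x", 1), ("y", "x"), ("y", 2)]))  # None
```

The second call corresponds to a filter such as `x = 1 AND y = x AND y = 2`, which is the kind of case item 2 expects to end in EMPTY_RESULT.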

I will wait to merge with dev and push until the tests are written, to avoid running CI twice.

@pdet pdet requested a review from samansmink September 29, 2025 18:11
@pdet
Collaborator

pdet commented Sep 29, 2025

Hi @J-Meyers thanks for the PR, I skimmed over the code, and it seems to go in the right direction. I'll have a deeper look later this week, probably tomorrow.
The CI failure is definitely unrelated, and should now be fixed, could you merge again?
Regarding the tests, unfortunately, AI-generated code/tests go against our guidelines, since these add a high burden on the team to review the code and to maintain it in the future.
Could you replace the test with isolated portions (having a minimal number of columns and expressions to express the desired test). Also, break it down into multiple file tests to make it more modularized.
Thanks again!

@pdet

Happy to write the tests if this seems like something that could get merged.

Are there any particular things you want to see tested?

The general things I was looking for:

  1. The basic non complex pushdown
  2. Basic examples of unsatisfiable resulting in EMPTY_RESULT
  3. Unfilterable cases (I used LIKE '%pattern') AS a stand in for some expression that we can't handle, filtering should happen with ANDS, but not ORs
  4. ORs with multiple columns
  5. ORs with unsatisfiable condition branches having those branches pruned if they had other checks within them as well
  6. AND with unsatisfiable condition completely pruned
  7. Misc. Mixing of multiple columns within OR and AND tree structure with = and !=
  8. Inequality comparisons within the misc mixing
  9. Filters that, while not unsatisfiable by their logic alone, match 0 files because of the file ranges
  10. Propagating X = 1, Y=X to Y=1 conditions nested and with multiple columns
  11. Same as 10, but with inequalities
  12. !=, >=, <=, <, >, IN, IS NULL, IS NOT NULL nested and with multiple columns

I will wait to push the merge with dev until the tests are written, to avoid running CI twice.

From my point of view we should initially focus on supporting and testing the cases covered by the filter combiner, since that's also what iceberg and delta support, but I'm not super involved with the Complex/Dynamic-Filter, so I've asked @samansmink to review the PR and maybe he has a better answer to it :-)

@J-Meyers
Contributor Author

Added new smaller tests in separate files, these aren't as extensive as the others, but should cover the major cases.

I improved the logic for handling a complex filter that is followed by a dynamic filter or by more complex filters: it now keeps all the initial filters and only adds new ones if they are not identical to existing ones. (Filters that are functionally the same but not actually identical are not recognized as duplicates.)

Looking forward to a review.

@pdet
Collaborator

pdet commented Oct 1, 2025

Hi @J-Meyers, thanks for the effort in cleaning up the PR; it is indeed much more readable now. I'm not a File Reader Filter pushdown specialist, but I was a bit surprised by the complexity of the code compared to iceberg, where it seems that the FilterCombiner is doing all the heavy lifting. From your comment it seems that you are doing extra work that the FilterCombiner is not capable of? If so, what is that, and wouldn't the FilterCombiner itself be a better fit for that side of the implementation?

@J-Meyers
Contributor Author

J-Meyers commented Oct 1, 2025

Hi @J-Meyers, thanks for the effort in cleaning up the PR; it is indeed much more readable now. I'm not a File Reader Filter pushdown specialist, but I was a bit surprised by the complexity of the code compared to iceberg, where it seems that the FilterCombiner is doing all the heavy lifting. From your comment it seems that you are doing extra work that the FilterCombiner is not capable of? If so, what is that, and wouldn't the FilterCombiner itself be a better fit for that side of the implementation?

This is doing work that the FilterCombiner is not capable of; if you try any other scanner on these tests, it will fail. Currently the return type of the filter combiner is a TableFilterSet, which gives one set of restrictions per column, and the restrictions from all of the columns must then be ANDed together. This sort of filtering is appropriate when reading the metadata from the columns is quite expensive and expressing more complex logic is difficult; however, with ducklake we have the nice advantage that all of the metadata is in a database accessible by SQL, so we need not be restricted to expressing a single filter per column and ANDing them all together.

The TableFilterSet cannot express (X = 1 AND Y = 2) OR (X = 3 AND Y = 4) or even just X = 1 OR Y = 2, and it would be relatively difficult for another table scanner to evaluate this set of filters because it would have to reinvent a full expression tree evaluator, whereas with ducklake that's just what the metadata database is already good at.
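To make the limitation concrete, here is a minimal sketch of what pruning files against an OR/AND tree involves. It is not the code in this PR and not DuckDB's API: `ColumnRange`, `FileStats`, `Pred`, `Eq`, `And`, `Or`, and `MayMatch` are all hypothetical names, values are plain integers, and only equality leaves are handled. Each file carries an inclusive min/max range per column, and a file is skipped only when the statistics prove that no row can match.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-file statistics: an inclusive [min_value, max_value] range per column.
struct ColumnRange {
    int min_value;
    int max_value;
};
using FileStats = std::map<std::string, ColumnRange>;

// Hypothetical predicate tree: leaves are `column = value`, inner nodes are AND / OR.
struct Pred {
    enum class Kind { EQ, AND, OR };
    Kind kind = Kind::EQ;
    std::string column;         // used by EQ only
    int value = 0;              // used by EQ only
    std::vector<Pred> children; // used by AND / OR only
};

Pred Eq(std::string column, int value) {
    Pred p;
    p.column = std::move(column);
    p.value = value;
    return p;
}

Pred Combine(Pred::Kind kind, std::vector<Pred> children) {
    Pred p;
    p.kind = kind;
    p.children = std::move(children);
    return p;
}

Pred And(std::vector<Pred> children) {
    return Combine(Pred::Kind::AND, std::move(children));
}

Pred Or(std::vector<Pred> children) {
    return Combine(Pred::Kind::OR, std::move(children));
}

// Returns false only when the statistics prove that no row in the file can match.
bool MayMatch(const Pred &p, const FileStats &stats) {
    switch (p.kind) {
    case Pred::Kind::EQ: {
        const auto it = stats.find(p.column);
        if (it == stats.end()) {
            return true; // no statistics for this column: the file cannot be ruled out
        }
        return it->second.min_value <= p.value && p.value <= it->second.max_value;
    }
    case Pred::Kind::AND:
        // Every conjunct must be possible for the file to be worth reading.
        for (const auto &child : p.children) {
            if (!MayMatch(child, stats)) {
                return false;
            }
        }
        return true;
    case Pred::Kind::OR:
        // One possible branch is enough to keep the file.
        for (const auto &child : p.children) {
            if (MayMatch(child, stats)) {
                return true;
            }
        }
        return false;
    }
    return true;
}
```

Take a file whose statistics are X in [1, 1] and Y in [4, 4]. The tree for (X = 1 AND Y = 2) OR (X = 3 AND Y = 4) rules it out, because neither branch is possible. A per-column relaxation such as (X = 1 OR X = 3) AND (Y = 2 OR Y = 4), used here only as an illustration of a one-restriction-per-column form, keeps the file, since each column's range is individually compatible.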

I'll also say that specifically with iceberg lots of the logic is hidden in other places for when the filters are actually evaluated, not when they are pushed down, which in iceberg is just adding them to a TableFilterSet.

These other multifile list scanners like delta and iceberg don't actually support complex pushdown; instead, as their first step, they simplify the complex filters into a simple TableFilterSet that can then be evaluated, because evaluating arbitrary expressions is hard unless you already have an engine for it, and there are more significant costs to reading in the data.

Edit: The current test failure seems unrelated; it never runs anything along the multi_file_list pushdown codepaths. And that test passed on a different architecture.

Collaborator

@pdet pdet left a comment

Hi @J-Meyers, I think things generally look very good! Thanks again for all the effort!

I had some comments and clarification questions (that could potentially become code comments), but otherwise it looks great.

One thing I’m thinking: it’s probably worth running the DuckDB tests on this to improve correctness. A nice way of achieving this could be to have a CI run with the target_file_size limited to a very small size.

I think that is achievable by having an extra run at https://github.com/duckdb/ducklake/blob/main/.github/workflows/ConfigTests.yml with a new config file using something like CALL my_ducklake.set_option('target_file_size', 'something_sensible'); at https://github.com/duckdb/ducklake/blob/main/test/configs/attach_ducklake.json#L3

Otherwise, I'll leave it to @samansmink to do another review pass!

Comment on lines 91 to 100
//! Deferred filter evaluation state
mutable bool filters_evaluated = false;
//! Complex filters stored as vector of expressions for deferred evaluation
mutable vector<unique_ptr<Expression>> pending_complex_filters;
//! Dynamic filters stored as TableFilterSet for deferred evaluation
mutable TableFilterSet pending_dynamic_filters;
//! Column information for deferred filter evaluation
mutable vector<column_t> deferred_column_ids;
//! ClientContext for deferred filter evaluation
mutable ClientContext *deferred_context = nullptr;
Collaborator

Could we wrap this in a DeferredFilters struct/class, with a description of what a deferred filter is in this context? We could probably separate the files as well.

Collaborator

Also move methods related to them to the class.

Contributor Author

I created a struct with a Copy function. However, none of the other functions are currently associated with the class; I was trying to keep most of that within the .cpp file to avoid binary bloat. What functions do you want associated with the class?

Contributor Author

@J-Meyers J-Meyers left a comment

I addressed all the core code comments.

For the small files I ended up running with 1KB; that felt really small to me, but maybe even smaller is appropriate.

Comment on lines 488 to 502
// Check for duplicates while building the new filter list
vector<unique_ptr<Expression>> new_filters_to_add;

for (auto &filter : filters) {
    bool is_duplicate = false;
    for (auto &existing_filter : pending_complex_filters) {
        if (filter->Equals(*existing_filter)) {
            is_duplicate = true;
            break;
        }
    }
    if (!is_duplicate) {
        new_filters_to_add.push_back(filter->Copy());
    }
}
Contributor Author

@J-Meyers J-Meyers Oct 2, 2025

I'm not sure of the underlying cause, so I don't want to annotate it with something wrong, but ComplexFilterPushdown is called repeatedly for the same query. For example, take the query that is the first test in equality_propagation.test:

query I
SELECT COUNT(*) FROM ducklake.test_table WHERE id = 50 AND id = duplicate_id;
----
1

It is first called with filters:

(id = duplicate_id)
(id = 50)
(duplicate_id = 50)

But then again called with filter:

(id = duplicate_id)

This also happens in relatively simple cases, for example the first case of basic_cross_column_pushdown.test:

query II
EXPLAIN ANALYZE SELECT COUNT(*) FROM ducklake.test_table WHERE id = 50 OR category = 3;
----
analyzed_plan	<REGEX>:.*Total Files Read: 2.*

It is called twice with the same filters of:

((id = 50) OR (category = 3))

I wanted to handle whatever they push. In all of the ducklake tests I have not observed multiple calls producing new filters, but when running the join tests in the duckdb repo there do seem to be additional new filters. If the calls do produce different filters that need to be unified, my interpretation was that they are all additive, since any one individually must be valid. That is why I add any new ones to the same vector of expressions, which are all eventually ANDed together, so that we are maximally restrictive in the list that we provide.

Example that produces unique filters:
(from duckdb test/sql/join/inner/test_using_join.test)

statement ok
CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER);

statement ok
INSERT INTO t1 VALUES (1,2,3);

statement ok
CREATE TABLE t2 (a INTEGER, b INTEGER, c INTEGER);

statement ok
INSERT INTO t2 VALUES (1,2,3), (2,2,4), (1,3,4);

query III
SELECT t2.a, t2.b, t2.c FROM t1 JOIN t2 USING(a,b)
----
1	2	3

The first call supplies filter: (a <= 1)
The second call supplies filter: (b <= 2)

@J-Meyers
Contributor Author

J-Meyers commented Oct 2, 2025

This test failed but was unrelated prior to adding the small file stuff
For context:

SELECT (SELECT rowid FROM a LIMIT 1)

Running the test locally, I was able to replicate the error, though very inconsistently.
It's because the underlying test assumes that the rows will come in rowid order, which isn't necessarily true when running with ducklake. We try to get around that with rowsort, but that doesn't help in this case because the test requires the sortedness prior to the final output. I am adding this test to the exclusion list.

I had to exclude the read only tests from duckdb since we have to modify the database to set the target file size.

You can see that all tests passed once those were excluded here: https://github.com/duckdb/ducklake/actions/runs/18204520486/job/51831513956?pr=477

The current failure is a seemingly unrelated error that seems to happen spuriously (see the CI run that adds the != comparison), but passes on all subsequent test runs: https://github.com/duckdb/ducklake/actions/runs/18170768231/job/51724562395

@J-Meyers J-Meyers requested a review from pdet October 3, 2025 13:41
@pdet
Collaborator

pdet commented Oct 7, 2025

Hi @J-Meyers, thanks again for all the adjustments!

I had a quick internal chat about this PR, and indeed, as I had pointed out before, the work on generating the filters should be handled by the filter combiner rather than in DuckLake, since these filters can also be pushed down for Iceberg and Delta.

The idea would be to move that part of the code to DuckDB. We could perhaps make the filter combiner configurable in terms of which filters it should push down, depending on the scanner, or scanners could ignore filters they can't push down, although I believe the former is cleaner.

In DuckLake, we should keep the filter to query transformation and the small file CI.

@Mytherin, maybe you’d like to expand on this?
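As a rough illustration of what a "filter to query transformation" can look like, the sketch below renders a small predicate tree as a SQL condition over file statistics. Everything in it is an assumption made for the example: the `Pred` type, the `ToSql` and `FilePruningQuery` helpers, and a made-up flattened table `file_stats(file_id, <column>_min, <column>_max)`. DuckLake's real metadata layout and the code in this PR are not shown here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical predicate tree: leaves are `column = value`, inner nodes are AND / OR.
struct Pred {
    enum class Kind { EQ, AND, OR };
    Kind kind = Kind::EQ;
    std::string column;         // used by EQ only
    int value = 0;              // used by EQ only
    std::vector<Pred> children; // used by AND / OR only
};

Pred Eq(std::string column, int value) {
    Pred p;
    p.column = std::move(column);
    p.value = value;
    return p;
}

Pred Node(Pred::Kind kind, std::vector<Pred> children) {
    Pred p;
    p.kind = kind;
    p.children = std::move(children);
    return p;
}

// Renders the tree as a SQL condition on the hypothetical `<column>_min` and
// `<column>_max` statistics columns. Column names are concatenated directly,
// so real code would need proper identifier quoting.
std::string ToSql(const Pred &p) {
    if (p.kind == Pred::Kind::EQ) {
        const std::string v = std::to_string(p.value);
        return "(" + p.column + "_min <= " + v + " AND " + p.column + "_max >= " + v + ")";
    }
    // An empty AND is trivially true; an empty OR is trivially false.
    if (p.children.empty()) {
        return p.kind == Pred::Kind::AND ? "TRUE" : "FALSE";
    }
    const std::string op = p.kind == Pred::Kind::AND ? " AND " : " OR ";
    std::string sql = "(";
    for (std::size_t i = 0; i < p.children.size(); i++) {
        if (i > 0) {
            sql += op;
        }
        sql += ToSql(p.children[i]);
    }
    return sql + ")";
}

// Builds the query that lists the files that might contain matching rows.
std::string FilePruningQuery(const Pred &p) {
    return "SELECT file_id FROM file_stats WHERE " + ToSql(p);
}
```

The point of the design is that the metadata database evaluates the AND/OR structure itself, so the scanner never needs its own expression evaluator; it only has to translate each leaf into a range check.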

@J-Meyers
Contributor Author

J-Meyers commented Oct 7, 2025

What should the output type be?

Right now a filter can only apply to a single column, and a TableFilterSet can only express a series of restrictions on various single columns, all ANDed together. Returning an Expression therefore seems like it would make sense, but transforming an expression into filters is a lot of the work of the FilterCombiner, and that would somehow leave that transformation to the downstream users.

The FilterCombiner itself can really only represent equivalence sets at a single level right now. How would it work? Should I still recursively create FilterCombiners?

@samansmink samansmink left a comment

Really cool work @J-Meyers!

I took a look here and my general impression is that this is a high quality PR adding a desired feature. I have a bit of a worry though, and that is that this PR is quite complex. It's quite dense to review and making changes to or debugging this code in the future would require significant effort from our side. This fact combined with what @pdet stated about wanting to support this generically for DuckDB to also serve the delta and iceberg extensions, makes me slightly hesitant here.

My recommendation would be to either:

  • split this up into more easily reviewable chunks, to ensure this is more manageable to maintain and change once the duckdb team implements more of this generically on the duckdb side
  • immediately switch to implementing the complex parts of this PR on the DuckDB side, potentially with some guidance from our end
