Filter CTE Generation #514

J-Meyers · 2025-10-14T04:32:13Z

This modifies the GetFilesForTable and DuckLakeMultiFileList to first generate CTEs for each of the specific column stats.

This does not change the actual filters that are being run, it only changes the shape of the query. It includes logic to explicitly materialize the CTE if there are multiple references, and explicitly says not materialized when it is only 1 rather than leaving it up to the optimizer. Since it is currently run on a single TableFilter the reference count is always 1 and they are never marked as materialized.

Part of the continuation of breaking up #477 into more reviewable chunks as suggested

This makes a material difference on query time only when the same column is referenced multiple times, as can happen with complex filters (which are not yet processed) because in ducklakes with lots of table with lots of columns a very significant portion of the total time is just looking up the column stats, not comparing them, so doing that multiple times really hurts query execution time.

Sample user query:

SELECT COUNT(*) FROM ducklake.filter_pushdown WHERE i IN (500, 600, 700);

Sample generated query:

WITH col_2_stats AS NOT MATERIALIZED (
  SELECT data_file_id, max_value, min_value
  FROM {METADATA_CATALOG}.ducklake_file_column_stats
  WHERE column_id = 2 AND table_id = 1
)

SELECT data.path, data.path_is_relative, data.file_size_bytes, data.footer_size, data.row_id_start, data.begin_snapshot, data.partial_file_info, data.mapping_id, del.path, del.path_is_relative, del.file_size_bytes, del.footer_size
FROM {METADATA_CATALOG}.ducklake_data_file data
LEFT JOIN (
    SELECT *
    FROM {METADATA_CATALOG}.ducklake_delete_file
    WHERE table_id=1  AND 4 >= begin_snapshot
          AND (4 < end_snapshot OR end_snapshot IS NULL)
    ) del USING (data_file_id)
WHERE data.table_id=1 AND 4 >= data.begin_snapshot AND (4 < data.end_snapshot OR data.end_snapshot IS NULL)

AND data_file_id IN (SELECT data_file_id FROM col_2_stats WHERE max_value IS NULL OR min_value IS NULL OR (500 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER) OR 600 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER) OR 700 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER)))

…tive filters

Mytherin

Thanks for the PR!

The goal of the metadata manager is to keep SQL generation located within it as much as possible. The idea is that we can have separate metadata managers (e.g. the PostgresMetadataManager) that do their own SQL generation for e.g. a different dialect (or ones that don't use SQL at all).

Can we rework this PR to keep the SQL generation contained within the metadata manager instead of "leaking" this into the file list?

J-Meyers · 2025-10-14T14:59:13Z

I'm a bit uncertain about what you want here. Should all of the core logic of converting a TableFilterSet to SQL (and the associated CTEs to SQL) be moved into the metadata manager, including all the logic of GenerateFilterPushdown?

If so, I'll wait for the other simple PR to get in otherwise I'll have merge conflicts moving all 300 lines.

Mytherin · 2025-10-14T15:01:03Z

Yes, ideally no SQL gets generated outside of the metadata manager, and we just move structs describing the operations into the metadata manager. This rule is broken somewhat through the filter already - ideally that is not a SQL string but a set of structs as well.

Logic for generating CTEs for where conditions, does not change effec…

fd6e931

…tive filters

Mytherin reviewed Oct 14, 2025

View reviewed changes

Mytherin added the Changes Requested label Oct 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Filter CTE Generation #514

Filter CTE Generation #514

J-Meyers commented Oct 14, 2025 •

edited

Loading

Uh oh!

Mytherin left a comment

Uh oh!

J-Meyers commented Oct 14, 2025

Uh oh!

Mytherin commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Filter CTE Generation #514

Are you sure you want to change the base?

Filter CTE Generation #514

Conversation

J-Meyers commented Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mytherin left a comment

Choose a reason for hiding this comment

Uh oh!

J-Meyers commented Oct 14, 2025

Uh oh!

Mytherin commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

J-Meyers commented Oct 14, 2025 •

edited

Loading