Skip to content

Conversation

J-Meyers
Copy link
Contributor

@J-Meyers J-Meyers commented Oct 14, 2025

This modifies the GetFilesForTable and DuckLakeMultiFileList to first generate CTEs for each of the specific column stats.

This does not change the actual filters that are being run, it only changes the shape of the query. It includes logic to explicitly materialize the CTE if there are multiple references, and explicitly says not materialized when it is only 1 rather than leaving it up to the optimizer. Since it is currently run on a single TableFilter the reference count is always 1 and they are never marked as materialized.

Part of the continuation of breaking up #477 into more reviewable chunks as suggested

This makes a material difference on query time only when the same column is referenced multiple times, as can happen with complex filters (which are not yet processed) because in ducklakes with lots of table with lots of columns a very significant portion of the total time is just looking up the column stats, not comparing them, so doing that multiple times really hurts query execution time.

Sample user query:

SELECT COUNT(*) FROM ducklake.filter_pushdown WHERE i IN (500, 600, 700);

Sample generated query:

WITH col_2_stats AS NOT MATERIALIZED (
  SELECT data_file_id, max_value, min_value
  FROM {METADATA_CATALOG}.ducklake_file_column_stats
  WHERE column_id = 2 AND table_id = 1
)

SELECT data.path, data.path_is_relative, data.file_size_bytes, data.footer_size, data.row_id_start, data.begin_snapshot, data.partial_file_info, data.mapping_id, del.path, del.path_is_relative, del.file_size_bytes, del.footer_size
FROM {METADATA_CATALOG}.ducklake_data_file data
LEFT JOIN (
    SELECT *
    FROM {METADATA_CATALOG}.ducklake_delete_file
    WHERE table_id=1  AND 4 >= begin_snapshot
          AND (4 < end_snapshot OR end_snapshot IS NULL)
    ) del USING (data_file_id)
WHERE data.table_id=1 AND 4 >= data.begin_snapshot AND (4 < data.end_snapshot OR data.end_snapshot IS NULL)

AND data_file_id IN (SELECT data_file_id FROM col_2_stats WHERE max_value IS NULL OR min_value IS NULL OR (500 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER) OR 600 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER) OR 700 BETWEEN TRY_CAST(min_value AS INTEGER) AND TRY_CAST(max_value AS INTEGER)))

Copy link
Contributor

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

The goal of the metadata manager is to keep SQL generation located within it as much as possible. The idea is that we can have separate metadata managers (e.g. the PostgresMetadataManager) that do their own SQL generation for e.g. a different dialect (or ones that don't use SQL at all).

Can we rework this PR to keep the SQL generation contained within the metadata manager instead of "leaking" this into the file list?

@J-Meyers
Copy link
Contributor Author

I'm a bit uncertain about what you want here. Should all of the core logic of converting a TableFilterSet to SQL (and the associated CTEs to SQL) be moved into the metadata manager, including all the logic of GenerateFilterPushdown?

If so, I'll wait for the other simple PR to get in otherwise I'll have merge conflicts moving all 300 lines.

@Mytherin
Copy link
Contributor

Yes, ideally no SQL gets generated outside of the metadata manager, and we just move structs describing the operations into the metadata manager. This rule is broken somewhat through the filter already - ideally that is not a SQL string but a set of structs as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants