Skip to content

Queries over Iceberg don't use Iceberg metadata #67158

@mixermt

Description

@mixermt

We have a large table (over 250TB) partitioned by hour (ts column).
The ts column contains only hours and has full iceberg statistics over it, which have been verified in Iceberg metadata.

However, when we perform a simple query to retrieve the maximum and minimum values of the ts column, I observe that the query profile shows a full scan of the ts column instead of checking the Iceberg metadata for the statistics.

We have tried to apply following parameters but still full scan performed

enable_rewrite_simple_agg_to_hdfs_scan=true,
enable_rewrite_simple_agg_to_meta_scan=true

Steps to reproduce the behavior (Required)

SELECT
    /*+ SET_VAR(
    enable_rewrite_simple_agg_to_hdfs_scan=true,
    enable_rewrite_simple_agg_to_meta_scan = true)
      */
    min(ts), max(ts)
FROM our_schema.iceberg_table;

Expected behavior (Required)

Queries over Iceberg should use Iceberg metadata to avoid costly scan of the data

Real behavior (Required)

Result in 1TB scan and execution time of ~4m on our, ~640 cores cluster.

StarRocks version (Required)

4.0.2-1f1aa9c

Query Profile attached:

...
on-default:
                 - size: 567
                 - columns: {..., ts=full, ...}
...
- HdfsIOMetrics: 
                 - TotalBytesRead: 967.851 GB
...

max_min_iceberg_query_ profile.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions