Add benchmarks for queries against wide schema parquet files #21968

@adriangb

Description

There is some great work going on in arrow-rs (apache/arrow-rs#9722) by @HippoBaro to make wide schemas (schemas with many columns) more performant.

We have similarly wide data and spend a lot of time in our queries loading parquet metadata. I spent some time investigating this today and came to the conclusion that not only are wide schemas expensive to load, they also introduce overhead in the rest of the query. Below is a summary of the analysis (AI generated). This makes me think we need to tackle the problem from the DataFusion side as well, starting with a benchmark for wide-schema queries.
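
As a concrete starting point for such a benchmark, below is a minimal sketch of a data generator: the four columns the query touches plus a configurable number of padding columns, written with the arrow and parquet crates. The padding column names (pad_N), output file name, and row/column counts are placeholders, not the production schema.

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // 4 queried columns + padding to reach a wide schema (placeholder sizes).
        let pad_cols = 677;
        let rows: i64 = 10_000;

        let mut fields = vec![
            Field::new("start_timestamp", DataType::Int64, false),
            Field::new("trace_id", DataType::Utf8, false),
            Field::new("service_name", DataType::Utf8, false),
            Field::new("span_name", DataType::Utf8, false),
        ];
        for i in 0..pad_cols {
            fields.push(Field::new(format!("pad_{i}"), DataType::Int64, true));
        }
        let schema = Arc::new(Schema::new(fields));

        let mut columns: Vec<ArrayRef> = vec![
            Arc::new(Int64Array::from_iter_values(0..rows)),
            Arc::new(StringArray::from_iter_values(
                (0..rows).map(|i| format!("trace-{i}")),
            )),
            Arc::new(StringArray::from_iter_values((0..rows).map(|_| "api"))),
            Arc::new(StringArray::from_iter_values((0..rows).map(|_| "event"))),
        ];
        for _ in 0..pad_cols {
            columns.push(Arc::new(Int64Array::from_iter_values(0..rows)));
        }

        let batch = RecordBatch::try_new(schema.clone(), columns)?;
        let mut writer =
            ArrowWriter::try_new(File::create("wide.parquet")?, schema, None)?;
        writer.write(&batch)?;
        writer.close()?;
        Ok(())
    }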

TL;DR

A simple selective query over 339 Parquet files runs ~11× slower when the
files carry hundreds of columns the query never touches, even though every
unused column is correctly projected away and every unused row group is
correctly pruned.

metadata_load_time is the cleanest signal but only ~10% of the wall-clock
overhead. The rest sits in downstream per-file setup that scales with schema
width — most of it not explicitly bucketed by EXPLAIN ANALYZE.

Setup

  • Two copies of the same 339 Parquet files, same rows, same row groups.

  • wide: full production schema (681 unique schema paths, 163,926 column
    chunks across the dataset).

  • narrow: only the 4 columns the query references (4 schema paths, 1,356
    column chunks).

  • Engine: datafusion-cli 53.0.0, local disk, hot OS cache, 12-way scan
    parallelism.

  • Query:

    SELECT start_timestamp, trace_id
    FROM '<path>/*.parquet'
    WHERE service_name = 'api' AND span_name = 'event';

    Returns 1,287 rows out of ~12.1 M scanned. Pruning behaviour is identical
    on both datasets: 350→350 files matched on stats, 339→339 row groups
    matched on stats, 339→320 row groups after bloom-filter pruning. (A
    sketch reproducing this measurement through the DataFusion Rust API
    follows this list.)
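
For reference, the measurement can also be reproduced through the DataFusion Rust API instead of datafusion-cli. A minimal sketch, assuming a placeholder table name and data path; with_target_partitions mirrors the 12-way scan parallelism above:

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        // Match the 12-way scan parallelism used in the CLI runs.
        let config = SessionConfig::new().with_target_partitions(12);
        let ctx = SessionContext::new_with_config(config);

        // Placeholder path: point at the wide or the narrow dataset.
        ctx.register_parquet("spans", "data/wide/", ParquetReadOptions::default())
            .await?;

        // EXPLAIN ANALYZE surfaces the DataSourceExec metrics quoted below.
        ctx.sql(
            "EXPLAIN ANALYZE \
             SELECT start_timestamp, trace_id \
             FROM spans \
             WHERE service_name = 'api' AND span_name = 'event'",
        )
        .await?
        .show()
        .await?;
        Ok(())
    }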

Wall time (3 hot runs each)

Dataset | run 1   | run 2   | run 3   | median
narrow  | 0.053 s | 0.046 s | 0.049 s | 0.049 s
wide    | 0.548 s | 0.556 s | 0.557 s | 0.557 s

EXPLAIN ANALYZE — DataSourceExec metrics

Sums across 12 scan tasks. Same output_bytes=1266.2 MB, same final row
count, identical pruning.

Phase                            | narrow  | wide    | Δ
metadata_load_time               | 8.4 ms  | 580 ms  | +572 ms
time_elapsed_opening             | 27 ms   | 161 ms  | +134 ms
time_elapsed_scanning_until_data | 308 ms  | 526 ms  | +218 ms
time_elapsed_processing          | 307 ms  | 737 ms  | +430 ms
time_elapsed_scanning_total      | 391 ms  | 755 ms  | +364 ms
bloom_filter_eval_time           | 349 ms  | 284 ms  | -65 ms
statistics_eval_time             | 4.6 ms  | 7.0 ms  | +2.4 ms
page_index_eval_time             | 13 µs   | 14 µs   | ~0
FilterExec elapsed_compute       | 17.7 ms | 17.3 ms | ~0

Where the cost lands

Wall-clock delta is +508 ms going narrow → wide. With ~12-way parallelism,
dividing the cumulative metrics by 12 gives a rough wall-clock estimate per
phase:

Phase                                              | Cumulative Δ | ~Wall Δ
metadata_load_time                                 | +572 ms      | ~48 ms
time_elapsed_opening                               | +134 ms      | ~11 ms
time_elapsed_processing (excl. metadata + opening) | ~+296 ms     | ~25 ms
Unattributed (serial setup, expression binding against wide schema, scan-task dispatch) | n/a | ~420 ms

Observations

  1. metadata_load_time scales roughly linearly with column-chunk count:
    8.4 ms / 1,356 chunks ≈ 6 µs on narrow, 580 ms / 163,926 ≈ 3.5 µs on
    wide. The per-chunk cost is layout-independent; what changes is how many
    chunks the footer describes. (A footer-decode microbenchmark sketch
    follows this list.)

  2. The wide schema keeps costing after metadata is loaded.
    time_elapsed_processing more than doubles and time_elapsed_scanning_until_data
    grows by 218 ms even though the projected columns are identical. That is
    per-column reader-state construction, schema-descriptor traversal, and
    predicate binding against a 681-element schema running per file before
    the first useful byte flows.

  3. Predicate evaluation phases are flat. bloom_filter_eval_time,
    statistics_eval_time, page_index_eval_time, FilterExec.elapsed_compute
    — none of them respond to schema width. Once row groups are selected, the
    unused columns do not participate.

  4. Most of the wall-clock gap is unattributed by EXPLAIN ANALYZE. The
    bucketed phases sum to ~85 ms wall; the observed delta is 508 ms wall.
    The remaining ~420 ms is serial coordinator/planner work that scales with
    schema width and is not broken out in current metrics. On object storage
    this is likely amplified further because every file open also pays a
    network round-trip proportional to footer size.
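
To separate footer decode from the rest of per-file setup, a microbenchmark along the following lines could sit alongside the query benchmark. This is a sketch against the parquet crate's ParquetMetaDataReader (plus the bytes crate); the file name is a placeholder, and the timer covers only the in-memory decode, not I/O:

    use std::time::Instant;

    use bytes::Bytes;
    use parquet::file::metadata::ParquetMetaDataReader;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Read the whole file up front so the timer covers only footer decode.
        let data = Bytes::from(std::fs::read("wide.parquet")?);

        let start = Instant::now();
        let md = ParquetMetaDataReader::new().parse_and_finish(&data)?;
        println!(
            "decoded {} row groups x {} leaf columns in {:?}",
            md.num_row_groups(),
            md.file_metadata().schema_descr().num_columns(),
            start.elapsed()
        );
        Ok(())
    }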

Conclusion

The narrow result (~49 ms) is roughly what this query should cost. Anything
above that is the price of carrying schema descriptions for columns the query
never touched through the planner and scanner.
