Add benchmarks for queries against wide schema parquet files #21968

@adriangb

Description

There is some great work going on in arrow-rs (apache/arrow-rs#9722) by @HippoBaro to make wide schemas (schemas with many columns) more performant.

We have similarly wide data and spend a lot of time in our queries loading parquet metadata. I spent some time investigating this today and came to the conclusion that not only are wide schemas expensive to load, they also introduce overhead in the rest of the query. Below is a summary of the analysis (AI generated). This makes me think we need to tackle the problem from the DataFusion side as well, starting with a benchmark for wide-schema queries.
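
As a concrete starting point for such a benchmark, below is a minimal sketch of a data generator: the four columns the query touches plus a configurable number of padding columns, written with the arrow and parquet crates. The padding column names (pad_N), output file name, and row/column counts are placeholders, not the production schema.

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // 4 queried columns + padding to reach a wide schema (placeholder sizes).
        let pad_cols = 677;
        let rows: i64 = 10_000;

        let mut fields = vec![
            Field::new("start_timestamp", DataType::Int64, false),
            Field::new("trace_id", DataType::Utf8, false),
            Field::new("service_name", DataType::Utf8, false),
            Field::new("span_name", DataType::Utf8, false),
        ];
        for i in 0..pad_cols {
            fields.push(Field::new(format!("pad_{i}"), DataType::Int64, true));
        }
        let schema = Arc::new(Schema::new(fields));

        let mut columns: Vec<ArrayRef> = vec![
            Arc::new(Int64Array::from_iter_values(0..rows)),
            Arc::new(StringArray::from_iter_values(
                (0..rows).map(|i| format!("trace-{i}")),
            )),
            Arc::new(StringArray::from_iter_values((0..rows).map(|_| "api"))),
            Arc::new(StringArray::from_iter_values((0..rows).map(|_| "event"))),
        ];
        for _ in 0..pad_cols {
            columns.push(Arc::new(Int64Array::from_iter_values(0..rows)));
        }

        let batch = RecordBatch::try_new(schema.clone(), columns)?;
        let mut writer =
            ArrowWriter::try_new(File::create("wide.parquet")?, schema, None)?;
        writer.write(&batch)?;
        writer.close()?;
        Ok(())
    }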

TL;DR

A simple selective query over 339 Parquet files runs ~11× slower when the
files carry hundreds of columns the query never touches, even though every
unused column is correctly projected away and every unused row group is
correctly pruned.

metadata_load_time is the cleanest signal but only ~10% of the wall-clock
overhead. The rest sits in downstream per-file setup that scales with schema
width — most of it not explicitly bucketed by EXPLAIN ANALYZE.

Setup

  • Two copies of the same 339 Parquet files, same rows, same row groups.

  • wide: full production schema (681 unique schema paths, 163,926 column
    chunks across the dataset).

  • narrow: only the 4 columns the query references (4 schema paths, 1,356
    column chunks).

  • Engine: datafusion-cli 53.0.0, local disk, hot OS cache, 12-way scan
    parallelism.

  • Query:

    SELECT start_timestamp, trace_id
    FROM '<path>/*.parquet'
    WHERE service_name = 'api' AND span_name = 'event';

    Returns 1,287 rows out of ~12.1 M scanned. Pruning behaviour is identical
    on both datasets: 350→350 files matched on stats, 339→339 row groups
    matched on stats, 339→320 row groups after bloom-filter pruning. (A
    sketch reproducing this measurement through the DataFusion Rust API
    follows this list.)
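
For reference, the measurement can also be reproduced through the DataFusion Rust API instead of datafusion-cli. A minimal sketch, assuming a placeholder table name and data path; with_target_partitions mirrors the 12-way scan parallelism above:

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        // Match the 12-way scan parallelism used in the CLI runs.
        let config = SessionConfig::new().with_target_partitions(12);
        let ctx = SessionContext::new_with_config(config);

        // Placeholder path: point at the wide or the narrow dataset.
        ctx.register_parquet("spans", "data/wide/", ParquetReadOptions::default())
            .await?;

        // EXPLAIN ANALYZE surfaces the DataSourceExec metrics quoted below.
        ctx.sql(
            "EXPLAIN ANALYZE \
             SELECT start_timestamp, trace_id \
             FROM spans \
             WHERE service_name = 'api' AND span_name = 'event'",
        )
        .await?
        .show()
        .await?;
        Ok(())
    }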

Wall time (3 hot runs each)

Dataset | run 1   | run 2   | run 3   | median
narrow  | 0.053 s | 0.046 s | 0.049 s | 0.049 s
wide    | 0.548 s | 0.556 s | 0.557 s | 0.557 s

EXPLAIN ANALYZE — DataSourceExec metrics

Sums across 12 scan tasks. Same output_bytes=1266.2 MB, same final row
count, identical pruning.

Phase                            | narrow  | wide    | Δ
metadata_load_time               | 8.4 ms  | 580 ms  | +572 ms
time_elapsed_opening             | 27 ms   | 161 ms  | +134 ms
time_elapsed_scanning_until_data | 308 ms  | 526 ms  | +218 ms
time_elapsed_processing          | 307 ms  | 737 ms  | +430 ms
time_elapsed_scanning_total      | 391 ms  | 755 ms  | +364 ms
bloom_filter_eval_time           | 349 ms  | 284 ms  | -65 ms
statistics_eval_time             | 4.6 ms  | 7.0 ms  | +2.4 ms
page_index_eval_time             | 13 µs   | 14 µs   | ~0
FilterExec elapsed_compute       | 17.7 ms | 17.3 ms | ~0

Where the cost lands

Wall-clock delta is +508 ms going narrow → wide. With ~12-way parallelism,
dividing the cumulative metrics by 12 gives a rough wall-clock estimate per
phase:

Phase                                              | Cumulative Δ | ~Wall Δ
metadata_load_time                                 | +572 ms      | ~48 ms
time_elapsed_opening                               | +134 ms      | ~11 ms
time_elapsed_processing (excl. metadata + opening) | ~+296 ms     | ~25 ms
Unattributed (serial setup, expression binding against wide schema, scan-task dispatch) | n/a | ~420 ms

Observations

  1. metadata_load_time scales roughly linearly with column-chunk count:
    8.4 ms / 1,356 chunks ≈ 6 µs on narrow, 580 ms / 163,926 ≈ 3.5 µs on
    wide. The per-chunk cost is layout-independent; what changes is how many
    chunks the footer describes. (A footer-decode microbenchmark sketch
    follows this list.)

  2. The wide schema keeps costing after metadata is loaded.
    time_elapsed_processing more than doubles and time_elapsed_scanning_until_data
    grows by 218 ms even though the projected columns are identical. That is
    per-column reader-state construction, schema-descriptor traversal, and
    predicate binding against a 681-element schema running per file before
    the first useful byte flows.

  3. Predicate evaluation phases are flat. bloom_filter_eval_time,
    statistics_eval_time, page_index_eval_time, FilterExec.elapsed_compute
    — none of them respond to schema width. Once row groups are selected, the
    unused columns do not participate.

  4. Most of the wall-clock gap is unattributed by EXPLAIN ANALYZE. The
    bucketed phases sum to ~85 ms wall; the observed delta is 508 ms wall.
    The remaining ~420 ms is serial coordinator/planner work that scales with
    schema width and is not broken out in current metrics. On object storage
    this is likely amplified further because every file open also pays a
    network round-trip proportional to footer size.
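
To separate footer decode from the rest of per-file setup, a microbenchmark along the following lines could sit alongside the query benchmark. This is a sketch against the parquet crate's ParquetMetaDataReader (plus the bytes crate); the file name is a placeholder, and the timer covers only the in-memory decode, not I/O:

    use std::time::Instant;

    use bytes::Bytes;
    use parquet::file::metadata::ParquetMetaDataReader;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Read the whole file up front so the timer covers only footer decode.
        let data = Bytes::from(std::fs::read("wide.parquet")?);

        let start = Instant::now();
        let md = ParquetMetaDataReader::new().parse_and_finish(&data)?;
        println!(
            "decoded {} row groups x {} leaf columns in {:?}",
            md.num_row_groups(),
            md.file_metadata().schema_descr().num_columns(),
            start.elapsed()
        );
        Ok(())
    }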

Conclusion

The narrow result (~49 ms) is roughly what this query should cost. Anything
above that is the price of carrying schema descriptions for columns the query
never touched through the planner and scanner.
