
Add wide-schema benchmark suite for measuring per-file metadata overhead#21970

Open
adriangb wants to merge 1 commit into apache:main from pydantic:wide-schema-bench

Conversation

@adriangb
Contributor

@adriangb adriangb commented May 1, 2026

Which issue does this PR close?

#21968

Rationale for this change

Adds a benchmark suite that isolates the wide-schema scan overhead in selective Parquet queries: the regime where most of the work is loading footers and column-chunk metadata rather than reading row data, and where that cost scales with the number of column chunks in the dataset rather than with the number of columns the query references. Existing benchmarks don't exercise this shape: most TPC-H/ClickBench queries either touch many columns or filter heavily enough that scan work dominates. We need a focused benchmark so this kind of regression is measurable in CI and so optimizations to the wide-schema scan path can be validated.

What changes are included in this PR?

A new sql_benchmarks suite, wide_schema/ (under benchmarks/sql_benchmarks/), with two subgroups selected via BENCH_SUBGROUP:

  • wide — runs against a synthesized wide dataset (1024 cols × 256 files × 50 k rows, ~225 MB). This is the actual workload.
  • narrow — runs the same SQL against an 8-col version of the same dataset (same row count, file count, per-file row-group shape, ~110 MB). This subgroup exists only as a baseline for the wide subgroup; reading its numbers in isolation tells you very little. The per-query wide-vs-narrow ratio is what isolates the schema-width cost.

All 4 queries run on both subgroups so every wide number has a directly comparable narrow baseline.
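Since narrow numbers are only meaningful relative to wide ones, the metric of interest is the per-query ratio. A trivial sketch of how that ratio is read off the results, using the Q01 medians reported under "Are these changes tested?" below (the helper name is illustrative, not part of the suite):

```rust
// Illustrative helper: the wide-vs-narrow ratio is just a division of
// the two per-query medians; the benchmark suite itself does not ship
// this function.
fn slowdown(wide_ms: f64, narrow_ms: f64) -> f64 {
    wide_ms / narrow_ms
}

fn main() {
    // Q01 medians from the Criterion table: 142.7 ms wide, 79.3 ms narrow.
    let ratio = slowdown(142.7, 79.3);
    assert!((ratio - 1.80).abs() < 0.01); // ~1.80x, the schema-width cost for Q01
}
```

(Other rows of the table won't reproduce exactly from the rounded medians shown, so only Q01 is checked here.)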

A new binary, gen_wide_data (in benchmarks/src/bin/), synthesizes both datasets in ~60 s with no external data dependency. The 8-column base schema is generic (id, value, count, ts, category, flag, status, text) and carries deterministic synthetic data; the suffix-renamed copies (id_2, id_3, …, id_128, etc.) are zero-filled. Two design notes:

  • The base columns sit at the end of the schema (positions 1017–1024), with the zero-padded suffix copies before them. Column lookup for the filter / project columns has to traverse past all the padding, exercising any per-column-position cost in the scanner / planner.
  • Padding is zero-filled rather than all-null because the parquet reader can shortcut on all-null statistics — the measured slowdown is ~35 % larger with zero padding than with all-null padding.
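As a rough illustration of the first design note (column names and exact ordering here are assumptions based on the description above; gen_wide_data may differ in detail), the 1024-column schema is 8 base columns × 128 suffix copies, with the 1016 padding columns first and the live base columns last:

```rust
// Sketch of the wide-schema column layout: zero-filled suffix copies
// (id_2 .. text_128) come first, the 8 live base columns sit at the
// end of the schema, positions 1017..=1024 (1-based).
fn wide_schema_columns() -> Vec<String> {
    let base = ["id", "value", "count", "ts", "category", "flag", "status", "text"];
    let mut cols = Vec::with_capacity(1024);
    for copy in 2..=128 {
        for name in &base {
            cols.push(format!("{name}_{copy}")); // zero-filled padding column
        }
    }
    for name in &base {
        cols.push((*name).to_string()); // live data, at the end of the schema
    }
    cols
}

fn main() {
    let cols = wide_schema_columns();
    assert_eq!(cols.len(), 1024);
    assert_eq!(cols[0], "id_2");  // padding first
    assert_eq!(cols[1016], "id"); // base columns start at position 1017 (1-based)
    assert_eq!(cols[1023], "text");
}
```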

Query coverage:

  • Q01 — filter + project + ORDER BY + LIMIT (TopK shortcut)
  • Q02 — project 1 column with a tight filter and LIMIT 1
  • Q03 — tight filter + small projection, no sort
  • Q04 — two low-cardinality string filters + a non-stat-prunable modulo predicate for tight selectivity, project two columns, no LIMIT or ORDER BY
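For Q04, "non-stat-prunable" means the predicate's matches are spread across the whole value range, so row-group min/max statistics can never exclude a row group. A sketch with a hypothetical modulo predicate, `id % 20000 = 0`, over one 50 k-row file (the suite's actual predicate and constants may differ; this one lands in the same ballpark as the tight selectivity described above):

```rust
// Rows matching a modulo predicate over a file of `rows` sequential ids.
fn matching_ids(rows: u64, modulus: u64) -> Vec<u64> {
    (0..rows).filter(|id| id % modulus == 0).collect()
}

fn main() {
    let matches = matching_ids(50_000, 20_000);
    // Tight selectivity: 3 rows out of 50k (~0.006 %).
    assert_eq!(matches.len(), 3);
    // The matches span the file's full id range (0 .. 40,000), so every
    // row group's min/max overlaps the match set: statistics-based
    // pruning cannot skip anything, and the per-file metadata cost
    // cannot be hidden by pruning.
    assert_eq!(*matches.first().unwrap(), 0);
    assert_eq!(*matches.last().unwrap(), 40_000);
}
```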

bench.sh additions:

```shell
./benchmarks/bench.sh data wide_schema    # synthesizes both subgroups' datasets, ~60 s, ~335 MB total
./benchmarks/bench.sh run  wide_schema    # runs both 'wide' and 'narrow' subgroups
```

Are these changes tested?

Yes — measurements on an M-series Mac (12-way parallel scan, hot OS cache).

Criterion (3 s warmup, 10 samples, median):

| Query | narrow | wide | slowdown |
|---|---|---|---|
| Q01 (TopK shortcut: ORDER BY + LIMIT) | 79.3 ms | 142.7 ms | 1.80× |
| Q02 (project 1 col, LIMIT 1) | 2.1 ms | 9.7 ms | 4.70× |
| Q03 (filter+project, no sort) | 3.2 ms | 12.7 ms | 4.01× |
| Q04 (string filter + tight selectivity, no LIMIT) | 36.3 ms | 117.8 ms | 3.25× |

Cold-start datafusion-cli (Q04 shape, median of 3):

| narrow | wide | slowdown |
|---|---|---|
| 70 ms | 1160 ms | ~17× |

EXPLAIN ANALYZE phase deltas (Q04, cumulative across 12 scan tasks):

| Phase | narrow | wide | Δ |
|---|---|---|---|
| metadata_load_time | 7.0 ms | 843.9 ms | 120× |
| time_elapsed_opening | 65.7 ms | 77.9 ms | 1.2× |
| time_elapsed_processing | 338.0 ms | 875.7 ms | 2.6× |
| bloom_filter_eval_time | 2.0 ms | 2.3 ms | flat ✓ |
| statistics_eval_time | 12.9 ms | 13.5 ms | flat ✓ |

Same qualitative shape: metadata_load_time and per-file setup scale with column-chunk count; predicate-evaluation phases stay flat regardless of schema width.
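The ~120× metadata_load_time delta lines up with simple column-chunk arithmetic: Parquet footer metadata carries one entry per column chunk, i.e. columns × row groups per file. A back-of-envelope sketch, assuming one row group per file (the actual per-file row-group shape is shared between the two subgroups, so the ratio is what matters):

```rust
// Column chunks per dataset, assuming one row group per file.
fn column_chunks(files: u64, cols_per_file: u64) -> u64 {
    files * cols_per_file
}

fn main() {
    let wide = column_chunks(256, 1024);  // 262,144 column chunks
    let narrow = column_chunks(256, 8);   //   2,048 column chunks
    assert_eq!(wide, 262_144);
    assert_eq!(narrow, 2_048);
    // 128x more chunks to decode per scan: the same order of magnitude
    // as the ~120x metadata_load_time delta measured above.
    assert_eq!(wide / narrow, 128);
}
```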

`cargo fmt --all` and `cargo clippy --bin gen_wide_data --all-features -- -D warnings` are clean.

Are there any user-facing changes?

No public API changes. New benchmark suite + new utility binary under benchmarks/.

@adriangb force-pushed the wide-schema-bench branch 9 times, most recently from d996aee to 147617d on May 1, 2026 at 15:42
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and that cost scales with the number of column chunks in
the dataset rather than with the number of columns the query touches.

The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):

  - wide:   1024 cols x 256 files x 50k rows (~225 MB) — the workload
  - narrow:    8 cols x 256 files x 50k rows (~110 MB) — internal
              baseline, only meaningful as a comparison point

Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.

A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; copies 2..N
from the suffix-renamed widening are zero-filled (zero rather than
null since reader-side null-array shortcuts mute the slowdown by
~35 %).

Query coverage:

  - Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
  - Q02: project 1 column with tight filter + LIMIT 1
  - Q03: tight filter + small projection, no sort
  - Q04: two low-cardinality string filters + a non-stat-prunable
         modulo predicate for tight selectivity (~0.005 % match rate),
         project two columns, no LIMIT or ORDER BY

For Q04 specifically, cold-start datafusion-cli shows ~15x slowdown
wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x
while bloom_filter_eval_time and statistics_eval_time stay flat.

bench.sh adds:
  - data wide_schema:  synthesizes both wide and narrow datasets
  - run wide_schema:   runs the wide subgroup, then the narrow
                       baseline subgroup, for query-by-query comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb force-pushed the wide-schema-bench branch from 147617d to 63eb746 on May 1, 2026 at 15:53
@adriangb marked this pull request as ready for review on May 1, 2026 at 15:53
@adriangb requested a review from martin-g on May 1, 2026 at 16:13
@adriangb
Contributor Author

adriangb commented May 1, 2026

@martin-g mind reviewing this change? cc @Omega359 for use of the new sql benchmark runners
