
Add wide-schema benchmark suite for measuring per-file metadata overhead#21970

Open
adriangb wants to merge 1 commit into apache:main from pydantic:wide-schema-bench

Conversation

@adriangb
Contributor

@adriangb adriangb commented May 1, 2026

Which issue does this PR close?

#21968

Rationale for this change

Adds a benchmark suite that isolates the wide-schema scan overhead in selective Parquet queries: the regime where most of the work is loading footers and column-chunk metadata rather than reading row data, and where that cost scales with the number of column chunks in the dataset rather than with the number of columns the query references. Existing benchmarks don't exercise this shape: most TPC-H/ClickBench queries either touch many columns or filter heavily enough that scan work dominates. We need a focused benchmark so this kind of regression is measurable in CI and so optimizations to the wide-schema scan path can be validated.

What changes are included in this PR?

A new sql_benchmarks suite, wide_schema/ (under benchmarks/sql_benchmarks/), with two subgroups selected via BENCH_SUBGROUP:

  • wide — runs against a synthesized wide dataset (1024 cols × 256 files × 50 k rows, ~225 MB). This is the actual workload.
  • narrow — runs the same SQL against an 8-col version of the same dataset (same row count, file count, per-file row-group shape, ~110 MB). This subgroup exists only as a baseline for the wide subgroup; reading its numbers in isolation tells you very little. The per-query wide-vs-narrow ratio is what isolates the schema-width cost.

All 4 queries run on both subgroups so every wide number has a directly comparable narrow baseline.
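Since narrow numbers are only meaningful relative to wide ones, the metric of interest is the per-query ratio. A trivial sketch of how that ratio is read off the results, using the Q01 medians reported under "Are these changes tested?" below (the helper name is illustrative, not part of the suite):

```rust
// Illustrative helper: the wide-vs-narrow ratio is just a division of
// the two per-query medians; the benchmark suite itself does not ship
// this function.
fn slowdown(wide_ms: f64, narrow_ms: f64) -> f64 {
    wide_ms / narrow_ms
}

fn main() {
    // Q01 medians from the Criterion table: 142.7 ms wide, 79.3 ms narrow.
    let ratio = slowdown(142.7, 79.3);
    assert!((ratio - 1.80).abs() < 0.01); // ~1.80x, the schema-width cost for Q01
}
```

(Other rows of the table won't reproduce exactly from the rounded medians shown, so only Q01 is checked here.)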

A new binary, gen_wide_data (in benchmarks/src/bin/), synthesizes both datasets in ~60 s with no external data dependency. The 8-column base schema is generic (id, value, count, ts, category, flag, status, text) and carries deterministic synthetic data; the suffix-renamed copies (id_2, id_3, …, id_128, etc.) are zero-filled. Two design notes:

  • The base columns sit at the end of the schema (positions 1017–1024), with the zero-padded suffix copies before them. Column lookup for the filter / project columns has to traverse past all the padding, exercising any per-column-position cost in the scanner / planner.
  • Padding is zero-filled rather than all-null because the parquet reader can shortcut on all-null statistics — the measured slowdown is ~35 % larger with zero padding than with all-null padding.
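As a rough illustration of the first design note (column names and exact ordering here are assumptions based on the description above; gen_wide_data may differ in detail), the 1024-column schema is 8 base columns × 128 suffix copies, with the 1016 padding columns first and the live base columns last:

```rust
// Sketch of the wide-schema column layout: zero-filled suffix copies
// (id_2 .. text_128) come first, the 8 live base columns sit at the
// end of the schema, positions 1017..=1024 (1-based).
fn wide_schema_columns() -> Vec<String> {
    let base = ["id", "value", "count", "ts", "category", "flag", "status", "text"];
    let mut cols = Vec::with_capacity(1024);
    for copy in 2..=128 {
        for name in &base {
            cols.push(format!("{name}_{copy}")); // zero-filled padding column
        }
    }
    for name in &base {
        cols.push((*name).to_string()); // live data, at the end of the schema
    }
    cols
}

fn main() {
    let cols = wide_schema_columns();
    assert_eq!(cols.len(), 1024);
    assert_eq!(cols[0], "id_2");  // padding first
    assert_eq!(cols[1016], "id"); // base columns start at position 1017 (1-based)
    assert_eq!(cols[1023], "text");
}
```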

Query coverage:

  • Q01 — filter + project + ORDER BY + LIMIT (TopK shortcut)
  • Q02 — project 1 column with a tight filter and LIMIT 1
  • Q03 — tight filter + small projection, no sort
  • Q04 — two low-cardinality string filters + a non-stat-prunable modulo predicate for tight selectivity, project two columns, no LIMIT or ORDER BY
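For Q04, "non-stat-prunable" means the predicate's matches are spread across the whole value range, so row-group min/max statistics can never exclude a row group. A sketch with a hypothetical modulo predicate, `id % 20000 = 0`, over one 50 k-row file (the suite's actual predicate and constants may differ; this one lands in the same ballpark as the tight selectivity described above):

```rust
// Rows matching a modulo predicate over a file of `rows` sequential ids.
fn matching_ids(rows: u64, modulus: u64) -> Vec<u64> {
    (0..rows).filter(|id| id % modulus == 0).collect()
}

fn main() {
    let matches = matching_ids(50_000, 20_000);
    // Tight selectivity: 3 rows out of 50k (~0.006 %).
    assert_eq!(matches.len(), 3);
    // The matches span the file's full id range (0 .. 40,000), so every
    // row group's min/max overlaps the match set: statistics-based
    // pruning cannot skip anything, and the per-file metadata cost
    // cannot be hidden by pruning.
    assert_eq!(*matches.first().unwrap(), 0);
    assert_eq!(*matches.last().unwrap(), 40_000);
}
```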

bench.sh additions:

```shell
./benchmarks/bench.sh data wide_schema    # synthesizes both subgroups' datasets, ~60 s, ~335 MB total
./benchmarks/bench.sh run  wide_schema    # runs both 'wide' and 'narrow' subgroups
```

Are these changes tested?

Yes — measurements on an M-series Mac (12-way parallel scan, hot OS cache).

Criterion (3 s warmup, 10 samples, median):

| Query | narrow | wide | slowdown |
|---|---|---|---|
| Q01 (TopK shortcut: ORDER BY + LIMIT) | 79.3 ms | 142.7 ms | 1.80× |
| Q02 (project 1 col, LIMIT 1) | 2.1 ms | 9.7 ms | 4.70× |
| Q03 (filter+project, no sort) | 3.2 ms | 12.7 ms | 4.01× |
| Q04 (string filter + tight selectivity, no LIMIT) | 36.3 ms | 117.8 ms | 3.25× |

Cold-start datafusion-cli (Q04 shape, median of 3):

| narrow | wide | slowdown |
|---|---|---|
| 70 ms | 1160 ms | ~17× |

EXPLAIN ANALYZE phase deltas (Q04, cumulative across 12 scan tasks):

| Phase | narrow | wide | Δ |
|---|---|---|---|
| metadata_load_time | 7.0 ms | 843.9 ms | 120× |
| time_elapsed_opening | 65.7 ms | 77.9 ms | 1.2× |
| time_elapsed_processing | 338.0 ms | 875.7 ms | 2.6× |
| bloom_filter_eval_time | 2.0 ms | 2.3 ms | flat ✓ |
| statistics_eval_time | 12.9 ms | 13.5 ms | flat ✓ |

Same qualitative shape: metadata_load_time and per-file setup scale with column-chunk count; predicate-evaluation phases stay flat regardless of schema width.
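The ~120× metadata_load_time delta lines up with simple column-chunk arithmetic: Parquet footer metadata carries one entry per column chunk, i.e. columns × row groups per file. A back-of-envelope sketch, assuming one row group per file (the actual per-file row-group shape is shared between the two subgroups, so the ratio is what matters):

```rust
// Column chunks per dataset, assuming one row group per file.
fn column_chunks(files: u64, cols_per_file: u64) -> u64 {
    files * cols_per_file
}

fn main() {
    let wide = column_chunks(256, 1024);  // 262,144 column chunks
    let narrow = column_chunks(256, 8);   //   2,048 column chunks
    assert_eq!(wide, 262_144);
    assert_eq!(narrow, 2_048);
    // 128x more chunks to decode per scan: the same order of magnitude
    // as the ~120x metadata_load_time delta measured above.
    assert_eq!(wide / narrow, 128);
}
```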

`cargo fmt --all` and `cargo clippy --bin gen_wide_data --all-features -- -D warnings` are clean.

Are there any user-facing changes?

No public API changes. New benchmark suite + new utility binary under benchmarks/.

@adriangb force-pushed the wide-schema-bench branch 9 times, most recently from d996aee to 147617d on May 1, 2026 at 15:42
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and that cost scales with the number of column chunks in
the dataset rather than with the number of columns the query touches.

The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):

  - wide:   1024 cols x 256 files x 50k rows (~225 MB) — the workload
  - narrow:    8 cols x 256 files x 50k rows (~110 MB) — internal
              baseline, only meaningful as a comparison point

Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.

A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; copies 2..N
from the suffix-renamed widening are zero-filled (zero rather than
null since reader-side null-array shortcuts mute the slowdown by
~35 %).

Query coverage:

  - Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
  - Q02: project 1 column with tight filter + LIMIT 1
  - Q03: tight filter + small projection, no sort
  - Q04: two low-cardinality string filters + a non-stat-prunable
         modulo predicate for tight selectivity (~0.005 % match rate),
         project two columns, no LIMIT or ORDER BY

For Q04 specifically, cold-start datafusion-cli shows ~15x slowdown
wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x
while bloom_filter_eval_time and statistics_eval_time stay flat.

bench.sh adds:
  - data wide_schema:  synthesizes both wide and narrow datasets
  - run wide_schema:   runs the wide subgroup, then the narrow
                       baseline subgroup, for query-by-query comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb force-pushed the wide-schema-bench branch from 147617d to 63eb746 on May 1, 2026 at 15:53
@adriangb marked this pull request as ready for review on May 1, 2026 at 15:53
@adriangb requested a review from martin-g on May 1, 2026 at 16:13
@adriangb
Contributor Author

adriangb commented May 1, 2026

@martin-g mind reviewing this change? cc @Omega359 for use of the new sql benchmark runners
