perf: aggregate min/max by joseph-isaacs · Pull Request #8061 · vortex-data/vortex

joseph-isaacs · 2026-05-22T12:28:52Z

Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.

Signed-off-by: Joe Isaacs joe.isaacs@live.co.uk

The all-valid primitive min/max path used `itertools::minmax_by` with a `total_compare` closure preceded by a NaN filter, which the autovectorizer could not lower to packed min/max, leaving a scalar cmov reduction.

Route the all-true mask case for integer ptypes through a plain reduction. Integers have no NaNs, so the NaN filter is unnecessary and LLVM vectorizes the loop (pmaxub/pmaxsw, and pcmpgtd-based blends for i32/i64). Floats keep the existing NaN-aware path.

Benchmarked over 1M elements: i32 all-valid ~2.93ms -> ~0.36ms, i64 ~3.02ms -> ~0.55ms.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
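As a rough illustration of the "plain reduction" described above, the following is a minimal sketch and not the code from this PR. The function name `plain_min_max` is hypothetical, and whether the loop actually vectorizes depends on the target features and compiler version.

```rust
/// Hypothetical helper (not the PR's code): min and max of an all-valid
/// integer slice using a branch-free running reduction.
///
/// There is no NaN filter and no comparator closure, so the loop body is
/// just two `Ord` operations per element, a shape the autovectorizer can
/// typically lower to packed min/max or compare-and-blend instructions.
fn plain_min_max<T: Copy + Ord>(values: &[T]) -> Option<(T, T)> {
    // An empty slice has no min/max.
    let (&first, rest) = values.split_first()?;
    let (mut min, mut max) = (first, first);
    for &v in rest {
        min = min.min(v);
        max = max.max(v);
    }
    Some((min, max))
}
```

Calling `plain_min_max(&[3i32, -1, 7])` yields `Some((-1, 7))`, and an empty slice yields `None`. Floats are deliberately excluded by the `Ord` bound, which mirrors the commit's decision to leave them on the NaN-aware path.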

codspeed-hq · 2026-05-22T12:41:06Z

Merging this PR will improve performance by 15.1%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 11 improved benchmarks
❌ 1 regressed benchmark
✅ 1239 untouched benchmarks
🆕 10 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Status | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | `max_i32_nulls_scattered` | N/A | 1.5 ms | N/A |
| 🆕 | Simulation | `max_f64` | N/A | 1.1 ms | N/A |
| 🆕 | Simulation | `max_i32` | N/A | 222.6 µs | N/A |
| 🆕 | Simulation | `max_i64` | N/A | 436.3 µs | N/A |
| 🆕 | Simulation | `sum_i32` | N/A | 222.3 µs | N/A |
| ❌ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 273.2 µs | 307.9 µs | -11.27% |
| 🆕 | Simulation | `max_i32_nulls_clustered` | N/A | 249.5 µs | N/A |
| 🆕 | Simulation | `sum_i32_nulls_clustered` | N/A | 236.1 µs | N/A |
| 🆕 | Simulation | `sum_i32_nulls_scattered` | N/A | 1.6 ms | N/A |
| 🆕 | Simulation | `sum_i64` | N/A | 600.4 µs | N/A |
| ⚡ | Simulation | `chunked_varbinview_opt_canonical_into[(1000, 10)]` | 225.1 µs | 187.5 µs | +20.07% |
| 🆕 | Simulation | `sum_u32` | N/A | 222.1 µs | N/A |
| ⚡ | Simulation | `encode_primitives[u8, (10000, 2)]` | 313.9 µs | 278.3 µs | +12.81% |
| ⚡ | Simulation | `encode_primitives[u8, (10000, 32)]` | 318.5 µs | 282.5 µs | +12.75% |
| ⚡ | Simulation | `encode_primitives[u8, (10000, 4)]` | 314.3 µs | 278.5 µs | +12.84% |
| ⚡ | Simulation | `encode_primitives[u8, (10000, 512)]` | 335.1 µs | 299.1 µs | +12.07% |
| ⚡ | Simulation | `encode_primitives[u8, (10000, 8)]` | 315.2 µs | 279.2 µs | +12.89% |
| ⚡ | Simulation | `for_compress_i32` | 753.4 µs | 444.2 µs | +69.61% |
| ⚡ | Simulation | `take_10k_contiguous` | 309.6 µs | 281 µs | +10.17% |
| ⚡ | Simulation | `new_alp_prim_test_between[f32, 16384]` | 118.6 µs | 103.9 µs | +14.14% |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/great-edison-jrGY0 (3d86cfe) with develop (dbfe521)

Keep a single all-valid bench for i32, i64, and f64 instead of the per-type all-valid/half-null pairs. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The all-valid integer sum did a per-element `checked_add`, whose overflow early-return branch blocked autovectorization, leaving a scalar loop.

Sum narrower-than-64-bit integers in chunks of 65536 into a widened 64-bit accumulator with no per-element check: a chunk of <64-bit values cannot overflow the 64-bit accumulator (2^16 * (2^32-1) < 2^64), so only one checked add per chunk is needed. This lets the inner loop vectorize to packed widening adds (paddq + unpck). 64-bit inputs keep the per-element checked path since a chunk of 64-bit values could itself overflow.

This observes overflow at chunk boundaries rather than per element, so a signed sum whose running total transiently leaves i64 range but ends in range now returns the true total instead of null. The final result is unchanged whenever the existing per-batch combine did not already overflow.

Benchmarked over 100k elements: sum_i32 ~19us, sum_u32 ~15us, sum_i64 ~51us.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
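The chunked widening accumulation described above can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the PR's implementation: the name `chunked_sum_i32`, the `CHUNK` constant, and the `init` parameter (standing in for a running total carried over from earlier batches) are all hypothetical.

```rust
/// Chunk length from the commit message: 2^16 elements.
const CHUNK: usize = 1 << 16;

/// Hypothetical sketch: add `values` to a running total `init`, returning
/// `None` if the total overflows `i64` at a chunk boundary.
fn chunked_sum_i32(init: i64, values: &[i32]) -> Option<i64> {
    let mut total = init;
    for chunk in values.chunks(CHUNK) {
        // No per-element overflow check is needed here: the magnitude of a
        // chunk sum is at most 2^16 * 2^31 = 2^47, well inside i64 range.
        let mut acc: i64 = 0;
        for &v in chunk {
            acc += i64::from(v);
        }
        // One checked add per chunk replaces the per-element `checked_add`.
        total = total.checked_add(acc)?;
    }
    Some(total)
}
```

With this shape, `chunked_sum_i32(i64::MAX, &[1])` returns `None`, while `chunked_sum_i32(i64::MAX, &[1, -1])` returns `Some(i64::MAX)`: the transient excursion cancels inside the chunk before the single checked add, which is the behavior change the commit message calls out.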

The with-nulls paths for primitive sum and min/max walked the values zipped with a per-element validity bit, which kept them scalar even though their all-valid counterparts already autovectorize to packed widening adds (sum) and packed min/max.

Drive both null paths from the validity mask's contiguous `[start, end)` runs (`Mask::slices`, computed once and cached). Each run is fully valid, so it reuses the existing vectorized all-valid reduction: sum folds each run through the chunked widening accumulator; min/max folds the native per-run integer min/max, with floats chaining the runs through the NaN-filtering reduction. Results are unchanged.

To support the fold, integer min/max now returns native `(min, max)` (`integer_min_max_raw`) which both the all-valid and run paths reduce before building the result scalar.

Benchmarked over 100k i32 elements (added nullable bench cases):
- clustered nulls: sum 106us -> 29us, max 159us -> 35us
- scattered ~50% nulls: no regression (sum 584us -> 530us, max 606us -> 593us)

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
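To make the run-driven null path concrete, here is a minimal sketch under stated assumptions. A plain `&[bool]` stands in for the validity mask, and `valid_runs` is a hypothetical substitute for what the commit message says `Mask::slices` provides; none of these functions are taken from the Vortex codebase.

```rust
/// Hypothetical stand-in for the mask's run extraction: collect the
/// contiguous `[start, end)` ranges where `validity` is true.
fn valid_runs(validity: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start = None;
    for (i, &valid) in validity.iter().enumerate() {
        match (valid, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                runs.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the array.
    if let Some(s) = start {
        runs.push((s, validity.len()));
    }
    runs
}

/// Max over the valid elements only: each run is fully valid, so the inner
/// reduction is the same tight loop used for all-valid arrays.
fn max_over_runs(values: &[i32], runs: &[(usize, usize)]) -> Option<i32> {
    runs.iter()
        .filter_map(|&(start, end)| values[start..end].iter().copied().max())
        .max()
}

/// Sum over the valid elements only, widened to i64. Overflow of the
/// running total is ignored in this sketch for brevity.
fn sum_over_runs(values: &[i32], runs: &[(usize, usize)]) -> i64 {
    runs.iter()
        .map(|&(start, end)| values[start..end].iter().map(|&v| i64::from(v)).sum::<i64>())
        .sum()
}
```

The benchmark numbers in the commit message fit this structure: clustered nulls produce a few long runs, so most of the time is spent in the vectorizable inner loop, whereas scattered nulls produce many short runs and the per-run overhead dominates.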

claude added 2 commits May 22, 2026 11:13

Add aggregate max divan benchmark

1831743

joseph-isaacs changed the title ~~Add aggregate max divan benchmark~~ [claude] Add aggregate max divan benchmark May 22, 2026

claude added 2 commits May 22, 2026 13:18

Simplify aggregate max benchmark to one bench per type

ac47a14

Reduce aggregate max benchmark array length to 100k

08abd6a

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs requested a review from robert3005 May 22, 2026 14:30

joseph-isaacs changed the title ~~[claude] Add aggregate max divan benchmark~~ perf: aggregate min/max May 22, 2026

joseph-isaacs added the changelog/performance A performance improvement label May 22, 2026

robert3005 approved these changes May 22, 2026

robert3005 reviewed May 22, 2026

Comment thread vortex-array/benches/aggregate_max.rs Outdated

claude and others added 3 commits May 22, 2026 18:01

fix

3d86cfe

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs merged commit 4b089af into develop May 27, 2026
61 of 62 checks passed

joseph-isaacs deleted the claude/great-edison-jrGY0 branch May 27, 2026 10:24

joseph-isaacs merged 7 commits into develop from claude/great-edison-jrGY0
