Skip to content

perf: aggregate min/max#8061

Open
joseph-isaacs wants to merge 6 commits into
developfrom
claude/great-edison-jrGY0
Open

perf: aggregate min/max#8061
joseph-isaacs wants to merge 6 commits into
developfrom
claude/great-edison-jrGY0

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.

Signed-off-by: Joe Isaacs joe.isaacs@live.co.uk

claude added 2 commits May 22, 2026 11:13
Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The all-valid primitive min/max path used `itertools::minmax_by` with a
`total_compare` closure preceded by a NaN filter, which the autovectorizer
could not lower to packed min/max, leaving a scalar cmov reduction.

Route the all-true mask case for integer ptypes through a plain reduction.
Integers have no NaNs, so the NaN filter is unnecessary and LLVM vectorizes
the loop (pmaxub/pmaxsw, and pcmpgtd-based blends for i32/i64). Floats keep
the existing NaN-aware path.

Benchmarked over 1M elements: i32 all-valid ~2.93ms -> ~0.36ms, i64
~3.02ms -> ~0.55ms.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs changed the title Add aggregate max divan benchmark [claude] Add aggregate max divan benchmark May 22, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 22, 2026

Merging this PR will improve performance by 14.49%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 10 improved benchmarks
❌ 1 regressed benchmark
✅ 1240 untouched benchmarks
🆕 10 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime cuda/bitpacked_u8/unpack/3bw[100M] 353 µs 299.9 µs +17.69%
🆕 Simulation max_i32 N/A 222.6 µs N/A
🆕 Simulation max_i32_nulls_clustered N/A 254.3 µs N/A
🆕 Simulation max_i32_nulls_scattered N/A 1.6 ms N/A
🆕 Simulation max_i64 N/A 436.3 µs N/A
Simulation chunked_varbinview_into_canonical[(100, 100)] 358.3 µs 323.3 µs +10.84%
🆕 Simulation sum_i32 N/A 222.1 µs N/A
Simulation chunked_varbinview_into_canonical[(1000, 10)] 211.8 µs 176.2 µs +20.17%
🆕 Simulation sum_i32_nulls_clustered N/A 236.2 µs N/A
🆕 Simulation sum_i32_nulls_scattered N/A 1.6 ms N/A
🆕 Simulation sum_i64 N/A 600.4 µs N/A
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 187.5 µs 224.7 µs -16.54%
🆕 Simulation sum_u32 N/A 222.1 µs N/A
🆕 Simulation max_f64 N/A 1.1 ms N/A
Simulation encode_primitives[u8, (10000, 2)] 313.9 µs 278.2 µs +12.84%
Simulation encode_primitives[u8, (10000, 32)] 317.7 µs 282.4 µs +12.51%
Simulation encode_primitives[u8, (10000, 4)] 314.2 µs 278.4 µs +12.88%
Simulation encode_primitives[u8, (10000, 512)] 334.9 µs 299.1 µs +11.96%
Simulation encode_primitives[u8, (10000, 8)] 315.2 µs 279.2 µs +12.9%
Simulation for_compress_i32 753.2 µs 444.1 µs +69.61%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/great-edison-jrGY0 (6a44f51) with develop (495f30e)

Open in CodSpeed

claude added 2 commits May 22, 2026 13:18
Keep a single all-valid bench for i32, i64, and f64 instead of the
per-type all-valid/half-null pairs.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from robert3005 May 22, 2026 14:30
@joseph-isaacs joseph-isaacs changed the title [claude] Add aggregate max divan benchmark perf: aggregate min/max May 22, 2026
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label May 22, 2026
.with_inputs(|| PrimitiveArray::from_iter(data.iter().copied()).into_array())
.bench_refs(|a| {
a.statistics()
.compute_max::<i32>(&mut LEGACY_SESSION.create_execution_ctx())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create a local session here?

claude added 2 commits May 22, 2026 18:01
The all-valid integer sum did a per-element `checked_add`, whose overflow
early-return branch blocked autovectorization, leaving a scalar loop.

Sum narrower-than-64-bit integers in chunks of 65536 into a widened 64-bit
accumulator with no per-element check: a chunk of <64-bit values cannot
overflow the 64-bit accumulator (2^16 * (2^32-1) < 2^64), so only one
checked add per chunk is needed. This lets the inner loop vectorize to
packed widening adds (paddq + unpck). 64-bit inputs keep the per-element
checked path since a chunk of 64-bit values could itself overflow.

This observes overflow at chunk boundaries rather than per element, so a
signed sum whose running total transiently leaves i64 range but ends in
range now returns the true total instead of null. The final result is
unchanged whenever the existing per-batch combine did not already overflow.

Benchmarked over 100k elements: sum_i32 ~19us, sum_u32 ~15us, sum_i64 ~51us.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The with-nulls paths for primitive sum and min/max walked the values
zipped with a per-element validity bit, which kept them scalar even
though their all-valid counterparts already autovectorize to packed
widening adds (sum) and packed min/max.

Drive both null paths from the validity mask's contiguous `[start, end)`
runs (`Mask::slices`, computed once and cached). Each run is fully valid,
so it reuses the existing vectorized all-valid reduction: sum folds each
run through the chunked widening accumulator; min/max folds the native
per-run integer min/max, with floats chaining the runs through the
NaN-filtering reduction. Results are unchanged.

To support the fold, integer min/max now returns native `(min, max)`
(`integer_min_max_raw`) which both the all-valid and run paths reduce
before building the result scalar.

Benchmarked over 100k i32 elements (added nullable bench cases):
- clustered nulls: sum 106us -> 29us, max 159us -> 35us
- scattered ~50% nulls: no regression (sum 584us -> 530us, max 606us -> 593us)

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants