perf: aggregate min/max#8061
Conversation
Adds a divan benchmark exercising the min/max aggregation over primitive arrays (i32/i64/f64, with and without nulls) so we can measure and inspect the codegen of the max reduction path. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The all-valid primitive min/max path used `itertools::minmax_by` with a `total_compare` closure preceded by a NaN filter, which the autovectorizer could not lower to packed min/max, leaving a scalar cmov reduction. Route the all-true mask case for integer ptypes through a plain reduction. Integers have no NaNs, so the NaN filter is unnecessary and LLVM vectorizes the loop (pmaxub/pmaxsw, and pcmpgtd-based blends for i32/i64). Floats keep the existing NaN-aware path. Benchmarked over 1M elements: i32 all-valid ~2.93ms -> ~0.36ms, i64 ~3.02ms -> ~0.55ms. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will improve performance by 14.49%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | WallTime | cuda/bitpacked_u8/unpack/3bw[100M] |
353 µs | 299.9 µs | +17.69% |
| 🆕 | Simulation | max_i32 |
N/A | 222.6 µs | N/A |
| 🆕 | Simulation | max_i32_nulls_clustered |
N/A | 254.3 µs | N/A |
| 🆕 | Simulation | max_i32_nulls_scattered |
N/A | 1.6 ms | N/A |
| 🆕 | Simulation | max_i64 |
N/A | 436.3 µs | N/A |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] |
358.3 µs | 323.3 µs | +10.84% |
| 🆕 | Simulation | sum_i32 |
N/A | 222.1 µs | N/A |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] |
211.8 µs | 176.2 µs | +20.17% |
| 🆕 | Simulation | sum_i32_nulls_clustered |
N/A | 236.2 µs | N/A |
| 🆕 | Simulation | sum_i32_nulls_scattered |
N/A | 1.6 ms | N/A |
| 🆕 | Simulation | sum_i64 |
N/A | 600.4 µs | N/A |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
187.5 µs | 224.7 µs | -16.54% |
| 🆕 | Simulation | sum_u32 |
N/A | 222.1 µs | N/A |
| 🆕 | Simulation | max_f64 |
N/A | 1.1 ms | N/A |
| ⚡ | Simulation | encode_primitives[u8, (10000, 2)] |
313.9 µs | 278.2 µs | +12.84% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 32)] |
317.7 µs | 282.4 µs | +12.51% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 4)] |
314.2 µs | 278.4 µs | +12.88% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 512)] |
334.9 µs | 299.1 µs | +11.96% |
| ⚡ | Simulation | encode_primitives[u8, (10000, 8)] |
315.2 µs | 279.2 µs | +12.9% |
| ⚡ | Simulation | for_compress_i32 |
753.2 µs | 444.1 µs | +69.61% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/great-edison-jrGY0 (6a44f51) with develop (495f30e)
Keep a single all-valid bench for i32, i64, and f64 instead of the per-type all-valid/half-null pairs. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
| .with_inputs(|| PrimitiveArray::from_iter(data.iter().copied()).into_array()) | ||
| .bench_refs(|a| { | ||
| a.statistics() | ||
| .compute_max::<i32>(&mut LEGACY_SESSION.create_execution_ctx()) |
There was a problem hiding this comment.
can you create a local session here?
The all-valid integer sum did a per-element `checked_add`, whose overflow early-return branch blocked autovectorization, leaving a scalar loop. Sum narrower-than-64-bit integers in chunks of 65536 into a widened 64-bit accumulator with no per-element check: a chunk of <64-bit values cannot overflow the 64-bit accumulator (2^16 * (2^32-1) < 2^64), so only one checked add per chunk is needed. This lets the inner loop vectorize to packed widening adds (paddq + unpck). 64-bit inputs keep the per-element checked path since a chunk of 64-bit values could itself overflow. This observes overflow at chunk boundaries rather than per element, so a signed sum whose running total transiently leaves i64 range but ends in range now returns the true total instead of null. The final result is unchanged whenever the existing per-batch combine did not already overflow. Benchmarked over 100k elements: sum_i32 ~19us, sum_u32 ~15us, sum_i64 ~51us. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The with-nulls paths for primitive sum and min/max walked the values zipped with a per-element validity bit, which kept them scalar even though their all-valid counterparts already autovectorize to packed widening adds (sum) and packed min/max. Drive both null paths from the validity mask's contiguous `[start, end)` runs (`Mask::slices`, computed once and cached). Each run is fully valid, so it reuses the existing vectorized all-valid reduction: sum folds each run through the chunked widening accumulator; min/max folds the native per-run integer min/max, with floats chaining the runs through the NaN-filtering reduction. Results are unchanged. To support the fold, integer min/max now returns native `(min, max)` (`integer_min_max_raw`) which both the all-valid and run paths reduce before building the result scalar. Benchmarked over 100k i32 elements (added nullable bench cases): - clustered nulls: sum 106us -> 29us, max 159us -> 35us - scattered ~50% nulls: no regression (sum 584us -> 530us, max 606us -> 593us) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Adds a divan benchmark exercising the min/max aggregation over primitive
arrays (i32/i64/f64, with and without nulls) so we can measure and inspect
the codegen of the max reduction path.
Signed-off-by: Joe Isaacs joe.isaacs@live.co.uk