Add cast pushdown optimization for bit-packed integer widening #8046
joseph-isaacs wants to merge 8 commits into
Widening a bit-packed narrow integer column to a wider type (e.g. u16 -> u32) currently has no cast pushdown: cast(bit_packed) canonicalizes to a full-length narrow PrimitiveArray and then casts it, allocating two full-length buffers and round-tripping the narrow intermediate through RAM.

Add `BitUnpackedChunks::decode_cast_into`, which unpacks each 1024-element FastLanes chunk into the existing cache-resident scratch buffer and maps each value through a closure into a differently-typed output, plus `unpack_and_cast_into_builder`, which uses it to unpack straight into a wide PrimitiveBuilder (handling validity and patches).

Add a divan benchmark (cast_bitpacked) comparing the current canonicalize-then-cast path against the pushdown, over single and chunked arrays, with and without patches.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
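The chunked decode-and-cast loop described in this commit can be illustrated with a minimal sketch. Everything below is hypothetical and is not the Vortex API: the `pack`, `unpack_chunk`, and `decode_cast_into` functions are stand-ins, and the packing uses a plain sequential bit layout rather than the FastLanes transposed layout. It only aims to show the shape of the idea: a fixed scratch buffer is refilled one chunk at a time, and each value is mapped through a closure straight into the wide output, so no full-length narrow array is ever materialized.

```rust
/// Values per chunk, matching the 1024-element chunk size mentioned above.
const CHUNK: usize = 1024;

/// Pack `values` at `width` bits each (width <= 16), LSB-first.
/// Plain sequential layout for illustration only, NOT the FastLanes layout.
fn pack(values: &[u16], width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * width + 7) / 8];
    for (i, &v) in values.iter().enumerate() {
        for b in 0..width {
            if (v >> b) & 1 == 1 {
                let bit = i * width + b;
                out[bit / 8] |= 1 << (bit % 8);
            }
        }
    }
    out
}

/// Unpack `scratch.len()` values, starting at value index `start`, into `scratch`.
fn unpack_chunk(packed: &[u8], width: usize, start: usize, scratch: &mut [u16]) {
    for (j, slot) in scratch.iter_mut().enumerate() {
        let mut v = 0u16;
        for b in 0..width {
            let bit = (start + j) * width + b;
            v |= (((packed[bit / 8] >> (bit % 8)) & 1) as u16) << b;
        }
        *slot = v;
    }
}

/// Hypothetical decode-and-cast loop: the narrow values only ever live in the
/// small reusable scratch buffer; the output receives already-converted values.
fn decode_cast_into<T>(
    packed: &[u8],
    width: usize,
    len: usize,
    out: &mut Vec<T>,
    f: impl Fn(u16) -> T,
) {
    let mut scratch = [0u16; CHUNK];
    out.reserve(len);
    let mut start = 0;
    while start < len {
        // The final chunk may be shorter than CHUNK.
        let n = CHUNK.min(len - start);
        unpack_chunk(packed, width, start, &mut scratch[..n]);
        out.extend(scratch[..n].iter().map(|&v| f(v)));
        start += n;
    }
}
```

In this sketch a caller would pass a closure such as `|v| v as u32` to widen u16 values into a `Vec<u32>`; validity and patches are deliberately left out.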
Extend BitPacked's CastKernel so that widening integer casts (e.g. u16 -> u32) dispatch to the unpack-and-cast pushdown automatically, instead of falling back to canonicalize-then-cast. The cast is gated to strictly wider integer targets where every bit-packable value is representable (unsigned source, or signed-to-signed), so no per-value bounds check is needed.

Update the cast_bitpacked benchmark to measure the real array.cast(u32).execute() path alongside an explicit canonicalize-then-cast baseline and the direct helper.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
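The gating rule in this commit (strictly wider target; unsigned source, or signed-to-signed) can be written as a small predicate. This is a sketch under assumptions: the `IntType` enum and `is_lossless_widening` function are hypothetical names for illustration and do not correspond to Vortex's own type definitions.

```rust
/// Hypothetical integer type tag, used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IntType {
    U8,
    U16,
    U32,
    U64,
    I8,
    I16,
    I32,
    I64,
}

impl IntType {
    /// Width of the type in bits.
    fn bits(self) -> u32 {
        match self {
            IntType::U8 | IntType::I8 => 8,
            IntType::U16 | IntType::I16 => 16,
            IntType::U32 | IntType::I32 => 32,
            IntType::U64 | IntType::I64 => 64,
        }
    }

    fn is_signed(self) -> bool {
        matches!(
            self,
            IntType::I8 | IntType::I16 | IntType::I32 | IntType::I64
        )
    }
}

/// True when every value of `src` is representable in `dst` and `dst` is
/// strictly wider, so the cast needs no per-value bounds check.
fn is_lossless_widening(src: IntType, dst: IntType) -> bool {
    let strictly_wider = dst.bits() > src.bits();
    // Unsigned sources fit in any strictly wider target; signed sources
    // only fit in signed targets (negative values have no unsigned form).
    let sign_ok = !src.is_signed() || dst.is_signed();
    strictly_wider && sign_ok
}
```

Under this rule u16 -> u32, u16 -> i32, and i16 -> i32 qualify, while i16 -> u32 and any same-width or narrowing cast would fall back to the general path.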
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | baseline_eq[16, 65536] | 287.6 µs | 259.5 µs | +10.8% |
| ❌ | Simulation | fast_lt_out_of_range[4, 65536] | 204.3 µs | 262.4 µs | -22.12% |
| ⚡ | Simulation | baseline_lt[16, 65536] | 302.7 µs | 274.7 µs | +10.2% |
Comparing claude/cast-bitpacked-pushdown-VNtVh (6603c3f) with develop (c54ce7e)
Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Generalize apply_patches_to_uninit_range_fn to a cross-type Fn(S) -> T so the cast pushdown reuses it instead of a near-identical copy, and drop the redundant identity wrapper. Behaviour and performance are unchanged.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
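The idea of a patch-application helper that is generic over a cross-type `Fn(S) -> T` can be sketched as follows. The function name `apply_patches_with` and its signature are hypothetical, and the sketch writes into an already-initialized slice rather than an uninitialized range, which is a simplification of what the commit describes.

```rust
/// Hypothetical helper: overwrite `out[indices[i]]` with `f(values[i])`.
/// `S` is the patch (source) type and `T` the output type, so one function
/// serves both the same-type path and the widening path.
fn apply_patches_with<S: Copy, T>(
    out: &mut [T],
    indices: &[usize],
    values: &[S],
    f: impl Fn(S) -> T,
) {
    assert_eq!(indices.len(), values.len());
    for (&idx, &v) in indices.iter().zip(values) {
        out[idx] = f(v);
    }
}
```

With such a shape, the same-type case passes an identity closure (`|v| v`) and the widening case passes a conversion such as `u32::from`, so no separate identity wrapper is needed.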
Replace the direct-kernel and direct-helper cast tests with a single end-to-end test that drives array.cast(target).execute(), proving the public Vortex path dispatches to BitPacked's widening pushdown across all supported integer pairs, chunk-boundary lengths, and a sliced case.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.022x ➖
datafusion / vortex-file-compressed (1.022x ➖, 0↑ 1↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.879x ✅, 4↑ 0↓)
datafusion / vortex-compact (0.881x ✅, 8↑ 0↓)
datafusion / parquet (0.917x ➖, 6↑ 0↓)
duckdb / vortex-file-compressed (1.093x ➖, 0↑ 2↓)
duckdb / vortex-compact (0.933x ➖, 4↑ 0↓)
duckdb / parquet (0.925x ➖, 6↑ 0↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.005x ➖, 0↑ 1↓)
datafusion / parquet (1.006x ➖, 2↑ 1↓)
datafusion / arrow (1.011x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.997x ➖, 0↑ 0↓)
duckdb / parquet (1.032x ➖, 1↑ 2↓)
duckdb / duckdb (1.008x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.001x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.008x ➖, 1↑ 2↓)
datafusion / parquet (1.006x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.004x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 2↑ 2↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
duckdb / duckdb (1.003x ➖, 1↑ 2↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.999x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.957x ➖, 1↑ 0↓)
datafusion / parquet (0.919x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.878x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.958x ➖, 0↑ 0↓)
duckdb / parquet (0.992x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.059x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.019x ➖, 0↑ 0↓)
duckdb / parquet (1.037x ➖, 0↑ 1↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: Random Access
Vortex (geomean): 0.970x ➖
unknown / unknown (0.981x ➖, 1↑ 0↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (medium confidence)
datafusion / vortex-file-compressed (0.897x ✅, 11↑ 0↓)
datafusion / vortex-compact (0.859x ✅, 20↑ 0↓)
datafusion / parquet (0.878x ✅, 18↑ 0↓)
datafusion / arrow (1.142x ❌, 1↑ 17↓)
duckdb / vortex-file-compressed (1.035x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.166x ❌, 0↑ 21↓)
duckdb / parquet (1.001x ➖, 0↑ 0↓)
duckdb / duckdb (1.100x ➖, 0↑ 11↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.983x ➖, 1↑ 0↓)
datafusion / parquet (0.992x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.983x ➖, 4↑ 3↓)
duckdb / parquet (0.995x ➖, 0↑ 0↓)
duckdb / duckdb (0.997x ➖, 0↑ 0↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.920x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.934x ➖, 0↑ 0↓)
datafusion / parquet (0.945x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.957x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.924x ➖, 0↑ 0↓)
duckdb / parquet (0.957x ➖, 0↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 1.012x ➖
unknown / unknown (1.024x ➖, 1↑ 14↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.003x ➖, 1↑ 1↓)
datafusion / vortex-compact (1.010x ➖, 0↑ 1↓)
datafusion / parquet (0.798x ➖, 4↑ 1↓)
duckdb / vortex-file-compressed (1.040x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.084x ➖, 0↑ 1↓)
duckdb / parquet (0.974x ➖, 0↑ 1↓)
Summary
This PR implements a "cast pushdown" optimization for widening casts on bit-packed integer columns (e.g., u16 -> u32). Rather than canonicalizing to a full-length intermediate array and then casting it, the optimization unpacks each FastLanes chunk into a cache-resident scratch buffer and casts values directly into the output buffer during decompression.
Running locally I get