bench(expression): Add widening cast benchmarks by yingsu00 · Pull Request #2171 · IBM/velox

yingsu00 · 2026-06-23T03:38:19Z

No description provided.

…ast evaluation

Adds a folly benchmark harness for CAST(...) over a stable dictionary base, used to evaluate evalWithMemo improvements on real param matrices without rebuilding test scaffolding ad-hoc.

The benchmark sweeps the dimensions that drive cast performance:

- numVectors: input vectors evaluated per iteration (fixed at 1000 in the matrix below; not encoded in entry names)
- rowsPerVector: rows in each input vector
- distinctValueCount: dictionary base cardinality (DICT only)
- newIndicesPerVector: new dictionary indices each input vector introduces relative to the previous one; this drives the memoization cache miss rate (DICT only)
- nullPct: percentage of positions marked null, one of {0, 50, 100}. For DICT, nulls live on the dictionary wrap (they reach PeeledEncoding::translateToInnerRows via wrapNulls_); for FLAT, on the flat input itself. nullPct=100 short-circuits the cast to a null constant and surfaces the absolute measurement floor.

For each cast pair the matrix registers 27 dictionary-input combinations (rv x dvc x nipv) and 3 flat baselines (rv). Each source line through the DICT_NULLS / FLAT_NULLS wrapper macros expands into three benchmark entries, one per nullPct value, so the source stays the same length but every existing combination has three null variants registered.

Cast pairs covered:

- BIGINT -> VARCHAR
- INT -> BIGINT
- DATE -> VARCHAR
- DATE -> TIMESTAMP
- REAL -> DOUBLE

Entry names read:

    DICT_<from>To<to>(rowsPerVector_distinctValueCount_newIndicesPerVector_nullPct)
    FLAT_<from>To<to>(rowsPerVector_nullPct)

Wires velox/expression/benchmarks into the build under the existing VELOX_ENABLE_BENCHMARKS gate (it was never added before) and lands the velox_benchmark_widening_cast target.
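The entry-name format and the three-way nullPct expansion described above can be sketched as follows. This is an illustration only: the helper names (dictEntryName, flatEntryName, dictNullVariants) are hypothetical and do not come from the benchmark source, which uses the DICT_NULLS / FLAT_NULLS macros instead.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: formats a DICT entry name as
// DICT_<from>To<to>(rowsPerVector_distinctValueCount_newIndicesPerVector_nullPct).
std::string dictEntryName(const std::string& from, const std::string& to,
                          int rowsPerVector, int distinctValueCount,
                          int newIndicesPerVector, int nullPct) {
  assert(nullPct == 0 || nullPct == 50 || nullPct == 100);
  return "DICT_" + from + "To" + to + "(" + std::to_string(rowsPerVector) +
         "_" + std::to_string(distinctValueCount) + "_" +
         std::to_string(newIndicesPerVector) + "_" + std::to_string(nullPct) +
         ")";
}

// Hypothetical helper: formats a FLAT entry name as
// FLAT_<from>To<to>(rowsPerVector_nullPct).
std::string flatEntryName(const std::string& from, const std::string& to,
                          int rowsPerVector, int nullPct) {
  assert(nullPct == 0 || nullPct == 50 || nullPct == 100);
  return "FLAT_" + from + "To" + to + "(" + std::to_string(rowsPerVector) +
         "_" + std::to_string(nullPct) + ")";
}

// Models how one (rv, dvc, nipv) source line becomes three registered
// entries, one per nullPct value.
std::vector<std::string> dictNullVariants(const std::string& from,
                                          const std::string& to,
                                          int rowsPerVector,
                                          int distinctValueCount,
                                          int newIndicesPerVector) {
  std::vector<std::string> names;
  for (int nullPct : {0, 50, 100}) {
    names.push_back(dictEntryName(from, to, rowsPerVector, distinctValueCount,
                                  newIndicesPerVector, nullPct));
  }
  return names;
}
```

With 27 dictionary combinations and 3 flat baselines per cast pair, this expansion gives 27 x 3 = 81 DICT entries and 3 x 3 = 9 FLAT entries per pair.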
Sample at FLAT_IntToBigint shows how nullPct affects per-row time:

    FLAT_IntToBigint(100_0)     12.84 ns/row   no nulls
    FLAT_IntToBigint(100_50)    16.70 ns/row   half rows null
    FLAT_IntToBigint(100_100)    8.30 ns/row   all-nulls short-circuit

A short investigation was triggered by entries like DICT_IntToBigint(10000_5_1_0) printing "0.00fs Infinity". These are not measurement bugs. Folly's runBenchmarkGetNSPerIteration subtracts a globally-measured baseline (the cost of an empty BENCHMARK loop, ~0.5 ns/iter) and floors the result at zero (Benchmark.cpp:207). For configurations where the dictionary cache covers the whole base in the first couple of batches (small dvc relative to rowsPerVector, so the peelEncodings bypass kicks in for ~998 of 1000 batches), the per-row cast cost falls below the baseline. "0.00fs" reads as "below folly's resolution after baseline subtraction", not actual zero. The corresponding nullPct=100 entry (which doesn't hit the bypass: translateToInnerRows returns an empty inner-rows set, so dictionaryCache_ never populates) still measures around 400-500 ps/iter, showing the resolution floor folly can report. Added the explanation to the file's header comment so the next reader doesn't have to re-derive it.

While here, also pin the cast result with folly::doNotOptimizeAway before reading ->size(), so the compiler can't drop any of the intermediate state from the cast call. The numbers didn't change (the side effects on dictionaryCache_ already prevented DCE), but the pin makes the intent explicit and is a defensible default for any cast benchmark.

Verified: with the suspender excluding setup, runDictionary's cast loop runs in ~500 us per batch-1000 (measured via direct steady_clock during investigation), which is 50 ps/row for the 10000-row config. That's below folly's baseline (~500 ps), so the display floors to 0.00fs.
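The arithmetic behind the "0.00fs" reading can be checked with a small model. The sketch below is not folly's code; it only mirrors the behavior described above (subtract a baseline, floor at zero), and the function names reportedNsPerIter and psPerRow are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Model of the reporting rule described above: the measured time has a
// baseline subtracted and the result is floored at zero.
double reportedNsPerIter(double measuredNs, double baselineNs) {
  assert(measuredNs >= 0.0 && baselineNs >= 0.0);
  return std::max(0.0, measuredNs - baselineNs);
}

// Converts the wall time of one benchmark iteration (numVectors batches of
// rowsPerVector rows each) into picoseconds per row.
double psPerRow(double iterationMicros, long numVectors, long rowsPerVector) {
  assert(numVectors > 0 && rowsPerVector > 0);
  const double totalRows =
      static_cast<double>(numVectors) * static_cast<double>(rowsPerVector);
  return iterationMicros * 1e6 / totalRows;  // 1 us = 1e6 ps
}
```

For the verified case, psPerRow(500.0, 1000, 10000) gives 50 ps, i.e. 0.05 ns, and reportedNsPerIter(0.05, 0.5) gives 0 under this model, which matches the floored display.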

… date_format to WideningCastBenchmark

Extends the existing single-base, single-thread benchmark with three dimensions that the original was missing. Each models a behavior seen in production that the original could not measure.

Multi-base alternation. The original always reuses one dictionary base FlatVector across all 1000 batches per iteration. That measures only the steady-state cache-hit cost; it never exercises the numMemoBaseChange path of evalWithMemo, which a real scan operator hits every time its underlying storage chunk advances. Add a batchesPerBase parameter and build ceil(numVectors / batchesPerBase) distinct base FlatVectors with distinct content; the i-th batch wraps base [i / batchesPerBase]. Sweep {1, 10, 100} across the higher-signal shapes per type pair via DICT_ALT_* macros. Entry names get a _bpb<n> suffix; entries without that suffix retain the original single-base semantics (= 1000, matching numVectors).

Multi-thread. The production refcount cost on the source Buffer is a cross-driver effect: many threads each constructing a DictionaryVector that wraps the same file-cache-shared Buffer issue 2 atomics per batch per thread on one cache line, and that line bounces between L1 caches. The original benchmark cannot reproduce this. Add runDictionaryMultiThread, which spawns N worker threads, each with its own ExecCtx and ExprSet (because Expr evaluation is not thread-safe) but all evaluating against the shared base FlatVectors built on the calling thread. Entry name format: `MT_<funcName>(threads<n>_<rv>_<dvc>_<nipv>_<nullPct> _bpb<n>)`. Sweep numThreads in {4, 16} and batchesPerBase in {1, 100, 1000} for the 3 type pairs and the production expression where the cross-driver pattern matters in prod (BigintToVarchar, DateToVarchar, DateToTimestamp, DateFormatProd).

Production expression. Add DICT_DateFormatProd / FLAT_DateFormatProd / MT_DICT_DateFormatProd, which exercise the exact shape from the slow query that motivated this work:

    date_format(CAST(date_trunc('day', date_add('day', 0 - mod(((day_of_week(c0) % 7) - 1) + 7, 7), c0)) AS timestamp), '%Y-%m-%d')

This is a DATE -> VARCHAR chain dominated by the date_format result's stringBuffers_ refcount churn under evalWithMemo. Sweep the same (rv, dvc) shapes as the alt entries; include both single-base and _bpb sweeps so the cache-hit floor and the realistic mix are both observable.

The original ~430 entries (single-base sweep across 5 type pairs + flat baselines + nullPcts) are preserved unchanged so existing measurements remain comparable.

Built and ran a representative subset of new entries (DICT_BigintToVarchar/DICT_DateFormatProd at bpb=10, and MT_DICT_DateFormatProd at threads=4/16) at --bm_min_iters=1 to confirm correct registration and execution.
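The batch-to-base mapping from the multi-base alternation paragraph can be sketched as plain index arithmetic. The function names below are hypothetical and the sketch leaves out vector construction entirely; it only shows how batchesPerBase determines the number of bases and how often the base changes within one iteration.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct base vectors: ceil(numVectors / batchesPerBase).
std::size_t numBases(std::size_t numVectors, std::size_t batchesPerBase) {
  assert(batchesPerBase > 0);
  return (numVectors + batchesPerBase - 1) / batchesPerBase;
}

// Index of the base that the i-th batch wraps: i / batchesPerBase.
std::size_t baseIndexForBatch(std::size_t batch, std::size_t batchesPerBase) {
  assert(batchesPerBase > 0);
  return batch / batchesPerBase;
}

// Counts how many times consecutive batches wrap a different base, i.e. how
// often a base-change path would be exercised in one iteration.
std::size_t countBaseChanges(std::size_t numVectors,
                             std::size_t batchesPerBase) {
  std::size_t changes = 0;
  for (std::size_t i = 1; i < numVectors; ++i) {
    if (baseIndexForBatch(i, batchesPerBase) !=
        baseIndexForBatch(i - 1, batchesPerBase)) {
      ++changes;
    }
  }
  return changes;
}
```

With numVectors = 1000, batchesPerBase = 10 yields 100 bases and 99 base changes per iteration, while batchesPerBase = 1000 yields a single base and no changes, which corresponds to the original single-base behavior.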

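The day offset inside the production expression, 0 - mod(((day_of_week(c0) % 7) - 1) + 7, 7), can be worked through in isolation. The sketch below assumes day_of_week returns the ISO day number (1 = Monday through 7 = Sunday), as in Presto; the function name weekStartOffsetDays is hypothetical.

```cpp
#include <cassert>

// Assumption: isoDayOfWeek is 1 (Monday) through 7 (Sunday).
// Returns the value passed to date_add('day', ...) in the expression above:
// zero on Monday, otherwise the negative number of days since Monday.
int weekStartOffsetDays(int isoDayOfWeek) {
  assert(isoDayOfWeek >= 1 && isoDayOfWeek <= 7);
  // ((dow % 7) - 1) + 7 is always in [6, 12], so % never sees a negative
  // operand here.
  return 0 - ((((isoDayOfWeek % 7) - 1) + 7) % 7);
}
```

Under that assumption the offset is 0 for Monday, -1 for Tuesday, and -6 for Sunday, so the date_add call moves each date back to the Monday of its week before the cast to timestamp and the date_format call.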
yingsu00 added 2 commits June 18, 2026 21:42

yingsu00 requested a review from xin-zhang2 June 23, 2026 03:38

yingsu00 self-assigned this Jun 23, 2026

yingsu00 requested a review from majetideepak as a code owner June 23, 2026 03:38

yingsu00 added the bolt label Jun 23, 2026

yingsu00 removed the request for review from majetideepak June 23, 2026 03:38

yingsu00 wants to merge 2 commits into IBM:bolt from yingsu00:cast-perf-02-widening-cast-benchmark
