
bench(expression): Add widening cast benchmarks #2171

Open
yingsu00 wants to merge 2 commits into IBM:bolt from yingsu00:cast-perf-02-widening-cast-benchmark

@yingsu00

Collaborator

No description provided.

yingsu00 added 2 commits June 18, 2026 21:42
…ast evaluation

Adds a folly benchmark harness for CAST(...) over a stable dictionary
base, used to evaluate evalWithMemo improvements on real param
matrices without rebuilding test scaffolding ad-hoc.

The benchmark sweeps the dimensions that drive cast performance:
- numVectors          : input vectors evaluated per iteration
                        (fixed at 1000 in the matrix below;
                        not encoded in entry names)
- rowsPerVector       : rows in each input vector
- distinctValueCount  : dictionary base cardinality (DICT only)
- newIndicesPerVector : new dictionary indices each input vector
                        introduces relative to the previous one -
                        drives the memoization cache miss rate
                        (DICT only)
- nullPct             : percentage of positions marked null in
                        {0, 50, 100}. For DICT, nulls live on the
                        dictionary wrap (reach
                        PeeledEncoding::translateToInnerRows via
                        wrapNulls_); for FLAT, on the flat input
                        itself. nullPct=100 short-circuits the cast
                        to a null constant and surfaces the absolute
                        measurement floor.

For each cast pair the matrix registers 27 dictionary-input
combinations (rv x dvc x nipv) and 3 flat baselines (rv). Each
source line through the DICT_NULLS / FLAT_NULLS wrapper macros
expands into three benchmark entries - one per nullPct value - so
the source stays the same length but every existing combination has
three null variants registered.

Cast pairs covered:
- BIGINT  -> VARCHAR
- INT     -> BIGINT
- DATE    -> VARCHAR
- DATE    -> TIMESTAMP
- REAL    -> DOUBLE

Entry names read:
  DICT_<from>To<to>(rowsPerVector_distinctValueCount_newIndicesPerVector_nullPct)
  FLAT_<from>To<to>(rowsPerVector_nullPct)
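
As a rough illustration of how the matrix and the naming scheme fit together, the sketch below enumerates the entry names for one cast pair. The concrete sweep values are hypothetical (the message only states the 3 x 3 x 3 dictionary shape, the 3 flat baselines, and the three nullPct values), and the helper names are invented for this example rather than taken from the benchmark source.

```cpp
#include <string>
#include <vector>

// Hypothetical sweep values: only the counts (3 x 3 x 3 DICT shapes,
// 3 FLAT shapes) and the nullPct set come from the commit message.
const std::vector<int> kRowsPerVector = {100, 1000, 10000};
const std::vector<int> kDistinctValueCount = {5, 100, 1000};
const std::vector<int> kNewIndicesPerVector = {0, 1, 10};
const std::vector<int> kNullPct = {0, 50, 100};

// Builds DICT_<from>To<to>(rv_dvc_nipv_nullPct).
std::string dictEntryName(
    const std::string& pair, int rv, int dvc, int nipv, int nullPct) {
  return "DICT_" + pair + "(" + std::to_string(rv) + "_" +
      std::to_string(dvc) + "_" + std::to_string(nipv) + "_" +
      std::to_string(nullPct) + ")";
}

// Builds FLAT_<from>To<to>(rv_nullPct).
std::string flatEntryName(const std::string& pair, int rv, int nullPct) {
  return "FLAT_" + pair + "(" + std::to_string(rv) + "_" +
      std::to_string(nullPct) + ")";
}

// Enumerates every entry for one cast pair: each shape is expanded into
// one entry per nullPct value, mirroring what the *_NULLS macros are
// described as doing.
std::vector<std::string> entriesForPair(const std::string& pair) {
  std::vector<std::string> names;
  for (int rv : kRowsPerVector) {
    for (int nullPct : kNullPct) {
      names.push_back(flatEntryName(pair, rv, nullPct));
    }
    for (int dvc : kDistinctValueCount) {
      for (int nipv : kNewIndicesPerVector) {
        for (int nullPct : kNullPct) {
          names.push_back(dictEntryName(pair, rv, dvc, nipv, nullPct));
        }
      }
    }
  }
  return names;
}
```

Under these assumptions each cast pair yields 27 x 3 dictionary entries plus 3 x 3 flat entries, that is 90 names.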

Wires velox/expression/benchmarks into the build under the existing
VELOX_ENABLE_BENCHMARKS gate (it was never added before) and lands
the velox_benchmark_widening_cast target.

Sample at FLAT_IntToBigint shows how nullPct affects per-row time:
  FLAT_IntToBigint(100_0)     12.84 ns/row   no nulls
  FLAT_IntToBigint(100_50)    16.70 ns/row   half rows null
  FLAT_IntToBigint(100_100)    8.30 ns/row   all-nulls short-circuit

A short investigation was triggered by entries like
DICT_IntToBigint(10000_5_1_0) printing "0.00fs Infinity":

These are not measurement bugs. Folly's runBenchmarkGetNSPerIteration
subtracts a globally-measured baseline (cost of an empty BENCHMARK
loop, ~0.5 ns/iter) and floors the result at zero
(Benchmark.cpp:207). For configurations where the dictionary cache
covers the whole base in the first couple of batches - small dvc
relative to rowsPerVector, so the peelEncodings bypass kicks in for
~998 of 1000 batches - the per-row cast cost falls below the
baseline. "0.00fs" reads as "below folly's resolution after baseline
subtraction", not actual zero. The corresponding nullPct=100 entry
(which doesn't hit the bypass: translateToInnerRows returns an empty
inner-rows set, so dictionaryCache_ never populates) still measures
around 400-500 ps/iter, showing the resolution floor folly can
report. Added the explanation to the file's header comment so the
next reader doesn't have to re-derive it.

While here, also pin the cast result with folly::doNotOptimizeAway
before reading ->size(), so the compiler can't drop any of the
intermediate state from the cast call. The numbers didn't change -
the side effects on dictionaryCache_ already prevented DCE - but
the pin makes the intent explicit and is a defensible default for
any cast benchmark.

Verified: with the suspender excluding setup, runDictionary's cast
loop runs in ~500 us for the 1000 batches of one iteration (measured
via direct steady_clock during investigation), which is 50 ps/row
for the 10000-row config. That's below folly's baseline (~500 ps),
so the display floors to 0.00fs.
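
The per-row figure follows directly from the numbers quoted above. The snippet below restates that arithmetic in integer picoseconds; the constant names are illustrative only.

```cpp
#include <cstdint>

// Figures quoted in the message, expressed in picoseconds.
constexpr std::int64_t kLoopTimePs = 500LL * 1000 * 1000;  // ~500 us
constexpr std::int64_t kNumVectors = 1000;                 // batches per iteration
constexpr std::int64_t kRowsPerVector = 10000;             // rows per batch
constexpr std::int64_t kBaselinePs = 500;                  // ~0.5 ns empty-loop baseline

// Time spent per row across all batches of one iteration.
constexpr std::int64_t perRowPs(
    std::int64_t loopPs, std::int64_t vectors, std::int64_t rows) {
  return loopPs / (vectors * rows);
}

static_assert(
    perRowPs(kLoopTimePs, kNumVectors, kRowsPerVector) == 50,
    "500 us over 10^7 rows is 50 ps per row");
static_assert(
    perRowPs(kLoopTimePs, kNumVectors, kRowsPerVector) < kBaselinePs,
    "per-row cost is below the baseline, so the display floors to zero");
```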
… date_format to WideningCastBenchmark

Extends the existing single-base, single-thread benchmark with three
dimensions that the original was missing - each models a real
behavior production sees but the original could not measure.

Multi-base alternation. The original always reuses one dictionary
base FlatVector across all 1000 batches per iteration. That measures
only the steady-state cache-hit cost; it never exercises the
numMemoBaseChange path of evalWithMemo, which a real scan operator
hits every time its underlying storage chunk advances. Add a
batchesPerBase parameter and build ceil(numVectors /
batchesPerBase) distinct base FlatVectors with distinct content; the
i-th batch wraps base [i / batchesPerBase]. Sweep {1, 10, 100}
across the higher-signal shapes per type pair via DICT_ALT_*
macros. Entry names get a _bpb<n> suffix; entries without that
suffix retain the original single-base semantics (batchesPerBase =
1000, matching numVectors).
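
The batch-to-base mapping described above reduces to two integer expressions. The sketch below shows them in isolation; the function names are hypothetical, and the base-change count is an inference from the mapping rather than something the benchmark is stated to compute.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct bases: ceil(numVectors / batchesPerBase).
std::size_t numBases(std::size_t numVectors, std::size_t batchesPerBase) {
  assert(batchesPerBase > 0);
  return (numVectors + batchesPerBase - 1) / batchesPerBase;
}

// The i-th batch wraps base [i / batchesPerBase].
std::size_t baseIndexForBatch(std::size_t batch, std::size_t batchesPerBase) {
  assert(batchesPerBase > 0);
  return batch / batchesPerBase;
}

// How often consecutive batches see a different base within one iteration.
std::size_t baseChanges(std::size_t numVectors, std::size_t batchesPerBase) {
  return numVectors == 0 ? 0 : numBases(numVectors, batchesPerBase) - 1;
}
```

With numVectors = 1000, batchesPerBase values of 1, 10, and 100 give 1000, 100, and 10 bases respectively, while 1000 gives a single base and therefore no base changes.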

Multi-thread. The production refcount cost on the source Buffer is
a cross-driver effect: when many threads each construct a
DictionaryVector that wraps the same file-cache-shared Buffer, every
thread issues 2 atomics per batch on one cache line, and that line
bounces between L1 caches. The original benchmark cannot reproduce
this. Add runDictionaryMultiThread that spawns N worker threads,
each with its own ExecCtx and ExprSet (because Expr evaluation is
not thread-safe) but all evaluating against the shared base
FlatVectors built on the calling thread. Entry names read:
  MT_<funcName>(threads<n>_<rv>_<dvc>_<nipv>_<nullPct>_bpb<n>)
Sweep numThreads in {4, 16} and batchesPerBase in {1, 100, 1000}
for the 3 type pairs and the production expression where the
cross-driver pattern matters in prod (BigintToVarchar,
DateToVarchar, DateToTimestamp, DateFormatProd).
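
To make the contention pattern concrete, here is a standard-library model of it, with std::shared_ptr standing in for the shared Buffer. It is not the benchmark's code: SharedBase and runWorkers are invented names, and the real workers evaluate expressions through their own ExecCtx and ExprSet instead of the trivial loop body shown here.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Stand-in for the shared, read-only base data.
struct SharedBase {
  std::vector<int> values;
};

// Spawns numThreads workers that all wrap the same base. Each batch takes
// one reference (atomic increment) and drops it (atomic decrement), so all
// threads hammer the same reference count.
std::size_t runWorkers(
    const std::shared_ptr<const SharedBase>& base,
    std::size_t numThreads,
    std::size_t batchesPerThread) {
  std::atomic<std::size_t> batchesDone{0};
  std::vector<std::thread> workers;
  workers.reserve(numThreads);
  for (std::size_t t = 0; t < numThreads; ++t) {
    workers.emplace_back([&base, &batchesDone, batchesPerThread] {
      // Per-thread evaluation state would be created here.
      for (std::size_t b = 0; b < batchesPerThread; ++b) {
        std::shared_ptr<const SharedBase> wrap = base;  // refcount +1
        if (!wrap->values.empty()) {
          batchesDone.fetch_add(1, std::memory_order_relaxed);
        }
      }  // refcount -1 each time wrap goes out of scope
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return batchesDone.load();
}
```

After all workers join, the base's reference count is back to what it was before the call, but every batch on every thread has touched the same counter twice.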

Production expression. Add DICT_DateFormatProd /
FLAT_DateFormatProd / MT_DICT_DateFormatProd that exercise the
exact shape from the slow query that motivated this work:

  date_format(CAST(date_trunc('day',
    date_add('day', 0 - mod(((day_of_week(c0) % 7) - 1) + 7, 7),
      c0)) AS timestamp), '%Y-%m-%d')

This is a DATE -> VARCHAR chain dominated by the date_format
result's stringBuffers_ refcount churn under evalWithMemo. Sweep
the same (rv, dvc) shapes as the alt entries; include both
single-base and _bpb sweeps so the cache-hit floor and the
realistic mix are both observable.
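
The date arithmetic inside the expression is easier to follow once the offset is isolated. The sketch below computes the number of days that date_add subtracts, assuming day_of_week returns ISO numbering (1 = Monday through 7 = Sunday), which is how Presto documents it; the function name is hypothetical.

```cpp
#include <cassert>

// Mirrors mod(((day_of_week(c0) % 7) - 1) + 7, 7) from the expression above.
// isoDayOfWeek: 1 = Monday ... 7 = Sunday (assumed day_of_week convention).
int daysBackToWeekStart(int isoDayOfWeek) {
  assert(isoDayOfWeek >= 1 && isoDayOfWeek <= 7);
  return (((isoDayOfWeek % 7) - 1) + 7) % 7;
}
```

Under that assumption a Monday maps to 0 and a Sunday to 6, so the expression as a whole formats the Monday of the week containing c0.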

The original ~430 entries (single-base sweep across 5 type pairs +
flat baselines + nullPcts) are preserved unchanged so existing
measurements remain comparable.

Built and ran a representative subset of new entries
(DICT_BigintToVarchar/DICT_DateFormatProd at bpb=10, and
MT_DICT_DateFormatProd at threads=4/16) at --bm_min_iters=1 to
confirm correct registration and execution.
@yingsu00 yingsu00 requested a review from xin-zhang2 June 23, 2026 03:38
@yingsu00 yingsu00 self-assigned this Jun 23, 2026
@yingsu00 yingsu00 requested a review from majetideepak as a code owner June 23, 2026 03:38
@yingsu00 yingsu00 added the bolt label Jun 23, 2026
@yingsu00 yingsu00 removed the request for review from majetideepak June 23, 2026 03:38