bench(expression): Add widening cast benchmarks #2171
Open
yingsu00 wants to merge 2 commits into
…ast evaluation
Adds a folly benchmark harness for CAST(...) over a stable dictionary
base, used to evaluate evalWithMemo improvements on real param
matrices without rebuilding test scaffolding ad-hoc.
The benchmark sweeps the dimensions that drive cast performance:
- numVectors          : input vectors evaluated per iteration
                        (fixed at 1000 in the matrix below;
                        not encoded in entry names)
- rowsPerVector       : rows in each input vector
- distinctValueCount  : dictionary base cardinality (DICT only)
- newIndicesPerVector : new dictionary indices each input vector
                        introduces relative to the previous one -
                        drives the memoization cache miss rate
                        (DICT only)
- nullPct             : percentage of positions marked null in
                        {0, 50, 100}. For DICT, nulls live on the
                        dictionary wrap (reach
                        PeeledEncoding::translateToInnerRows via
                        wrapNulls_); for FLAT, on the flat input
                        itself. nullPct=100 short-circuits the cast
                        to a null constant and surfaces the absolute
                        measurement floor.
For each cast pair the matrix registers 27 dictionary-input
combinations (rv x dvc x nipv) and 3 flat baselines (rv). Each
source line through the DICT_NULLS / FLAT_NULLS wrapper macros
expands into three benchmark entries - one per nullPct value - so
the source stays the same length but every existing combination has
three null variants registered.
Cast pairs covered:
- BIGINT -> VARCHAR
- INT -> BIGINT
- DATE -> VARCHAR
- DATE -> TIMESTAMP
- REAL -> DOUBLE
Entry names read:
DICT_<from>To<to>(rowsPerVector_distinctValueCount_newIndicesPerVector_nullPct)
FLAT_<from>To<to>(rowsPerVector_nullPct)
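
A minimal sketch of how one wrapper-macro line can expand into three named entries, one per nullPct value. It does not use folly: `registry()`, `dictEntryName`, `REGISTER_ENTRY`, and this definition of `DICT_NULLS` are hypothetical stand-ins chosen for illustration, not the macros in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a benchmark registry; the real harness
// registers its entries with folly instead.
inline std::vector<std::string>& registry() {
  static std::vector<std::string> entries;
  return entries;
}

// Builds DICT_<from>To<to>(rv_dvc_nipv_nullPct), following the entry-name
// format described above.
inline std::string dictEntryName(
    const std::string& castPair, int rv, int dvc, int nipv, int nullPct) {
  assert(nullPct >= 0 && nullPct <= 100);
  return "DICT_" + castPair + "(" + std::to_string(rv) + "_" +
      std::to_string(dvc) + "_" + std::to_string(nipv) + "_" +
      std::to_string(nullPct) + ")";
}

// Registers one entry for a single nullPct value.
#define REGISTER_ENTRY(pair, rv, dvc, nipv, nullPct) \
  registry().push_back(dictEntryName(#pair, rv, dvc, nipv, nullPct))

// Illustrative wrapper: one source line, three entries (nullPct 0, 50, 100).
#define DICT_NULLS(pair, rv, dvc, nipv)    \
  REGISTER_ENTRY(pair, rv, dvc, nipv, 0);  \
  REGISTER_ENTRY(pair, rv, dvc, nipv, 50); \
  REGISTER_ENTRY(pair, rv, dvc, nipv, 100)

// Registers the three null variants of one (rv, dvc, nipv) combination and
// returns how many entries the single macro line added.
inline std::size_t registerExample() {
  const std::size_t before = registry().size();
  DICT_NULLS(IntToBigint, 10000, 5, 1);
  return registry().size() - before;
}
```

Calling `registerExample()` adds three names that differ only in the trailing nullPct field, which is why the source matrix keeps its length while the registered set triples.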
Wires velox/expression/benchmarks into the build under the existing
VELOX_ENABLE_BENCHMARKS gate (it was never added before) and lands
the velox_benchmark_widening_cast target.
Sample at FLAT_IntToBigint shows how nullPct affects per-row time:
  FLAT_IntToBigint(100_0)     12.84 ns/row   no nulls
  FLAT_IntToBigint(100_50)    16.70 ns/row   half rows null
  FLAT_IntToBigint(100_100)    8.30 ns/row   all-nulls short-circuit
A short investigation was triggered by entries like
DICT_IntToBigint(10000_5_1_0) printing "0.00fs Infinity".
These are not measurement bugs. Folly's runBenchmarkGetNSPerIteration
subtracts a globally-measured baseline (cost of an empty BENCHMARK
loop, ~0.5 ns/iter) and floors the result at zero
(Benchmark.cpp:207). For configurations where the dictionary cache
covers the whole base in the first couple of batches - small dvc
relative to rowsPerVector, so the peelEncodings bypass kicks in for
~998 of 1000 batches - the per-row cast cost falls below the
baseline. "0.00fs" reads as "below folly's resolution after baseline
subtraction", not actual zero. The corresponding nullPct=100 entry
(which doesn't hit the bypass: translateToInnerRows returns an empty
inner-rows set, so dictionaryCache_ never populates) still measures
around 400-500 ps/iter, showing the resolution floor folly can
report. Added the explanation to the file's header comment so the
next reader doesn't have to re-derive it.
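
The subtract-and-floor behavior can be modeled in a few lines. This is a simplified sketch under the assumption that the reported value is just the raw measurement minus a global baseline, clamped at zero; it is not folly's code, and `reportedPicos` is a hypothetical name.

```cpp
#include <cstdint>

// Simplified model (an assumption, not folly's implementation): the reported
// time is the raw time minus a globally measured baseline, clamped at zero.
constexpr std::int64_t reportedPicos(
    std::int64_t rawPicos, std::int64_t baselinePicos) {
  return rawPicos > baselinePicos ? rawPicos - baselinePicos : 0;
}
```

With the ~0.5 ns baseline from the text, any raw cost at or below 500 ps reports as zero in this model, while a hypothetical raw cost of 900 ps would report as 400 ps.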
While here, also pin the cast result with folly::doNotOptimizeAway
before reading ->size(), so the compiler can't drop any of the
intermediate state from the cast call. The numbers didn't change -
the side effects on dictionaryCache_ already prevented DCE - but
the pin makes the intent explicit and is a defensible default for
any cast benchmark.
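
The pin-then-read order can be sketched without folly. `pinValue` below is a hypothetical stand-in for folly::doNotOptimizeAway built on the common "escape" idiom for GCC and Clang, and `widen` is a toy replacement for the cast under test; neither is taken from this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for folly::doNotOptimizeAway. On GCC/Clang the empty
// asm statement takes the object's address and clobbers memory, so the
// compiler must assume the object is observed. Other compilers fall back to
// a volatile read of its first byte.
template <typename T>
inline void pinValue(const T& value) {
#if defined(__GNUC__) || defined(__clang__)
  asm volatile("" : : "g"(&value) : "memory");
#else
  const volatile char* p = reinterpret_cast<const volatile char*>(&value);
  (void)*p;
#endif
}

// Toy stand-in for the cast under test: widens each int to long long.
inline std::vector<long long> widen(const std::vector<int>& in) {
  return std::vector<long long>(in.begin(), in.end());
}

// Pins the cast result first and only then reads its size, mirroring the
// order described above.
inline std::size_t castAndMeasure(const std::vector<int>& in) {
  std::vector<long long> result = widen(in);
  pinValue(result);
  return result.size();
}
```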
Verified: with the suspender excluding setup, runDictionary's cast
loop runs in ~500 us per 1000 batches (measured via direct
steady_clock during investigation), which is 50 ps/row for the
10000-row config. That's below folly's baseline (~500 ps), so the
display floors to 0.00fs.
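
The per-row figure follows from a unit conversion, checked here at compile time. `picosPerRow` is a hypothetical helper written for this check; the inputs are the numbers quoted above.

```cpp
#include <cstdint>

// Converts a loop time in microseconds into picoseconds per row
// (1 us = 1,000,000 ps).
constexpr std::int64_t picosPerRow(
    std::int64_t loopMicros, std::int64_t batches, std::int64_t rowsPerBatch) {
  return loopMicros * 1000000 / (batches * rowsPerBatch);
}

// ~500 us for 1000 batches of 10000 rows each.
static_assert(picosPerRow(500, 1000, 10000) == 50, "50 ps/row");
// 50 ps/row is well under the ~500 ps baseline.
static_assert(picosPerRow(500, 1000, 10000) < 500, "below baseline");
```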
… date_format to WideningCastBenchmark
Extends the existing single-base, single-thread benchmark with three
dimensions that the original was missing - each models a real
behavior production sees but the original could not measure.
Multi-base alternation. The original always reuses one dictionary
base FlatVector across all 1000 batches per iteration. That measures
only the steady-state cache-hit cost; it never exercises the
numMemoBaseChange path of evalWithMemo, which a real scan operator
hits every time its underlying storage chunk advances. Add a
batchesPerBase parameter and build ceil(numVectors /
batchesPerBase) distinct base FlatVectors with distinct content; the
i-th batch wraps base [i / batchesPerBase]. Sweep {1, 10, 100}
across the higher-signal shapes per type pair via DICT_ALT_*
macros. Entry names get a _bpb<n> suffix; entries without that
suffix retain the original single-base semantics (equivalent to
batchesPerBase = 1000, matching numVectors).
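
The base-selection arithmetic can be sketched as follows. The function names are hypothetical; only the formulas ceil(numVectors / batchesPerBase) and i / batchesPerBase come from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct bases: ceil(numVectors / batchesPerBase).
constexpr std::size_t numBases(
    std::size_t numVectors, std::size_t batchesPerBase) {
  return (numVectors + batchesPerBase - 1) / batchesPerBase;
}

// The i-th batch wraps base [i / batchesPerBase].
constexpr std::size_t baseIndexForBatch(
    std::size_t batch, std::size_t batchesPerBase) {
  return batch / batchesPerBase;
}

// Counts how often consecutive batches wrap different bases in one run.
inline std::size_t countBaseChanges(
    std::size_t numVectors, std::size_t batchesPerBase) {
  assert(batchesPerBase > 0);
  std::size_t changes = 0;
  for (std::size_t i = 1; i < numVectors; ++i) {
    if (baseIndexForBatch(i, batchesPerBase) !=
        baseIndexForBatch(i - 1, batchesPerBase)) {
      ++changes;
    }
  }
  return changes;
}
```

With numVectors = 1000, batchesPerBase = 1 gives 1000 bases and a base change on every batch after the first, while batchesPerBase = 1000 gives a single base and no changes, which is the original single-base behavior.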
Multi-thread. The production refcount cost on the source Buffer is
a cross-driver effect: when many threads each construct a
DictionaryVector that wraps the same file-cache-shared Buffer, every
thread issues 2 atomics per batch on one cache line, and that
line bounces between L1 caches. The original benchmark cannot
reproduce this. Add runDictionaryMultiThread that spawns N worker
threads, each with its own ExecCtx and ExprSet (because Expr
evaluation is not thread-safe) but all evaluating against the
shared base FlatVectors built on the calling thread. Entry name
format: `MT_<funcName>(threads<n>_<rv>_<dvc>_<nipv>_<nullPct>_bpb<n>)`.
Sweep numThreads in {4, 16} and batchesPerBase in {1, 100, 1000}
for the three type pairs and the production expression where the
cross-driver pattern matters in production (BigintToVarchar,
DateToVarchar, DateToTimestamp, DateFormatProd).
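
The threading shape (per-thread state, shared read-only base) can be sketched with the standard library alone. This is a loose analogy, not the PR's runDictionaryMultiThread: `WorkerState` stands in for the per-thread ExecCtx and ExprSet, and a `std::shared_ptr` copy stands in for wrapping the shared Buffer, since each copy and its destruction is an atomic refcount update on one shared control block.

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical per-thread state, standing in for the ExecCtx/ExprSet pair
// that each worker owns.
struct WorkerState {
  long long checksum = 0;
};

// Spawns numThreads workers that all read one base built on the calling
// thread. Each batch copies the shared_ptr (atomic increment) and drops the
// copy at the end of the batch (atomic decrement).
inline long long runSharedBase(std::size_t numThreads, std::size_t numBatches) {
  const auto base = std::make_shared<const std::vector<int>>(
      std::vector<int>{1, 2, 3, 4});
  std::vector<WorkerState> states(numThreads);
  std::vector<std::thread> workers;
  workers.reserve(numThreads);
  for (std::size_t t = 0; t < numThreads; ++t) {
    workers.emplace_back([&states, base, t, numBatches] {
      for (std::size_t b = 0; b < numBatches; ++b) {
        std::shared_ptr<const std::vector<int>> wrapped = base;
        for (int v : *wrapped) {
          states[t].checksum += v;  // each thread writes only its own slot
        }
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  long long total = 0;
  for (const auto& s : states) {
    total += s.checksum;
  }
  return total;
}
```

The checksum only confirms that every worker saw the same base; the contention of interest is the refcount traffic on the shared control block, which grows with the number of threads.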
Production expression. Add DICT_DateFormatProd /
FLAT_DateFormatProd / MT_DICT_DateFormatProd that exercise the
exact shape from the slow query that motivated this work:
date_format(CAST(date_trunc('day',
date_add('day', 0 - mod(((day_of_week(c0) % 7) - 1) + 7, 7),
c0)) AS timestamp), '%Y-%m-%d')
This is a DATE -> VARCHAR chain dominated by the date_format
result's stringBuffers_ refcount churn under evalWithMemo. Sweep
the same (rv, dvc) shapes as the alt entries; include both
single-base and _bpb sweeps so the cache-hit floor and the
realistic mix are both observable.
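
The inner date_add offset in the expression above can be checked in isolation. This sketch assumes day_of_week follows the ISO convention (1 = Monday through 7 = Sunday); `daysBackToWeekStart` is a hypothetical name for the subexpression mod(((day_of_week(c0) % 7) - 1) + 7, 7).

```cpp
// Assumes ISO day-of-week numbering: 1 = Monday ... 7 = Sunday.
// Returns the number of days the date_add call subtracts from c0.
constexpr int daysBackToWeekStart(int isoDayOfWeek) {
  return (((isoDayOfWeek % 7) - 1) + 7) % 7;
}
```

Under that assumption Monday maps to 0 and Sunday to 6, so the expression moves each date back to the Monday of its week before truncating and formatting it.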
The original ~430 entries (single-base sweep across 5 type pairs +
flat baselines + nullPcts) are preserved unchanged so existing
measurements remain comparable.
Built and ran a representative subset of new entries
(DICT_BigintToVarchar/DICT_DateFormatProd at bpb=10, and
MT_DICT_DateFormatProd at threads=4/16) at --bm_min_iters=1 to
confirm correct registration and execution.