Bench jemalloc metadata_thp (alloc tuning) #3

name: Bench jemalloc metadata_thp (alloc tuning)
# Manual benchmark to validate the baked `metadata_thp:auto` default on Linux,
# where Transparent Huge Pages exist (the effect is inert on macOS/Windows, so it
# cannot be measured there). Also runs the opt-in `percpu_arena:percpu` EXPERIMENT
# that stacks on top of metadata_thp — bake it only if this shows a clear, RSS-safe win.
#
# metadata_thp puts jemalloc's INTERNAL metadata (rtree / extent structures) on huge
# pages to cut TLB misses. That metadata grows with the number of live allocations, so
# the realistic NYC311 1M sample (neutral, ~modest heap) does not stress it. This bench
# therefore generates a SYNTHETIC HIGH-CARDINALITY dataset (two ~N-distinct columns) that
# deliberately drives qsv's cardinality/frequency hashmaps into the millions of entries —
# i.e. the exact metadata-pressure regime metadata_thp is meant to help. It is a stress
# test, not a realistic workload: a null result here is strong evidence the lever is inert
# for qsv; a win argues for keeping it on large/high-cardinality jobs.
#
# A single release binary is built; the variable under test is toggled purely via the
# `_RJEM_MALLOC_CONF` env override (which takes precedence over, and merges key-by-key
# with, the baked malloc_conf default). The jemalloc run-time levers (background_thread +
# dirty/muzzy decay) stay ON in every cell, so only metadata_thp / percpu_arena varies.
#
# cells (per command):
#   baseline : _RJEM_MALLOC_CONF=metadata_thp:disabled  (pre-PR behaviour)
#   thp      : baked default (metadata_thp:auto)        (this PR)
#   thp+pcpu : _RJEM_MALLOC_CONF=percpu_arena:percpu    (keeps baked thp; experiment)
#
# Context: jemalloc TUNING.md (metadata_thp / percpu_arena) | prior alloc tuning: PR #3948
#
# NOTE: GitHub-hosted runners are shared/noisy (no thermal pinning, noisy neighbours).
# The signal under test can be near the runner noise floor — read the +/- sigma and the
# peak-RSS table, not single means.
on:
  workflow_dispatch:
    inputs:
      rows:
        description: "synthetic high-cardinality rows (heap/metadata pressure)"
        required: false
        default: "8000000"
      runs:
        description: "hyperfine runs per cell"
        required: false
        default: "7"
      warmup:
        description: "hyperfine warmup runs"
        required: false
        default: "2"
permissions:
  contents: read
jobs:
  bench:
    name: metadata_thp A/B/C on ubuntu-latest
    runs-on: ubuntu-latest
    env:
      DATA: synth_highcard.csv
      HYPERFINE_VERSION: "1.18.0"
      THP_OFF: "metadata_thp:disabled"
      PERCPU: "percpu_arena:percpu"
      ROWS: ${{ github.event.inputs.rows }}
    steps:
      - uses: actions/checkout@v6
      - name: Install build & bench deps
        run: |
          sudo apt-get update
          # gawk is installed explicitly: the dataset step relies on gawk's seeded
          # srand()/rand(), and the default awk on Ubuntu may be mawk
          sudo apt-get install -y libwayland-dev gawk
          # hyperfine is not reliably in apt; pull the official .deb
          curl -fsSL -o /tmp/hyperfine.deb \
            "https://github.com/sharkdp/hyperfine/releases/download/v${HYPERFINE_VERSION}/hyperfine_${HYPERFINE_VERSION}_amd64.deb"
          sudo dpkg -i /tmp/hyperfine.deb
          hyperfine --version
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: stable
          targets: x86_64-unknown-linux-gnu
      - name: Setup Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          key: qsv-metadata-thp-bench
      - name: Show THP state + runner memory
        run: |
          echo "### Runner Transparent Huge Pages state" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "enabled: $(cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo unavailable)" | tee -a "$GITHUB_STEP_SUMMARY"
          echo "defrag: $(cat /sys/kernel/mm/transparent_hugepage/defrag 2>/dev/null || echo unavailable)" | tee -a "$GITHUB_STEP_SUMMARY"
          free -h | tee -a "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Generate synthetic high-cardinality dataset
        run: |
          # gawk srand(seed) is reproducible across runs. Columns: id (fully unique,
          # cardinality == ROWS), hi (~ROWS distinct), mid (~100k), lo (~100), plus two
          # numeric columns to exercise quartile/median value vectors. id+hi drive the
          # cardinality/frequency hashmaps into the millions of entries.
          gawk -v rows="$ROWS" 'BEGIN{
            srand(42);
            print "id,hi,mid,lo,num1,num2";
            for (i = 0; i < rows; i++) {
              printf "%d,%d,%d,%d,%d,%.4f\n", \
                i, int(rand()*2000000000), int(rand()*100000), int(rand()*100), \
                int(rand()*1000000), rand()*1000;
            }
          }' > "$DATA"
          ls -lh "$DATA"
          echo "rows: $(($(wc -l < "$DATA") - 1)) cols: $(head -1 "$DATA" | tr ',' '\n' | wc -l)"
      - name: Build release binary
        run: |
          cargo build --release --bin qsv -F feature_capable
          cp target/release/qsv /tmp/qsv
      - name: Sanity — baked metadata_thp default + env toggle
        run: |
          echo "### Binary sanity" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "default: $(/tmp/qsv --version)" | tee -a "$GITHUB_STEP_SUMMARY"
          # baked metadata_thp:auto must surface as +thp on Linux
          if /tmp/qsv --version | grep -q '+thp'; then
            echo "baked metadata_thp:auto ACTIVE (+thp) ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          else
            echo "::error::baked metadata_thp:auto NOT detected (+thp missing) — symbol wiring broken"
            echo '```' >> "$GITHUB_STEP_SUMMARY"; exit 1
          fi
          # env override must win and drop the marker
          if env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv --version | grep -q '+thp'; then
            echo "::error::_RJEM_MALLOC_CONF override did NOT disable metadata_thp — precedence broken"
            echo '```' >> "$GITHUB_STEP_SUMMARY"; exit 1
          else
            echo "_RJEM_MALLOC_CONF=metadata_thp:disabled override works ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          fi
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Confirm cardinality (metadata-pressure regime)
        run: |
          echo "### Column cardinality (drives metadata pressure)" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          /tmp/qsv stats --cardinality "$DATA" | /tmp/qsv select field,cardinality - | tee -a "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Index dataset (parallel path)
        run: /tmp/qsv index "$DATA"
      - name: Parity check — stats output must be byte-identical across cells
        run: |
          env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv stats -E -c 0 "$DATA" > s_base.csv
          /tmp/qsv stats -E -c 0 "$DATA" > s_thp.csv
          env _RJEM_MALLOC_CONF="$PERCPU" /tmp/qsv stats -E -c 0 "$DATA" > s_pcpu.csv
          if cmp s_base.csv s_thp.csv && cmp s_base.csv s_pcpu.csv; then
            echo "stats -E byte-identical across cells ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          else
            echo "::error::stats -E output DIFFERS across allocator cells — correctness bug, not a perf question"
            exit 1
          fi
      - name: Benchmark — frequency (A/B/C)
        env:
          # pass workflow inputs through env instead of interpolating them into the script
          WARMUP: ${{ github.event.inputs.warmup }}
          RUNS: ${{ github.event.inputs.runs }}
        run: |
          mkdir -p bench-results
          hyperfine --warmup "$WARMUP" --runs "$RUNS" -N \
            --export-markdown bench-results/frequency.md \
            --export-json bench-results/frequency.json \
            -n "baseline" "env _RJEM_MALLOC_CONF=$THP_OFF /tmp/qsv frequency $DATA" \
            -n "thp" "/tmp/qsv frequency $DATA" \
            -n "thp+pcpu" "env _RJEM_MALLOC_CONF=$PERCPU /tmp/qsv frequency $DATA"
      - name: Benchmark — stats -E (A/B/C)
        env:
          # pass workflow inputs through env instead of interpolating them into the script
          WARMUP: ${{ github.event.inputs.warmup }}
          RUNS: ${{ github.event.inputs.runs }}
        run: |
          # do not rely on an earlier step having created the output directory
          mkdir -p bench-results
          hyperfine --warmup "$WARMUP" --runs "$RUNS" -N \
            --export-markdown bench-results/stats-E.md \
            --export-json bench-results/stats-E.json \
            -n "baseline" "env _RJEM_MALLOC_CONF=$THP_OFF /tmp/qsv stats -E -c 0 $DATA" \
            -n "thp" "/tmp/qsv stats -E -c 0 $DATA" \
            -n "thp+pcpu" "env _RJEM_MALLOC_CONF=$PERCPU /tmp/qsv stats -E -c 0 $DATA"
      - name: Peak RSS & CPU time (metadata_thp tradeoff)
        run: |
          # metadata_thp trades a small metadata-memory increase for fewer TLB misses.
          # Capture peak RSS + user/sys time SYNCHRONOUSLY (write to a file, then parse) —
          # the previous process-substitution approach raced and lost its output.
          measure () {
            local label="$1"; shift
            /usr/bin/time -v "$@" >/dev/null 2>/tmp/time.txt || true
            local rss user sys
            rss=$(awk -F': ' '/Maximum resident set size/{print $2}' /tmp/time.txt)
            user=$(awk -F': ' '/User time/{print $2}' /tmp/time.txt)
            sys=$(awk -F': ' '/System time/{print $2}' /tmp/time.txt)
            printf '%-10s peakRSS=%8s KB user=%6ss sys=%6ss\n' "$label" "$rss" "$user" "$sys" \
              | tee -a "$GITHUB_STEP_SUMMARY"
          }
          echo "### stats -E -c 0 — peak RSS & CPU time (single shot each)" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          measure "baseline" env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv stats -E -c 0 "$DATA"
          measure "thp" /tmp/qsv stats -E -c 0 "$DATA"
          measure "thp+pcpu" env _RJEM_MALLOC_CONF="$PERCPU" /tmp/qsv stats -E -c 0 "$DATA"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Render results to job summary
        if: always()
        run: |
          # This step runs even after a failure, so the result files may be missing;
          # fall back to a placeholder instead of failing on cat.
          {
            echo "## frequency (indexed, parallel path)"
            cat bench-results/frequency.md 2>/dev/null || echo "_frequency results unavailable (benchmark step did not complete)_"
            echo ""
            echo "## stats -E (-c 0)"
            cat bench-results/stats-E.md 2>/dev/null || echo "_stats -E results unavailable (benchmark step did not complete)_"
            echo ""
            echo "**Primary comparison:** \`thp\` vs \`baseline\` on a deliberately high-cardinality"
            echo "workload. Faster (or tied within noise) with an acceptable peak-RSS delta validates"
            echo "the baked metadata_thp:auto default; a null result here (where metadata pressure is"
            echo "maximal) is strong evidence the lever is inert for qsv."
            echo "**Experiment:** \`thp+pcpu\` vs \`thp\` — only bake percpu_arena (behind an"
            echo "off-by-default feature) if it shows a clear, repeatable, RSS-safe win."
          } >> "$GITHUB_STEP_SUMMARY"
      - name: Upload raw results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: bench-metadata-thp-${{ github.run_id }}
          path: bench-results/