Bench jemalloc metadata_thp (alloc tuning) #4
name: Bench jemalloc metadata_thp (alloc tuning)
# Manual benchmark to validate the baked `metadata_thp:auto` default on Linux,
# where Transparent Huge Pages exist (the effect is inert on macOS/Windows, so it
# cannot be measured there). Also runs the opt-in `percpu_arena:percpu` EXPERIMENT
# that stacks on top of metadata_thp; bake it only if this shows a clear, RSS-safe win.
#
# metadata_thp puts jemalloc's INTERNAL metadata (rtree / extent structures) on huge
# pages to cut TLB misses. That metadata grows with the number of live allocations, so
# the realistic NYC311 1M sample (neutral, modest heap) does not stress it. This bench
# therefore generates a SYNTHETIC HIGH-CARDINALITY dataset (two ~N-distinct columns) that
# deliberately drives qsv's cardinality/frequency hashmaps into the millions of entries,
# i.e. the exact metadata-pressure regime metadata_thp is meant to help. It is a stress
# test, not a realistic workload: a null result here is strong evidence the lever is inert
# for qsv; a win argues for keeping it on large/high-cardinality jobs.
#
# A single release binary is built; the variable under test is toggled purely via the
# `_RJEM_MALLOC_CONF` env override (which takes precedence over, and merges key-by-key
# with, the baked malloc_conf default). The jemalloc run-time levers (background_thread +
# dirty/muzzy decay) stay ON in every cell, so only metadata_thp / percpu_arena varies.
#
# cells (per command):
#   baseline  : _RJEM_MALLOC_CONF=metadata_thp:disabled  (pre-PR behaviour)
#   thp       : baked default (metadata_thp:auto)        (this PR)
#   thp+pcpu  : _RJEM_MALLOC_CONF=percpu_arena:percpu    (keeps baked thp; experiment)
#
# Context: jemalloc TUNING.md (metadata_thp / percpu_arena) | prior alloc tuning: PR #3948
#
# NOTE: GitHub-hosted runners are shared/noisy (no thermal pinning, noisy neighbours).
# The signal under test can be near the runner noise floor, so read the +/- sigma and the
# peak-RSS table, not single means.
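#
# Dispatch sketch (an illustration, not part of the original file). It assumes the GitHub
# CLI (`gh`) is installed and authenticated; <workflow-file>.yml is a placeholder because
# this file's name is not stated here:
#   gh workflow run <workflow-file>.yml -f rows=8000000 -f runs=7 -f warmup=2
# Inputs that are omitted fall back to the defaults declared under
# `on.workflow_dispatch.inputs` below.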
on:
  workflow_dispatch:
    inputs:
      rows:
        description: "synthetic high-cardinality rows (heap/metadata pressure)"
        required: false
        default: "8000000"
      runs:
        description: "hyperfine runs per cell"
        required: false
        default: "7"
      warmup:
        description: "hyperfine warmup runs"
        required: false
        default: "2"
permissions:
  contents: read
jobs:
  bench:
    name: metadata_thp A/B/C on ubuntu-latest
    runs-on: ubuntu-latest
    env:
      DATA: synth_highcard.csv
      HYPERFINE_VERSION: "1.18.0"
      THP_OFF: "metadata_thp:disabled"
      PERCPU: "percpu_arena:percpu"
      ROWS: ${{ github.event.inputs.rows }}
    steps:
      - uses: actions/checkout@v6
      - name: Install build & bench deps
        run: |
          sudo apt-get update
          sudo apt-get install -y libwayland-dev
          # hyperfine is not reliably in apt; pull the official .deb
          curl -fsSL -o /tmp/hyperfine.deb \
            "https://github.com/sharkdp/hyperfine/releases/download/v${HYPERFINE_VERSION}/hyperfine_${HYPERFINE_VERSION}_amd64.deb"
          sudo dpkg -i /tmp/hyperfine.deb
          hyperfine --version
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: stable
          targets: x86_64-unknown-linux-gnu
      - name: Setup Rust cache
        uses: Swatinem/rust-cache@v2
        with:
          key: qsv-metadata-thp-bench
      - name: Show THP state + runner memory
        run: |
          echo "### Runner Transparent Huge Pages state" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "enabled: $(cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo unavailable)" | tee -a "$GITHUB_STEP_SUMMARY"
          echo "defrag:  $(cat /sys/kernel/mm/transparent_hugepage/defrag 2>/dev/null || echo unavailable)" | tee -a "$GITHUB_STEP_SUMMARY"
          free -h | tee -a "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Generate synthetic high-cardinality dataset
        run: |
          # gawk srand(seed) is reproducible across runs. Columns: id (fully unique,
          # cardinality == ROWS), hi (~ROWS distinct), mid (~100k), lo (~100), plus two
          # numeric columns to exercise quartile/median value vectors. id+hi drive the
          # cardinality/frequency hashmaps into the millions of entries.
          gawk -v rows="$ROWS" 'BEGIN{
            srand(42);
            print "id,hi,mid,lo,num1,num2";
            for (i = 0; i < rows; i++) {
              printf "%d,%d,%d,%d,%d,%.4f\n", \
                i, int(rand()*2000000000), int(rand()*100000), int(rand()*100), \
                int(rand()*1000000), rand()*1000;
            }
          }' > "$DATA"
          ls -lh "$DATA"
          echo "rows: $(($(wc -l < "$DATA") - 1))  cols: $(head -1 "$DATA" | tr ',' '\n' | wc -l)"
      - name: Build release binary
        run: |
          cargo build --release --bin qsv -F feature_capable
          cp target/release/qsv /tmp/qsv
      - name: Sanity - baked metadata_thp default + env toggle
        run: |
          echo "### Binary sanity" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          echo "default: $(/tmp/qsv --version)" | tee -a "$GITHUB_STEP_SUMMARY"
          # baked metadata_thp:auto must surface as +thp on Linux
          if /tmp/qsv --version | grep -q '+thp'; then
            echo "baked metadata_thp:auto ACTIVE (+thp) ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          else
            echo "::error::baked metadata_thp:auto NOT detected (+thp missing): symbol wiring broken"
            echo '```' >> "$GITHUB_STEP_SUMMARY"; exit 1
          fi
          # env override must win and drop the marker
          if env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv --version | grep -q '+thp'; then
            echo "::error::_RJEM_MALLOC_CONF override did NOT disable metadata_thp: precedence broken"
            echo '```' >> "$GITHUB_STEP_SUMMARY"; exit 1
          else
            echo "_RJEM_MALLOC_CONF=metadata_thp:disabled override works ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          fi
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Confirm cardinality (metadata-pressure regime)
        run: |
          echo "### Column cardinality (drives metadata pressure)" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          /tmp/qsv stats --cardinality "$DATA" | /tmp/qsv select field,cardinality - | tee -a "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
      - name: Index dataset (parallel path)
        run: /tmp/qsv index "$DATA"
      - name: Parity check - frequency must be byte-identical across cells
        run: |
          # Use `frequency` (deterministic integer counts), NOT `stats -E`: extended stats
          # include float aggregates whose parallel reduction order is non-deterministic
          # run-to-run INDEPENDENT of the allocator, so a byte-exact `cmp` would false-fail.
          # The allocator can never change computed values, so frequency parity is the
          # correct invariant to assert here.
          env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv frequency "$DATA" > f_base.csv
          /tmp/qsv frequency "$DATA" > f_thp.csv
          env _RJEM_MALLOC_CONF="$PERCPU" /tmp/qsv frequency "$DATA" > f_pcpu.csv
          if cmp f_base.csv f_thp.csv && cmp f_base.csv f_pcpu.csv; then
            echo "frequency byte-identical across cells ✓" | tee -a "$GITHUB_STEP_SUMMARY"
          else
            echo "::error::frequency output DIFFERS across allocator cells: correctness bug, not a perf question"
            exit 1
          fi
      - name: Benchmark - frequency (A/B/C)
        env:
          # Pass dispatch inputs through the environment instead of interpolating
          # them into the script text.
          WARMUP: ${{ github.event.inputs.warmup }}
          RUNS: ${{ github.event.inputs.runs }}
        run: |
          mkdir -p bench-results
          hyperfine --warmup "$WARMUP" --runs "$RUNS" -N \
            --export-markdown bench-results/frequency.md \
            --export-json bench-results/frequency.json \
            -n "baseline" "env _RJEM_MALLOC_CONF=$THP_OFF /tmp/qsv frequency $DATA" \
            -n "thp" "/tmp/qsv frequency $DATA" \
            -n "thp+pcpu" "env _RJEM_MALLOC_CONF=$PERCPU /tmp/qsv frequency $DATA"
      - name: Benchmark - stats -E (A/B/C)
        env:
          WARMUP: ${{ github.event.inputs.warmup }}
          RUNS: ${{ github.event.inputs.runs }}
        run: |
          hyperfine --warmup "$WARMUP" --runs "$RUNS" -N \
            --export-markdown bench-results/stats-E.md \
            --export-json bench-results/stats-E.json \
            -n "baseline" "env _RJEM_MALLOC_CONF=$THP_OFF /tmp/qsv stats -E -c 0 $DATA" \
            -n "thp" "/tmp/qsv stats -E -c 0 $DATA" \
            -n "thp+pcpu" "env _RJEM_MALLOC_CONF=$PERCPU /tmp/qsv stats -E -c 0 $DATA"
      - name: Peak RSS & CPU time (metadata_thp tradeoff)
        run: |
          # metadata_thp trades a small metadata-memory increase for fewer TLB misses.
          # Capture peak RSS + user/sys time SYNCHRONOUSLY (write to a file, then parse);
          # the previous process-substitution approach raced and lost its output.
          measure () {
            local label="$1"; shift
            /usr/bin/time -v "$@" >/dev/null 2>/tmp/time.txt || true
            local rss user sys
            rss=$(awk -F': ' '/Maximum resident set size/{print $2}' /tmp/time.txt)
            user=$(awk -F': ' '/User time/{print $2}' /tmp/time.txt)
            sys=$(awk -F': ' '/System time/{print $2}' /tmp/time.txt)
            printf '%-10s peakRSS=%8s KB  user=%6ss  sys=%6ss\n' "$label" "$rss" "$user" "$sys" \
              | tee -a "$GITHUB_STEP_SUMMARY"
          }
          echo "### stats -E -c 0: peak RSS & CPU time (single shot each)" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
          measure "baseline" env _RJEM_MALLOC_CONF="$THP_OFF" /tmp/qsv stats -E -c 0 "$DATA"
          measure "thp" /tmp/qsv stats -E -c 0 "$DATA"
          measure "thp+pcpu" env _RJEM_MALLOC_CONF="$PERCPU" /tmp/qsv stats -E -c 0 "$DATA"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
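      # Optional sketch, not part of the original workflow: the peak-RSS table is a single
      # shot per cell, so one outlier can skew it. Assuming GNU time lives at /usr/bin/time
      # (its -f '%M' format prints peak RSS in KB; -a -o appends to a file), this repeats
      # the `thp` cell and reports the median. The step name, REPS, and /tmp/rss.txt are
      # illustrative names, not taken from the project.
      # - name: Peak RSS, median of repeats (sketch)
      #   env:
      #     REPS: "5"
      #   run: |
      #     rm -f /tmp/rss.txt
      #     for _ in $(seq "$REPS"); do
      #       /usr/bin/time -f '%M' -a -o /tmp/rss.txt /tmp/qsv stats -E -c 0 "$DATA" >/dev/null
      #     done
      #     sort -n /tmp/rss.txt \
      #       | awk '{v[NR]=$1} END{print "thp median peakRSS KB:", v[int((NR+1)/2)]}'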
      - name: Render results to job summary
        if: always()
        run: |
          {
            echo "## frequency (indexed, parallel path)"
            # This step runs even after a failure, so tolerate missing result files.
            cat bench-results/frequency.md 2>/dev/null || echo "_no frequency results_"
            echo ""
            echo "## stats -E (-c 0)"
            cat bench-results/stats-E.md 2>/dev/null || echo "_no stats -E results_"
            echo ""
            echo "**Primary comparison:** \`thp\` vs \`baseline\` on a deliberately high-cardinality"
            echo "workload. Faster (or tied within noise) with acceptable peak-RSS delta validates"
            echo "the baked metadata_thp:auto default; a null result here (where metadata pressure is"
            echo "maximal) is strong evidence the lever is inert for qsv."
            echo ""
            echo "**Experiment:** \`thp+pcpu\` vs \`thp\`: only bake percpu_arena (behind an"
            echo "off-by-default feature) if it shows a clear, repeatable, RSS-safe win."
          } >> "$GITHUB_STEP_SUMMARY"
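      # Optional sketch, not part of the original workflow: print mean-time ratios from the
      # hyperfine JSON exports so the two comparisons are visible at a glance. It assumes
      # `jq` is available on the runner and that hyperfine lists results in the order the
      # commands were given (baseline, thp, thp+pcpu). The step name is illustrative.
      # A ratio inside the +/- sigma shown in the tables is noise, not a signal.
      # - name: Mean-time ratios (sketch)
      #   if: always()
      #   run: |
      #     for f in bench-results/frequency.json bench-results/stats-E.json; do
      #       [ -f "$f" ] || continue
      #       jq -r --arg f "$f" \
      #         '.results | "\($f): thp/baseline=\(.[1].mean / .[0].mean) pcpu/thp=\(.[2].mean / .[1].mean)"' \
      #         "$f" | tee -a "$GITHUB_STEP_SUMMARY"
      #     done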
      - name: Upload raw results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: bench-metadata-thp-${{ github.run_id }}
          path: bench-results/