WebUI ships a layered benchmark suite for measuring SSR rendering performance. Each layer answers a different question, so a thorough performance investigation runs multiple benches before and after a change and compares the results.
This document is the reference for what to run, when to run it, and how to compare results.
| Bench | Layer | Wall time | What it measures | Use when |
|---|---|---|---|---|
| cargo xtask bench all | criterion micro | ~5 min | per-fn wall-clock for parser, handler, protocol, expressions, state, webui (incl. streaming + contact-book) | full snapshot of every micro-bench |
| cargo xtask bench streaming | criterion micro | ~60 s | writer-path wall-clock + first-chunk TTFB | inner-loop iteration on the streaming module |
| cargo xtask bench contact-book | criterion micro | ~90 s | end-to-end render at 10/100/1000 contacts | inner-loop iteration on handler/state/expressions |
| cargo xtask bench streaming-resource | example | ~30 s | exact alloc count + bytes + getrusage CPU + RSS | proving zero-alloc claims; allocation regression hunting |
| cargo xtask bench streaming-e2e-ttfb | example | ~10 s | HTTP-level TTFB / TTLB through actix | confirming wire-level streaming win |
| cargo xtask bench streaming-browser | Playwright | ~30 s | real Chromium TTFB / FCP / LCP / DCL / load | proving user-perceived paint improvement |
| cargo xtask bench full (= streaming-all) | suite | ~3 min | runs all four streaming-related benches in sequence | full streaming evidence pack for a PR |
All benches support named baselines. The flag pattern is identical across criterion, example, and Playwright benches:
# 1. Snapshot current numbers as 'before'
cargo xtask bench full --save-baseline before
# 2. Make your change …
# 3. Compare against 'before'
cargo xtask bench full --baseline before
Baselines are stored at target/bench-baselines/:
- streaming-resource-<name>.json: alloc + RSS + CPU table
- e2e-ttfb-<name>.json: HTTP TTFB/TTLB table
- browser-<name>.json: browser metrics table
- target/criterion/<bench>/<name>: criterion's native baseline directory tree
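The naming scheme above can be sketched as a small path builder. This is a minimal illustration only: `BenchKind` and `baseline_path` are hypothetical names, not the xtask's actual code, and the sketch covers just the three JSON snapshots (criterion manages its own directory tree).

```rust
use std::path::PathBuf;

/// The three non-criterion bench kinds that write JSON baselines.
/// Hypothetical type, introduced only for this sketch.
#[derive(Clone, Copy, Debug)]
pub enum BenchKind {
    StreamingResource,
    E2eTtfb,
    Browser,
}

/// Builds the snapshot path for a named baseline, following the
/// `<prefix>-<name>.json` scheme listed above.
pub fn baseline_path(kind: BenchKind, name: &str) -> PathBuf {
    let prefix = match kind {
        BenchKind::StreamingResource => "streaming-resource",
        BenchKind::E2eTtfb => "e2e-ttfb",
        BenchKind::Browser => "browser",
    };
    PathBuf::from("target/bench-baselines").join(format!("{prefix}-{name}.json"))
}
```

For example, `baseline_path(BenchKind::Browser, "before")` yields the path that `--save-baseline before` would be expected to write for the browser bench under this scheme.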
The compare phase prints a Δ%-table for every row. Negative Δ% = improvement; positive = regression.
| Source | Treat as noise | Treat as signal |
|---|---|---|
| criterion (well-isolated wall-clock) | < ±2% | > ±5% |
| streaming-resource (alloc count) | exact — any change matters | any non-zero |
| streaming-resource (bytes, CPU) | < ±2% | > ±5% |
| streaming-e2e-ttfb (loopback) | < ±10% | > ±20% |
| streaming-browser (real Chromium) | < ±5% | > ±15% |
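To make the thresholds concrete, here is a minimal sketch of how a Δ% value could be computed and bucketed. The formula `(after - before) / before * 100` is an assumption that matches the "negative = improvement" convention for lower-is-better metrics; the `Inconclusive` band between the two thresholds is also an assumption, since the table does not name it. `delta_pct`, `classify`, and `Verdict` are hypothetical names.

```rust
/// Outcome of comparing one row against its baseline.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Noise,
    /// Between the noise and signal thresholds (assumed band; re-run to decide).
    Inconclusive,
    Signal,
}

/// Relative change in percent. Negative means the metric went down.
/// Assumes `before` is non-zero.
pub fn delta_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Buckets a Δ% value using a pair of thresholds from the table,
/// e.g. (2.0, 5.0) for criterion or (10.0, 20.0) for streaming-e2e-ttfb.
pub fn classify(delta_pct: f64, noise_below: f64, signal_above: f64) -> Verdict {
    let magnitude = delta_pct.abs();
    if magnitude < noise_below {
        Verdict::Noise
    } else if magnitude > signal_above {
        Verdict::Signal
    } else {
        Verdict::Inconclusive
    }
}
```

Allocation counts are compared exactly rather than by percentage, so they would not go through a thresholded check like this one.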
Standard criterion harnesses. Each crate has its own benches/ dir:
- crates/webui-parser/benches/parser_bench.rs
- crates/webui-protocol/benches/protocol_bench.rs
- crates/webui-handler/benches/handler_bench.rs
- crates/webui-expressions/benches/expressions_bench.rs
- crates/webui-state/benches/state_bench.rs
- crates/webui/benches/contact_book_bench.rs: end-to-end render
- crates/webui/benches/streaming_bench.rs: writer-path wall-clock + TTFB
These integrate with criterion's HTML reports
(target/criterion/report/index.html) and native baseline support
(--save-baseline NAME / --baseline NAME). cargo xtask bench
passes those flags through so you don't need to remember cargo bench invocation details.
crates/webui/examples/streaming_resource_bench.rs installs a custom
GlobalAlloc that exact-counts every alloc/realloc call. Why an
example, not a criterion bench? Criterion's harness allocates during
its sampling loop, which would pollute a counting allocator. Examples
run a clean process where every alloc we observe came from the code
under test (or its dependencies).
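The counting-allocator idea can be sketched with the standard library alone. This is a minimal illustration of the technique, not the contents of streaming_resource_bench.rs: `CountingAlloc`, `count_allocs`, and the two counters are hypothetical names, and the real bench may track more (or different) events.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counters. Atomics never allocate, so updating them
// inside the allocator is safe.
static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and counts every alloc/realloc call.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }

    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(new_size, Ordering::Relaxed);
        unsafe { System.realloc(ptr, layout, new_size) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns (alloc calls, bytes requested) observed meanwhile.
/// The counters are process-wide, so other threads' allocations are included.
pub fn count_allocs<F: FnOnce()>(f: F) -> (usize, usize) {
    let calls_before = ALLOC_CALLS.load(Ordering::Relaxed);
    let bytes_before = ALLOC_BYTES.load(Ordering::Relaxed);
    f();
    (
        ALLOC_CALLS.load(Ordering::Relaxed) - calls_before,
        ALLOC_BYTES.load(Ordering::Relaxed) - bytes_before,
    )
}
```

Because the counters are global, anything else that allocates in the same process shows up in the totals, which is the reason given above for running this as a standalone example rather than inside criterion's harness.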
Reports per (path × scale):
- allocs/run: exact count from the custom allocator
- bytes/run: exact bytes requested from the allocator
- wall µs/run: Instant::elapsed() per iteration
- user µs/run: getrusage(RUSAGE_SELF).ru_utime delta
- process RSS: ru_maxrss high-water mark
This is the only bench in the suite that gives you exact allocation numbers. Use it to verify "zero per-write allocation" claims and to detect allocation-pressure regressions.
crates/webui/examples/streaming_e2e_ttfb_bench.rs spawns a real
actix-web server with /buf and /stream endpoints, then drives
both with the awc HTTP client. Reports min/p50/p99 for both TTFB
(time to first byte) and TTLB (time to last byte) at four
render-cost scenarios.
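As an illustration of the min/p50/p99 summary, here is a minimal sketch that reduces a set of latency samples using the nearest-rank percentile method. The choice of nearest-rank is an assumption (the bench may interpolate or use another definition), and `Summary`, `summarize`, and `percentile` are hypothetical names.

```rust
use std::time::Duration;

/// min / p50 / p99 of one set of latency samples.
pub struct Summary {
    pub min: Duration,
    pub p50: Duration,
    pub p99: Duration,
}

/// Nearest-rank percentile over a non-empty, ascending-sorted slice.
/// Integer arithmetic avoids floating-point rounding at the boundaries.
fn percentile(sorted: &[Duration], pct: usize) -> Duration {
    let n = sorted.len();
    let rank = (pct * n + 99) / 100; // ceil(pct * n / 100)
    sorted[rank.clamp(1, n) - 1]
}

/// Returns None when there are no samples to summarize.
pub fn summarize(samples: &[Duration]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    Some(Summary {
        min: sorted[0],
        p50: percentile(&sorted, 50),
        p99: percentile(&sorted, 99),
    })
}
```

With few samples, p99 collapses toward the maximum, so a tail percentile only carries meaning when each scenario is run many times.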
Faster than the browser bench (~10 s vs ~30 s) and doesn't need Chromium installed. Use it as the smoke check before paying for the full browser bench.
examples/integration/streaming-browser-bench/ is a separate package
with its own actix server and a Playwright spec that drives Chromium
through PerformanceObserver. Reports the only browser-perceived
metrics in the suite:
- TTFB: responseStart - requestStart from PerformanceNavigationTiming
- FCP: first-contentful-paint from PerformanceObserver
- LCP: largest-contentful-paint from PerformanceObserver
- DCL: domContentLoadedEventEnd - startTime
- load: loadEventEnd - startTime
This is the bench that answers "does streaming actually help users see the page faster?" The HTTP-level benches prove the bytes get to the wire faster; only this one proves Chrome paints faster.
The spec also asserts a hard regression check: at the 100 ms render scenario, streaming TTFB must be ≥5× lower than buffered TTFB. If that ever fails, something is fundamentally wrong with the implementation.
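The ≥5× condition can be written down as a single comparison. The sketch below is in Rust for consistency with the rest of the repository, although the actual assertion lives in the Playwright spec; `check_ttfb_speedup` is a hypothetical name and the error message is illustrative.

```rust
use std::time::Duration;

/// Passes when streaming TTFB is at least `min_factor` times lower than
/// buffered TTFB, i.e. streaming * min_factor <= buffered.
pub fn check_ttfb_speedup(
    buffered: Duration,
    streaming: Duration,
    min_factor: u32,
) -> Result<(), String> {
    if streaming.saturating_mul(min_factor) <= buffered {
        Ok(())
    } else {
        Err(format!(
            "streaming TTFB {streaming:?} is not {min_factor}x lower than buffered TTFB {buffered:?}"
        ))
    }
}
```

Multiplying the smaller value instead of dividing the larger one keeps the check in integer arithmetic and avoids a division by zero when the streaming TTFB rounds down to nothing.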
For any change touching crates/webui/src/streaming.rs or its
callers:
# 1. Establish baseline on the unmodified code
cargo xtask bench full --save-baseline before
# 2. Make your change
# 3. Compare
cargo xtask bench full --baseline before
# 4. Paste the four Δ%-tables into the PR description
For changes touching the handler / parser / state / protocol / expressions crates:
cargo xtask bench all --save-baseline before
# … change …
cargo xtask bench all --baseline before
The criterion --baseline flag emits the per-bench change: lines
inline (e.g. Performance has improved / regressed / within noise threshold).
- Stdout: every bench prints a human-readable table.
- JSON snapshots: non-criterion benches write to target/bench-baselines/.
- Criterion HTML: target/criterion/report/index.html for full PDF/CDF plots and per-baseline violin plots.
Each layer measures a different thing. A change can:
- improve allocation count but regress wall-clock (allocator changes)
- improve micro-bench wall-clock but regress browser FCP (chunk-size changes that hurt parser progressive rendering)
- improve TTFB but introduce a memory leak (no cleanup of pool buffers on error paths)
Running the full suite catches all of these. Running just one layer catches one third of them.
- Close other applications: CPU-intensive background work adds noise.
- Plug in to power (laptops): battery savers throttle the CPU.
- Pin to release builds: cargo bench and cargo xtask bench always use release; debug builds are not representative.
- Run on the same machine: cross-machine baselines are not meaningful.
- Compare medians (P50), not means: robust against thermal spikes.
- Re-run if Dev% > 15% in any criterion row.
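The last two tips can be illustrated with a short sketch showing why a single thermal spike moves the mean but not the median, and how a relative deviation figure could be computed. Reading Dev% as sample standard deviation divided by the mean is an assumption here, and all three function names are hypothetical.

```rust
/// Arithmetic mean. Assumes `xs` is non-empty.
pub fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Median (P50). Assumes `xs` is non-empty.
pub fn median(xs: &[f64]) -> f64 {
    let mut sorted = xs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Sample standard deviation as a percentage of the mean.
/// Assumes at least two samples and a non-zero mean.
pub fn rel_std_dev_pct(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let variance = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64;
    variance.sqrt() / m * 100.0
}
```

For samples of 10, 10, 10, 10 and one 50 µs spike, the median stays at 10 while the mean rises to 18, and the relative deviation lands far above the 15% re-run threshold.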
If you add a new performance-sensitive feature, also add a benchmark. The bar:
- Criterion if the unit-of-work is a single function call. Add a [[bench]] entry to the relevant crate's Cargo.toml.
- Example with --save NAME / --compare NAME if you need process-wide measurement (custom allocator, getrusage, an HTTP server, etc.). Mirror the structure of streaming_resource_bench.rs.
- Playwright if the metric is browser-perceived (paint, layout, hydration time). Mirror the structure of examples/integration/streaming-browser-bench/.
Wire it into cargo xtask bench so the standard before/after
workflow works without users needing to know per-bench invocation
details.