Merged
25 changes: 25 additions & 0 deletions Cargo.toml
@@ -174,6 +174,31 @@ name = "bench_blob_manifest"
path = "examples/bench_blob_manifest.rs"
required-features = ["bench"]

# Blob download read-ahead benchmark — sweeps the local backend's chunk prefetch
# depth (read_prefetch / buffered(N)) under disk-bound vs network-bound consumers
# and warm vs cold page cache. No Postgres needed.
[[example]]
name = "bench_blob_prefetch"
path = "examples/bench_blob_prefetch.rs"
required-features = ["bench"]

# Tokio runtime tuning benchmark — worker over-subscription (throughput + p99
# under CPU contention) and blocking-pool RSS blast radius, default vs tuned.
# No Postgres needed; run under `taskset -c 0,1` to model a 2-core quota.
[[example]]
name = "bench_tokio_runtime"
path = "examples/bench_tokio_runtime.rs"
required-features = ["bench"]

# CPU pool concurrency benchmark — thumbnail decode throughput + p99 as the
# decode-permit count is swept, showing the effect of sizing the image pools to
# the CFS quota (effective_parallelism) vs the host core count. No Postgres;
# run under `taskset -c 0,1` to model a 2-core quota.
[[example]]
name = "bench_pool_concurrency"
path = "examples/bench_pool_concurrency.rs"
required-features = ["bench"]

# ACL owner-cache benchmark — owner query vs moka hit (needs the dev Postgres up).
[[example]]
name = "bench_owner_cache"
53 changes: 53 additions & 0 deletions benches/BLOB-PREFETCH.md
@@ -0,0 +1,53 @@
# Blob download read-ahead benchmark

Measures the local backend's chunk read-ahead depth — the `buffered(N)`
read-ahead in `DedupService::stream_chunks` fed by
`BlobStorageBackend::read_prefetch()`. Reproduces the exact production reassembly
combinator (`stream::iter(hashes).map(get_blob_stream).buffered(N).try_flatten()`)
over a real `LocalBlobBackend` whose chunk files are scattered across the 256
hash-prefix dirs, then drains it and reports throughput. `N = 1` is the old
production default ("before"); higher N is the change ("after").

## Reproduce

```bash
cargo run --release --features bench --example bench_blob_prefetch
# tunables: BENCH_FILE_MB=192 BENCH_CHUNK_KB=256 BENCH_PREFETCH=1,2,4,8,16
# BENCH_THROTTLE_MBPS=0,300,100 BENCH_REPS=5 BENCH_COLD=1
```

## Results (4-core box, SSD-class storage, 192 MiB in 768×256 KiB chunks)

Median MB/s over 5 reps. The `vs N=1` figures listed under the table are the
read-ahead gain over the old default.

| scenario | N=1 | N=2 | N=4 | N=8 | N=16 |
|-------------------------|----:|-----:|-----:|-----:|-----:|
| warm / unthrottled |1306 | **1460** |1394 |1357 |1248 |
| cold / unthrottled | 456 | **489** | 469 | 478 | 473 |
| warm / throttled@300MB/s| 167 | 166 | 166 | 166 | 166 |
| cold / throttled@300MB/s| 138 | 140 | 135 | 135 | 136 |
| warm / throttled@100MB/s| 62 | 62 | 62 | 62 | 62 |
| cold / throttled@100MB/s| 57 | 58 | 57 | 57 | 57 |

(`vs N=1` for the best column N=2: warm/unthrottled **+11.8 %**, cold/unthrottled
**+7.2 %**; throttled rows ≈ 0 %. N=16 regresses warm −4.4 %.)

## Conclusions

1. **N=2 is the sweet spot, not 8.** It wins or ties in 5 of 6 scenarios at the
lowest fan-out: +11.8 % warm and +7.2 % cold on disk-bound reads, neutral when
the consumer is the bottleneck. N=8 gives only +3.9 % warm / +4.8 % cold; N=16
regresses on warm reads (−4.4 %). So the local backend now defaults to 2 (was 1);
S3/Azure keep 8 because they are bound by request latency.

2. **The win is disk-bound, not network-bound.** Throttled (network-bound) rows
are flat because `buffered(N)` here overlaps the per-chunk `File::open`
(cheap on local disk), **not** the data read (which `try_flatten` polls
sequentially). The disk-bound rows cover localhost/LAN downloads *and* the
internal blob reads that drain as fast as the disk delivers — thumbnail
render, transcode, ZIP export, content extraction — all via `stream_chunks`.

3. **No cold regression on SSD.** The trait doc's "slower cold" worry (concurrent
opens → random I/O over scattered chunk files) is an HDD seek-thrash concern;
on SSD-class storage cold reads *improved* at N=2. Operators on spinning disks
can restore the old behaviour with `OXICLOUD_LOCAL_READ_PREFETCH=1`; NVMe
arrays can raise it.
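
The override described in conclusion 3 can be sketched as follows. This is a minimal illustration, not the project's code: `parse_read_prefetch` and `read_prefetch_from_env` are hypothetical helpers, and clamping to a minimum of 1 and falling back to the default on unparsable input are assumptions about sensible behaviour. Only the variable name `OXICLOUD_LOCAL_READ_PREFETCH` and the default of 2 come from the text above.

```rust
/// Default chunk read-ahead depth for the local backend (2, per the conclusions).
const DEFAULT_LOCAL_READ_PREFETCH: usize = 2;

/// Hypothetical helper: turn the raw value of `OXICLOUD_LOCAL_READ_PREFETCH`
/// into a read-ahead depth.
/// Assumptions: unset or unparsable values fall back to the default, and the
/// depth is clamped to at least 1 so the stream always makes progress.
fn parse_read_prefetch(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .map(|n| n.max(1))
        .unwrap_or(DEFAULT_LOCAL_READ_PREFETCH)
}

/// Hypothetical wrapper that reads the real environment variable.
fn read_prefetch_from_env() -> usize {
    let raw = std::env::var("OXICLOUD_LOCAL_READ_PREFETCH").ok();
    parse_read_prefetch(raw.as_deref())
}
```

Under these assumptions an operator on spinning disks would set the variable to `1` to get the old sequential behaviour, and leaving it unset yields the new default of 2.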
77 changes: 77 additions & 0 deletions benches/POOL-CONCURRENCY.md
@@ -0,0 +1,77 @@
# CPU pool concurrency benchmark — thumbnail decode under a CPU quota

Measures what `effective_parallelism()` changes for the image pools
(`ThumbnailService::max_concurrent_decodes`, `image_transcode_service`, `di.rs`
video ffmpeg fan-out): the number of concurrent CPU-heavy renders permitted.
Those pools used to size from `available_parallelism()`, which ignores the CFS
quota (`--cpus` / cgroup `cpu.max`), so under a container quota they permit one
render per *host* core. Drives the **real service path** — `Semaphore(K)` gating
`spawn_blocking(ThumbnailService::bench_render_all)` with a gallery of concurrent
callers — and sweeps the permit count K, measuring throughput, p50/p99, and peak
RSS for K concurrent decodes.
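
The permit-gating pattern can be illustrated with the standard library alone. The sketch below is a stand-in, not the bench: it uses OS threads and a hand-rolled counting semaphore instead of Tokio's `Semaphore` and `spawn_blocking`, a short sleep stands in for a decode, and `Permits` and `peak_concurrency` are hypothetical names. It shows only that K permits bound the number of callers working at once, however many callers are queued.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (std-only stand-in for an async semaphore).
struct Permits {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Permits {
    fn new(k: usize) -> Self {
        Permits { free: Mutex::new(k), cv: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `callers` threads that each need one of `k` permits to do their work,
/// and return the highest number of callers observed working at the same time.
fn peak_concurrency(k: usize, callers: usize) -> usize {
    let permits = Arc::new(Permits::new(k));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..callers)
        .map(|_| {
            let (permits, active, peak) = (permits.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                permits.acquire();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // placeholder for a decode
                active.fetch_sub(1, Ordering::SeqCst);
                permits.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling `peak_concurrency(2, 48)` models the "2 (effective)" row: 48 gallery callers queue up, but no more than two are ever inside the gated section.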

## Reproduce

```bash
cargo build --release --features bench --example bench_pool_concurrency
taskset -c 0,1 ./target/release/examples/bench_pool_concurrency # model a 2-core quota
# tunables: BENCH_K_LIST=1,2,4,8,16 BENCH_GALLERY=48 BENCH_SECONDS=4
```

## Results (4-core box, pinned to 2 cores; image: synthetic 48 MP JPEG)

### [A] Throughput + tail latency (48 concurrent gallery callers)

| permits | renders/s | p50 ms | p99 ms |
|--------:|----------:|-------:|-------:|
| 1 | 16.5 | 7342 | 10370 |
| 2 (effective) | 20.0 | 5009 | 5816 |
| 4 | 20.8 | 4895 | 5536 |
| 8 | 20.0 | 4784 | 5685 |
| 16 | 18.0 | 4576 | 6140 |
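
For readers unfamiliar with the p50/p99 columns, one common way to derive them from raw latency samples is the nearest-rank method. The sketch below is illustrative only: the bench's actual percentile method is not shown in this write-up, and `percentile` is a hypothetical helper.

```rust
/// Nearest-rank percentile over latency samples (e.g. milliseconds).
/// `pct` is a whole percentage in 1..=100. Returns `None` for an empty set.
fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest rank: ceil(pct / 100 * n), computed in integers, clamped to [1, n].
    let rank = ((pct * n + 99) / 100).clamp(1, n);
    Some(sorted[rank - 1])
}
```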

### [B] Peak RSS, K concurrent decodes (one wave)

| permits | peak RSS MiB |
|--------:|-------------:|
| 1 | 137 |
| 2 | 137 |
| 4 | 137 |
| 8 | 137 |
| 16 | 137 |

## Conclusions

1. **The thumbnail-decode pool is not a bottleneck — over-permitting costs
nothing measurable here.** Throughput is flat from K=2 to K=8 (CPU-bound: two
cores stay saturated regardless), p99 barely moves, and **peak RSS is flat at
137 MiB across K=1..16**. K=1 under-utilises (one decode can't fill two cores);
K=16 is marginally worse on throughput/p99. So sizing this pool to the CFS
quota neither gains nor loses on this workload.

2. **This confirms the codebase's own design.** `thumbnail_service.rs` documents
that *shrink-on-load* (DCT-scaled decode straight to thumbnail size, ~18–25 MB
regardless of source resolution) is why the historical concurrency throttle
was removed — "the RAM ceiling no longer forces throttling and we can saturate
every core". The flat RSS is exactly that: each concurrent decode's transient
buffer is small, so 16 in flight cost the same resident memory as 1.

3. **Decision: NOT migrated (reverted).** Because the only pool this bench could
isolate showed zero measured benefit, the `effective_parallelism()` migration
of the image pools was reverted — adding code without a measured win isn't
worth it. The `effective_parallelism()` helper stays (it has a *measured*
benefit in the Tokio runtime; see `RUNTIME.md`), so a future, deliberately
measured case can adopt it per-pool.
The one pool with a plausible a-priori argument is the **ffmpeg video
fan-out** (one heavyweight OS process per permit — 32 ffmpeg processes for a
2-core budget on a many-core host is self-evidently wasteful). That was left
on `available_parallelism()` too, to revisit *with* a measurement if a
high-host-core / low-quota deployment running video thumbnails ever warrants
it. The transcode rayon pool over-sizing only costs parked thread stacks
(negligible).

4. **Honest caveat on scale.** This was run at a 2-core quota on a 4-core host,
so even the K=8 row is only a 4× over-permit. On a 64-core host under a 2-core
quota the host-count permit would be 64 (32× over), where even small per-decode
costs and scheduler pressure add up. That is the regime quota-aware sizing
protects against, but this box can't reproduce it.
93 changes: 93 additions & 0 deletions benches/RUNTIME.md
@@ -0,0 +1,93 @@
# Tokio runtime tuning benchmark

Measures the two things `build_runtime` (`src/main.rs`) changes versus the bare
`#[tokio::main]` defaults, sized by `common::runtime::runtime_pool_sizes`:

- **Worker count.** `#[tokio::main]` defaults to `available_parallelism()`, which
honours CPU *affinity* (`sched_getaffinity`: cpuset, `taskset`) but **ignores
the CFS bandwidth quota** (`docker --cpus`, cgroup v2 `cpu.max`, v1
`cpu.cfs_quota_us`). On a 2-core-quota container on a many-core host it spawns
one worker per *host* core. `effective_parallelism()` folds the quota back in.
- **Blocking pool.** `#[tokio::main]` defaults to a flat `max_blocking_threads =
512` — a multi-GB RSS blast radius for this heavy `spawn_blocking` user
(thumbnails, transcode, zip, PDF/text extraction, Argon2 ≈19 MB/hash). The
builder caps it at `max(32, 8 × workers)`.
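
The two sizing rules above can be sketched in a few lines. This is not the code in `common::runtime`: the function names are hypothetical, only the cgroup v2 `cpu.max` format (`"<quota> <period>"` or `"max <period>"`) is handled, and rounding a fractional quota up to a whole core is an assumption. The `max(32, 8 × workers)` cap is taken from the text.

```rust
/// Parse the contents of a cgroup v2 `cpu.max` file into a whole number of
/// cores. Returns `None` when there is no quota ("max") or the text is malformed.
/// Assumption: fractional quotas round up (1.5 CPUs -> 2 workers).
fn quota_cores(cpu_max: &str) -> Option<usize> {
    let mut parts = cpu_max.split_whitespace();
    let quota = parts.next()?;
    let period: u64 = parts.next()?.parse().ok()?;
    if quota == "max" || period == 0 {
        return None;
    }
    let quota: u64 = quota.parse().ok()?;
    if quota == 0 {
        return None;
    }
    Some(((quota + period - 1) / period) as usize)
}

/// Hypothetical stand-in for `effective_parallelism()`: the affinity-based
/// core count, reduced to the CFS quota when one is set.
fn effective_parallelism_sketch(available: usize, cpu_max: Option<&str>) -> usize {
    cpu_max
        .and_then(quota_cores)
        .map_or(available, |q| q.min(available))
        .max(1)
}

/// Blocking-pool cap as described above: max(32, 8 x workers).
fn blocking_cap(workers: usize) -> usize {
    std::cmp::max(32, 8 * workers)
}
```

In a real process `available` would come from `std::thread::available_parallelism()` and `cpu_max` from reading `/sys/fs/cgroup/cpu.max`. With no quota the sketch returns `available` unchanged, which matches the "no regression off-quota" behaviour described in the conclusions.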

## Reproduce

```bash
cargo build --release --features bench --example bench_tokio_runtime
# Pin to 2 cores to model a 2-core CPU quota on a bigger host:
taskset -c 0,1 ./target/release/examples/bench_tokio_runtime
# Part B uses a fixed glibc mmap threshold for a clean RSS read:
MALLOC_MMAP_THRESHOLD_=131072 MALLOC_TRIM_THRESHOLD_=131072 \
taskset -c 0,1 ./target/release/examples/bench_tokio_runtime
# tunables: BENCH_CONCURRENCY=96 BENCH_SECONDS=4 BENCH_BURN_KB=256
# BENCH_WORKERS_BEFORE=32 BENCH_BLOCKING_TASKS=96 BENCH_ALLOC_MB=16 BENCH_MAX_BLOCKING_AFTER=16
```

## Results (4-core box, pinned to 2 cores via `taskset -c 0,1`)

### [A] Worker over-subscription under CPU contention

96 concurrent async "requests", each an async hop + a 256 KiB BLAKE3 (models a
handler that interleaves I/O with on-worker compute), over 4 s.

| runtime | req/s | p50 µs | p99 µs |
|-----------------------|-------:|-------:|-------:|
| before: 32 workers | 46 854 | 121 | 60 360 |
| after: 2 workers | 42 893 | 2 140 | 4 962 |

→ **throughput −8.5 %, p99 latency −91.8 %** (after vs before)
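
The percentages depend on which run is taken as the baseline, so it helps to spell the arithmetic out. The helpers below are illustrative (the names are not from the bench) and use only the figures in table [A].

```rust
/// Relative change from `from` to `to`, in percent.
fn rel_change_pct(from: f64, to: f64) -> f64 {
    (to - from) / from * 100.0
}

/// Round to one decimal place for reporting.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}

/// (throughput after vs before, throughput before vs after, p99 after vs before)
fn table_a_deltas() -> (f64, f64, f64) {
    let (before_rps, after_rps) = (46_854.0, 42_893.0);
    let (before_p99, after_p99) = (60_360.0, 4_962.0);
    (
        round1(rel_change_pct(before_rps, after_rps)),
        round1(rel_change_pct(after_rps, before_rps)),
        round1(rel_change_pct(before_p99, after_p99)),
    )
}
```

With the "before" run as baseline the tuned runtime loses 8.5 % throughput; with the tuned run as baseline the over-subscribed runtime is 9.2 % faster. Both describe the same pair of measurements.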

### [B] Blocking-pool RSS blast radius

96 concurrent `spawn_blocking` tasks, 16 MiB resident each, held 120 ms
(fixed glibc mmap threshold so freed allocations leave RSS promptly).

| max_blocking_threads | peak RSS MiB | vs default |
|-----------------------------|-------------:|-----------:|
| before: 512 (tokio default) | 1 231 | — |
| after: 16 (bounded) | 261 | −970 MiB |
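
A back-of-the-envelope bound helps read this table. Memory held by in-flight blocking tasks cannot exceed the number of tasks that can run at once times the per-task footprint. The helper below is a hypothetical illustration of that arithmetic with the bench's parameters. It bounds task-held memory only and ignores the process baseline, so it is something to compare the measured RSS against, not a reproduction of it.

```rust
/// Upper bound (MiB) on memory held by concurrently running blocking tasks:
/// at most min(tasks, max_blocking_threads) run at once, each holding
/// `per_task_mib`.
fn max_inflight_mib(tasks: usize, max_blocking_threads: usize, per_task_mib: usize) -> usize {
    tasks.min(max_blocking_threads) * per_task_mib
}
```

For 96 tasks at 16 MiB each, the bound is 1536 MiB with the default 512-thread pool and 256 MiB with a 16-thread cap; the measured peaks were 1231 MiB and 261 MiB.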

## Conclusions

1. **Blocking-pool cap — clear win, no downside.** Bounding 512→16 cut peak RSS
under a 96-task flood from **1231 MiB to 261 MiB (−970 MiB)**. The cap only
engages under a pile-up; steady-state operation is unaffected, and the app's
heaviest blocking consumers are already semaphore-limited (Argon2 = 2,
thumbnail decode ≈ cores), so `max(32, 8×workers)` is generous headroom that
simply removes the unbounded tail that can OOM-kill the process under a spike.

2. **Worker sizing — a latency/throughput trade, favourable for a server.**
Over-subscription (32 workers on 2 cores, what tokio's default does under a
CFS quota) won **+9.2 % peak throughput** (equivalently, the tuned config gives
up 8.5 %) but at a **catastrophic p99 of 60 ms** (12× the tuned 5 ms) with a
bimodal distribution: some requests fly (p50 121 µs), others starve. Sizing to
the quota (2 workers) gives uniform,
predictable latency at a small throughput cost. For an interactive file
server, p99 dominates UX (timeouts, head-of-line blocking), so this is the
right trade.

3. **This microbenchmark is a worst case *for* the tuned config.** It is pure
on-worker CPU, which is exactly where over-subscription's throughput edge
shows. Real OxiCloud handlers push CPU to `spawn_blocking` and the async
workers mostly await I/O (DB, disk) — there the over-subscription throughput
edge evaporates (idle workers just park) while its tail-latency penalty
remains. Production should see the worker change as ≥ neutral on throughput
and strictly better on tail latency.

4. **No regression off-quota.** `effective_parallelism()` == `available_parallelism()`
whenever there is no CFS quota (or affinity already restricts
the process), so on bare metal / affinity-pinned deployments the worker count
is unchanged from the old default. The change only bites under a CFS quota —
precisely the case it fixes.

5. **Follow-up:** the same `available_parallelism()` blind spot affects the
image/rayon pools (`thumbnail_service.rs`, `image_transcode_service.rs`,
`di.rs` video) — they over-spawn under a CFS quota too. Switching those to
`common::runtime::effective_parallelism()` is the natural next step (left out
here to keep this change focused on the runtime).

Both knobs are env-overridable (`OXICLOUD_WORKER_THREADS` /
`OXICLOUD_MAX_BLOCKING_THREADS`) and logged at startup ("Tokio runtime pools
sized"), so operators can see and tune what is in effect.