AtalayaLabs · DioCrafts · Jun 22, 2026
@@ -174,6 +174,31 @@ name = "bench_blob_manifest"
 path = "examples/bench_blob_manifest.rs"
 required-features = ["bench"]
 
+# Blob download read-ahead benchmark — sweeps the local backend's chunk prefetch
+# depth (read_prefetch / buffered(N)) under disk-bound vs network-bound consumers
+# and warm vs cold page cache. No Postgres needed.
+[[example]]
+name = "bench_blob_prefetch"
+path = "examples/bench_blob_prefetch.rs"
+required-features = ["bench"]
+
+# Tokio runtime tuning benchmark — worker over-subscription (throughput + p99
+# under CPU contention) and blocking-pool RSS blast radius, default vs tuned.
+# No Postgres needed; run under `taskset -c 0,1` to model a 2-core quota.
+[[example]]
+name = "bench_tokio_runtime"
+path = "examples/bench_tokio_runtime.rs"
+required-features = ["bench"]
+
+# CPU pool concurrency benchmark — thumbnail decode throughput + p99 as the
+# decode-permit count is swept, showing the effect of sizing the image pools to
+# the CFS quota (effective_parallelism) vs the host core count. No Postgres;
+# run under `taskset -c 0,1` to model a 2-core quota.
+[[example]]
+name = "bench_pool_concurrency"
+path = "examples/bench_pool_concurrency.rs"
+required-features = ["bench"]
+
 # ACL owner-cache benchmark — owner query vs moka hit (needs the dev Postgres up).
 [[example]]
 name = "bench_owner_cache"

@@ -0,0 +1,53 @@
+# Blob download read-ahead benchmark
+
+Measures the local backend's chunk read-ahead depth — the `buffered(N)`
+read-ahead in `DedupService::stream_chunks` fed by
+`BlobStorageBackend::read_prefetch()`. Reproduces the exact production reassembly
+combinator (`stream::iter(hashes).map(get_blob_stream).buffered(N).try_flatten()`)
+over a real `LocalBlobBackend` whose chunk files are scattered across the 256
+hash-prefix dirs, then drains it and reports throughput. `N = 1` is the old
+production default ("before"); higher N is the change ("after").
+
+## Reproduce
+
+```bash
+cargo run --release --features bench --example bench_blob_prefetch
+# tunables: BENCH_FILE_MB=192 BENCH_CHUNK_KB=256 BENCH_PREFETCH=1,2,4,8,16
+#   BENCH_THROTTLE_MBPS=0,300,100 BENCH_REPS=5 BENCH_COLD=1
+```
+
+## Results (4-core box, SSD-class storage, 192 MiB in 768×256 KiB chunks)
+
+Median MB/s over 5 reps. The `vs N=1` figures quoted under the table are the
+read-ahead gain relative to the old default.
+
+| scenario                | N=1 |  N=2 |  N=4 |  N=8 | N=16 |
+|-------------------------|----:|-----:|-----:|-----:|-----:|
+| warm / unthrottled      |1306 | **1460** |1394 |1357 |1248 |
+| cold / unthrottled      | 456 | **489** | 469 | 478 | 473 |
+| warm / throttled@300MB/s| 167 | 166 | 166 | 166 | 166 |
+| cold / throttled@300MB/s| 138 | 140 | 135 | 135 | 136 |
+| warm / throttled@100MB/s|  62 |  62 |  62 |  62 |  62 |
+| cold / throttled@100MB/s|  57 |  58 |  57 |  57 |  57 |
+
+(`vs N=1` for the best column N=2: warm/unthrottled **+11.8 %**, cold/unthrottled
+**+7.2 %**; throttled rows ≈ 0 %. N=16 regresses warm −4.4 %.)
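The `vs N=1` percentages can be re-derived from the table medians. The sketch below is only an illustration of that arithmetic: `gain_pct`, `gains_vs_n1` and the two constants are hypothetical names, not part of the benchmark source, and the numbers are copied from the two unthrottled rows above.

```rust
/// Relative change of `new` over `base`, in percent.
fn gain_pct(base: f64, new: f64) -> f64 {
    (new / base - 1.0) * 100.0
}

/// Median MB/s copied from the results table, ordered [N=1, N=2, N=4, N=8, N=16].
const WARM_UNTHROTTLED: [f64; 5] = [1306.0, 1460.0, 1394.0, 1357.0, 1248.0];
const COLD_UNTHROTTLED: [f64; 5] = [456.0, 489.0, 469.0, 478.0, 473.0];

/// Gain of every column over the N=1 baseline (the first entry is always 0).
fn gains_vs_n1(row: &[f64; 5]) -> [f64; 5] {
    let mut out = [0.0; 5];
    for (i, v) in row.iter().enumerate() {
        out[i] = gain_pct(row[0], *v);
    }
    out
}
```

Rounded to one decimal, `gains_vs_n1(&WARM_UNTHROTTLED)` gives the +11.8 % (N=2), +3.9 % (N=8) and −4.4 % (N=16) quoted in this document, and the cold row gives +7.2 % and +4.8 %.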
+
+## Conclusions
+
+1. **N=2 is the sweet spot, not 8.** It wins or ties in 5 of 6 scenarios at the
+   lowest fan-out: +11.8 % warm and +7.2 % cold on disk-bound reads, neutral when
+   the consumer is the bottleneck. N=8 gives only +3.9 %/+4.8 %; N=16 regresses.
+   So local now defaults to 2 (was 1); S3/Azure keep 8 (request-latency bound).
+
+2. **The win is disk-bound, not network-bound.** Throttled (network-bound) rows
+   are flat because `buffered(N)` here overlaps the per-chunk `File::open`
+   (cheap on local disk), **not** the data read (which `try_flatten` polls
+   sequentially). The disk-bound rows cover localhost/LAN downloads *and* the
+   internal blob reads that drain as fast as the disk delivers — thumbnail
+   render, transcode, ZIP export, content extraction — all via `stream_chunks`.
+
+3. **No cold regression on SSD.** The trait doc's "slower cold" worry (concurrent
+   opens → random I/O over scattered chunk files) is an HDD seek-thrash concern;
+   on SSD-class storage cold reads *improved* at N=2. Operators on spinning disks
+   can restore the old behaviour with `OXICLOUD_LOCAL_READ_PREFETCH=1`; NVMe
+   arrays can raise it.
@@ -0,0 +1,77 @@
+# CPU pool concurrency benchmark — thumbnail decode under a CPU quota
+
+Measures what `effective_parallelism()` changes for the image pools
+(`ThumbnailService::max_concurrent_decodes`, `image_transcode_service`, `di.rs`
+video ffmpeg fan-out): the number of concurrent CPU-heavy renders permitted.
+Those pools used to size from `available_parallelism()`, which ignores the CFS
+quota (`--cpus` / cgroup `cpu.max`), so under a container quota they permit one
+render per *host* core. Drives the **real service path** — `Semaphore(K)` gating
+`spawn_blocking(ThumbnailService::bench_render_all)` with a gallery of concurrent
+callers — and sweeps the permit count K, measuring throughput, p50/p99, and peak
+RSS for K concurrent decodes.
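To make the permit gate concrete, here is a minimal std-only sketch of the pattern described above: K permits limiting how many callers run CPU work at once. It is a stand-in under stated assumptions, not the service code. The real path uses a `Semaphore(K)` around `spawn_blocking`; `Permits` and `peak_in_flight` are hypothetical names, and the busy loop merely stands in for a decode.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical counting semaphore built from std primitives.
struct Permits {
    available: Mutex<usize>,
    freed: Condvar,
}

impl Permits {
    fn new(k: usize) -> Self {
        Permits { available: Mutex::new(k), freed: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut n = self.available.lock().unwrap();
        while *n == 0 {
            n = self.freed.wait(n).unwrap();
        }
        *n -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.freed.notify_one();
    }
}

/// Pushes `callers` threads through a `k`-permit gate and returns the highest
/// number of "renders" that were ever in flight at the same time.
fn peak_in_flight(k: usize, callers: usize) -> usize {
    let permits = Arc::new(Permits::new(k));
    let state = Arc::new(Mutex::new((0usize, 0usize))); // (in flight, peak)
    let handles: Vec<_> = (0..callers)
        .map(|_| {
            let permits = Arc::clone(&permits);
            let state = Arc::clone(&state);
            thread::spawn(move || {
                permits.acquire();
                {
                    let mut s = state.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                // Stand-in for a CPU-heavy decode.
                let mut acc = 0u64;
                for i in 0..200_000u64 {
                    acc = acc.wrapping_add(i * i);
                }
                std::hint::black_box(acc);
                state.lock().unwrap().0 -= 1;
                permits.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let peak = state.lock().unwrap().1;
    peak
}
```

Calling `peak_in_flight(2, 48)` never reports more than 2, however many callers queue up. K is the knob the benchmark sweeps.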
+
+## Reproduce
+
+```bash
+cargo build --release --features bench --example bench_pool_concurrency
+taskset -c 0,1 ./target/release/examples/bench_pool_concurrency   # model a 2-core quota
+# tunables: BENCH_K_LIST=1,2,4,8,16 BENCH_GALLERY=48 BENCH_SECONDS=4
+```
+
+## Results (4-core box, pinned to 2 cores; image: synthetic 48 MP JPEG)
+
+### [A] Throughput + tail latency (48 concurrent gallery callers)
+
+| permits | renders/s | p50 ms | p99 ms |
+|--------:|----------:|-------:|-------:|
+|       1 |      16.5 |   7342 |  10370 |
+|       2 (effective) | 20.0 | 5009 | 5816 |
+|       4 |      20.8 |   4895 |   5536 |
+|       8 |      20.0 |   4784 |   5685 |
+|      16 |      18.0 |   4576 |   6140 |
+
+### [B] Peak RSS, K concurrent decodes (one wave)
+
+| permits | peak RSS MiB |
+|--------:|-------------:|
+|       1 |          137 |
+|       2 |          137 |
+|       4 |          137 |
+|       8 |          137 |
+|      16 |          137 |
+
+## Conclusions
+
+1. **The thumbnail-decode pool is not a bottleneck — over-permitting costs
+   nothing measurable here.** Throughput is flat from K=2 to K=8 (CPU-bound: two
+   cores stay saturated regardless), p99 barely moves, and **peak RSS is flat at
+   137 MiB across K=1..16**. K=1 under-utilises (one decode can't fill two cores);
+   K=16 is marginally worse on throughput/p99. So sizing this pool to the CFS
+   quota neither gains nor loses on this workload.
+
+2. **This confirms the codebase's own design.** `thumbnail_service.rs` documents
+   that *shrink-on-load* (DCT-scaled decode straight to thumbnail size, ~18–25 MB
+   regardless of source resolution) is why the historical concurrency throttle
+   was removed — "the RAM ceiling no longer forces throttling and we can saturate
+   every core". The flat RSS is exactly that: each concurrent decode's transient
+   buffer is small, so 16 in flight cost the same resident memory as 1.
+
+3. **Decision: NOT migrated (reverted).** Because the only pool this bench could
+   isolate showed zero measured benefit, the `effective_parallelism()` migration
+   of the image pools was reverted — adding code without a measured win isn't
+   worth it. The `effective_parallelism()` helper stays (it has a *measured*
+   benefit in the Tokio runtime — see `RUNTIME`), so a future, deliberately
+   measured case can adopt it per-pool.
+   The one pool with a plausible a-priori argument is the **ffmpeg video
+   fan-out** (one heavyweight OS process per permit — 32 ffmpeg processes for a
+   2-core budget on a many-core host is self-evidently wasteful). That was left
+   on `available_parallelism()` too, to revisit *with* a measurement if a
+   high-host-core / low-quota deployment running video thumbnails ever warrants
+   it. The transcode rayon pool over-sizing only costs parked thread stacks
+   (negligible).
+
+4. **Honest caveat on scale.** This was run at a 2-core quota on a 4-core host
+   (K_oversub = 8 ≈ 4×). On a 64-core host under a 2-core quota the host-count
+   permit would be 64 (32× over), where even small per-decode costs and scheduler
+   pressure add up. That is the regime quota-aware sizing is meant to protect
+   against, and this box can't reproduce it.
@@ -0,0 +1,93 @@
+# Tokio runtime tuning benchmark
+
+Measures the two things `build_runtime` (`src/main.rs`) changes versus the bare
+`#[tokio::main]` defaults, sized by `common::runtime::runtime_pool_sizes`:
+
+- **Worker count.** `#[tokio::main]` defaults to `available_parallelism()`, which
+  honours CPU *affinity* (`sched_getaffinity`: cpuset, `taskset`) but **ignores
+  the CFS bandwidth quota** (`docker --cpus`, cgroup v2 `cpu.max`, v1
+  `cpu.cfs_quota_us`). On a 2-core-quota container on a many-core host it spawns
+  one worker per *host* core. `effective_parallelism()` folds the quota back in.
+- **Blocking pool.** `#[tokio::main]` defaults to a flat `max_blocking_threads =
+  512` — a multi-GB RSS blast radius for this heavy `spawn_blocking` user
+  (thumbnails, transcode, zip, PDF/text extraction, Argon2 ≈19 MB/hash). The
+  builder caps it at `max(32, 8 × workers)`.
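The two sizing rules above can be sketched with the standard library alone. The body of `effective_parallelism()` is not shown in this document, so this is an assumption about its general shape, not the project's implementation: `parse_cpu_max`, `effective_cores` and `blocking_cap` are hypothetical names, and rounding a fractional quota up is a choice made for the sketch. The input format is the cgroup v2 `cpu.max` file (`"<quota> <period>"` or `"max <period>"`).

```rust
/// Parses the contents of a cgroup v2 `cpu.max` file into a fractional CPU
/// limit. Returns `None` when there is no quota ("max") or the input is malformed.
fn parse_cpu_max(contents: &str) -> Option<f64> {
    let mut parts = contents.split_whitespace();
    let quota = parts.next()?;
    let period: f64 = parts.next()?.parse().ok()?;
    if quota == "max" || period <= 0.0 {
        return None;
    }
    let quota: f64 = quota.parse().ok()?;
    if quota <= 0.0 {
        return None;
    }
    Some(quota / period)
}

/// Hypothetical quota-aware core count: the affinity-visible cores, clamped by
/// the quota rounded up (assumption), and never below 1.
fn effective_cores(affinity_cores: usize, cpu_max: Option<&str>) -> usize {
    let quota_cores = cpu_max.and_then(parse_cpu_max).map(|q| q.ceil() as usize);
    match quota_cores {
        Some(q) => affinity_cores.min(q).max(1),
        None => affinity_cores.max(1),
    }
}

/// Blocking-pool cap as stated in the text: max(32, 8 × workers).
fn blocking_cap(workers: usize) -> usize {
    (8 * workers).max(32)
}
```

For a 64-core host with `cpu.max` set to `200000 100000`, `effective_cores(64, Some("200000 100000"))` yields 2 workers and `blocking_cap(2)` yields 32 blocking threads. With `max 100000` (no quota) the affinity count passes through unchanged.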
+
+## Reproduce
+
+```bash
+cargo build --release --features bench --example bench_tokio_runtime
+# Pin to 2 cores to model a 2-core CPU quota on a bigger host:
+taskset -c 0,1 ./target/release/examples/bench_tokio_runtime
+# Part B uses a fixed glibc mmap threshold for a clean RSS read:
+MALLOC_MMAP_THRESHOLD_=131072 MALLOC_TRIM_THRESHOLD_=131072 \
+  taskset -c 0,1 ./target/release/examples/bench_tokio_runtime
+# tunables: BENCH_CONCURRENCY=96 BENCH_SECONDS=4 BENCH_BURN_KB=256
+#   BENCH_WORKERS_BEFORE=32  BENCH_BLOCKING_TASKS=96 BENCH_ALLOC_MB=16 BENCH_MAX_BLOCKING_AFTER=16
+```
+
+## Results (4-core box, pinned to 2 cores via `taskset -c 0,1`)
+
+### [A] Worker over-subscription under CPU contention
+
+96 concurrent async "requests", each an async hop plus a BLAKE3 hash of 256 KiB
+(models a handler that interleaves I/O with on-worker compute), over 4 s.
+
+| runtime               |  req/s | p50 µs | p99 µs |
+|-----------------------|-------:|-------:|-------:|
+| before: 32 workers    | 46 854 |    121 | 60 360 |
+| after: 2 workers      | 42 893 |  2 140 |  4 962 |
+
+→ **throughput −8.5 %, p99 latency −91.8 %** (after vs before)
+
+### [B] Blocking-pool RSS blast radius
+
+96 concurrent `spawn_blocking` tasks, 16 MiB resident each, held 120 ms
+(fixed glibc mmap threshold so freed allocations leave RSS promptly).
+
+| max_blocking_threads        | peak RSS MiB | vs default |
+|-----------------------------|-------------:|-----------:|
+| before: 512 (tokio default) |        1 231 |          — |
+| after: 16 (bounded)         |          261 |   −970 MiB |
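As a sanity check on the table, the memory held by blocking tasks can be bounded from the benchmark parameters: at most `min(tasks, max_blocking_threads)` tasks run at once, each holding its allocation. The helper below is a hypothetical illustration of that arithmetic only. It ignores the process's baseline RSS and says nothing about how many tasks actually overlap at the peak.

```rust
/// Upper bound, in MiB, on memory held by concurrently running blocking tasks:
/// no more than `max_threads` of the `tasks` can run at once, each holding
/// `alloc_mib`. Baseline process memory is not included.
fn rss_upper_bound_mib(tasks: usize, max_threads: usize, alloc_mib: usize) -> usize {
    tasks.min(max_threads) * alloc_mib
}
```

With the parameters above, `rss_upper_bound_mib(96, 16, 16)` is 256 MiB, close to the measured 261 MiB once a small baseline is added. The unbounded case, `rss_upper_bound_mib(96, 512, 16)`, is 1536 MiB, and the measured 1231 MiB peak sits below it.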
+
+## Conclusions
+
+1. **Blocking-pool cap: clear win, no downside.** Bounding 512→16 (the
+   benchmark's `BENCH_MAX_BLOCKING_AFTER` value) cut peak RSS under a 96-task
+   flood from **1231 MiB to 261 MiB (−970 MiB)**. The cap only
+   engages under a pile-up; steady-state operation is unaffected, and the app's
+   heaviest blocking consumers are already semaphore-limited (Argon2 = 2,
+   thumbnail decode ≈ cores), so `max(32, 8×workers)` is generous headroom that
+   simply removes the unbounded tail that can OOM-kill the process under a spike.
+
+2. **Worker sizing — a latency/throughput trade, favourable for a server.**
+   Over-subscription (32 workers on 2 cores, what tokio's default does under a
+   CFS quota) won **+9.2 % peak throughput** (46 854 vs 42 893 req/s, i.e. the
+   tuned runtime gives up 8.5 %) but at a **catastrophic p99 of 60 ms** (12× the
+   tuned 5 ms) with a bimodal distribution: some requests fly
+   (p50 121 µs), others starve. Sizing to the quota (2 workers) gives uniform,
+   predictable latency at a small throughput cost. For an interactive file
+   server, p99 dominates UX (timeouts, head-of-line blocking), so this is the
+   right trade.
+
+3. **This microbenchmark is a worst case *for* the tuned config.** It is pure
+   on-worker CPU, which is exactly where over-subscription's throughput edge
+   shows. Real OxiCloud handlers push CPU to `spawn_blocking` and the async
+   workers mostly await I/O (DB, disk) — there the over-subscription throughput
+   edge evaporates (idle workers just park) while its tail-latency penalty
+   remains. Production should see the worker change as ≥ neutral on throughput
+   and strictly better on tail latency.
+
+4. **No regression off-quota.** `effective_parallelism()` ==
+   `available_parallelism()` whenever there is no CFS quota (or affinity already
+   restricts the process), so on bare metal / affinity-pinned deployments the
+   worker count is unchanged from the old default. The change only bites under a
+   CFS quota, which is precisely the case it fixes.
+
+5. **Follow-up:** the same `available_parallelism()` blind spot affects the
+   image/rayon pools (`thumbnail_service.rs`, `image_transcode_service.rs`,
+   `di.rs` video) — they over-spawn under a CFS quota too. Switching those to
+   `common::runtime::effective_parallelism()` is the natural next step (left out
+   here to keep this change focused on the runtime).
+
+Both knobs are env-overridable (`OXICLOUD_WORKER_THREADS` /
+`OXICLOUD_MAX_BLOCKING_THREADS`) and logged at startup ("Tokio runtime pools
+sized"), so operators can see and tune what is in effect.