perf(blob): one manifest query in read_blob_bytes instead of two

DioCrafts · claude · DioCrafts · commit f68972e368d0 · 2026-06-22T00:01:56.000+02:00
read_blob_bytes read the same storage.chunk_manifests PK row twice per
full-blob read — blob_size (SELECT total_size) then read_blob_stream
(SELECT chunk_hashes) — even though both columns live in one row. Fold
them into a single `SELECT chunk_hashes, total_size` and share the chunk
stream builder via a new stream_chunks helper. The legacy (no-manifest)
path is unchanged. Output is identical; the read just costs one fewer DB
round-trip.
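
For illustration only, the round-trip accounting can be sketched without a database. Everything below (`ManifestRow`, `CountingStore`, `read_old`, `read_new`) is a hypothetical in-memory stand-in, not the DedupService code; it assumes each lookup costs one round-trip and only shows that both shapes return the same data while the new one issues a single lookup.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical stand-in for one `storage.chunk_manifests` row.
#[derive(Clone, Debug, PartialEq)]
pub struct ManifestRow {
    pub chunk_hashes: Vec<String>,
    pub total_size: i64,
}

/// Hypothetical in-memory store; every lookup counts as one DB round-trip.
#[derive(Default)]
pub struct CountingStore {
    rows: HashMap<String, ManifestRow>,
    lookups: Cell<usize>,
}

impl CountingStore {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn insert(&mut self, file_hash: &str, row: ManifestRow) {
        self.rows.insert(file_hash.to_string(), row);
    }

    /// Number of simulated round-trips issued so far.
    pub fn lookups(&self) -> usize {
        self.lookups.get()
    }

    /// Shared lookup path: bumps the counter, then returns the row if present.
    fn fetch(&self, file_hash: &str) -> Option<&ManifestRow> {
        self.lookups.set(self.lookups.get() + 1);
        self.rows.get(file_hash)
    }
}

/// Old shape: `SELECT total_size`, then `SELECT chunk_hashes` (two round-trips).
pub fn read_old(store: &CountingStore, file_hash: &str) -> Option<(Vec<String>, usize)> {
    let total_size = store.fetch(file_hash)?.total_size;
    let chunk_hashes = store.fetch(file_hash)?.chunk_hashes.clone();
    Some((chunk_hashes, total_size.max(0) as usize))
}

/// New shape: `SELECT chunk_hashes, total_size` (one round-trip).
pub fn read_new(store: &CountingStore, file_hash: &str) -> Option<(Vec<String>, usize)> {
    let row = store.fetch(file_hash)?;
    Some((row.chunk_hashes.clone(), row.total_size.max(0) as usize))
}
```

Calling `read_old` and then `read_new` on the same key advances the counter by two and then by one, and both return the same `(chunk_hashes, size)` pair. The legacy (no-manifest) fallback is deliberately not modelled here.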

Benchmark (examples/bench_blob_manifest.rs, isolates the manifest lookup
against the real Postgres): ~1.9x throughput and p50/p99 roughly halved on
that sub-step; the win is the removed round-trip under pool pressure during
upload bursts. Note this is the manifest sub-step only — end-to-end
read_blob_bytes is dominated by the actual chunk reads, and it is a
background path (thumbnail generation / EXIF / indexing), not normal gallery
serving. Methodology + honest framing in benches/BLOB-MANIFEST.md.
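
As a sanity check on the ~1.9x figure, the ratios can be recomputed from the results table in benches/BLOB-MANIFEST.md. This is a small sketch, not part of the benchmark; the `Row` type, the constant names and the helper functions are made up for the example, and only the numbers come from the table.

```rust
/// One row of the results table in benches/BLOB-MANIFEST.md.
#[derive(Clone, Copy, Debug)]
pub struct Row {
    pub ops_per_sec: f64,
    pub p50_ms: f64,
    pub p99_ms: f64,
}

// Figures copied from the table (pool=20, 4 s/run, 64-chunk manifest).
pub const CONC4_OLD: Row = Row { ops_per_sec: 4646.0, p50_ms: 0.850, p99_ms: 1.213 };
pub const CONC4_NEW: Row = Row { ops_per_sec: 8931.0, p50_ms: 0.442, p99_ms: 0.644 };
pub const CONC64_OLD: Row = Row { ops_per_sec: 7490.0, p50_ms: 8.538, p99_ms: 10.117 };
pub const CONC64_NEW: Row = Row { ops_per_sec: 14330.0, p50_ms: 4.396, p99_ms: 5.869 };

/// Ratio with a zero guard, mirroring how the benchmark avoids dividing by zero.
pub fn ratio(numerator: f64, denominator: f64) -> f64 {
    if denominator > 0.0 {
        numerator / denominator
    } else {
        0.0
    }
}

/// NEW ops/s over OLD ops/s (higher is better).
pub fn throughput_gain(old: Row, new: Row) -> f64 {
    ratio(new.ops_per_sec, old.ops_per_sec)
}

/// OLD p50 over NEW p50 (higher means a larger latency reduction).
pub fn p50_speedup(old: Row, new: Row) -> f64 {
    ratio(old.p50_ms, new.p50_ms)
}

/// OLD p99 over NEW p99 (higher means a larger latency reduction).
pub fn p99_speedup(old: Row, new: Row) -> f64 {
    ratio(old.p99_ms, new.p99_ms)
}
```

With the table's numbers, the throughput ratio comes out just above 1.9 at both concurrency levels and p50 improves by a similar factor. The p99 ratio is closer to 1.7 at conc 64, so "roughly halved" is slightly generous for the contended tail.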

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/Cargo.toml b/Cargo.toml
@@ -168,6 +168,12 @@ name = "bench_db_pool"
 path = "examples/bench_db_pool.rs"
 required-features = ["bench"]
 
+# Blob-manifest read round-trip benchmark: OLD 2 queries vs NEW 1 (needs dev Postgres).
+[[example]]
+name = "bench_blob_manifest"
+path = "examples/bench_blob_manifest.rs"
+required-features = ["bench"]
+
 # ACL owner-cache benchmark — owner query vs moka hit (needs the dev Postgres up).
 [[example]]
 name = "bench_owner_cache"
diff --git a/benches/BLOB-MANIFEST.md b/benches/BLOB-MANIFEST.md
@@ -0,0 +1,48 @@
+# Blob-manifest read — halve the manifest round-trips
+
+`DedupService::read_blob_bytes` (the full-blob read used by thumbnail generation,
+EXIF extraction, content indexing, etc.) used to read the **same**
+`storage.chunk_manifests` PK row **twice**:
+
+- `blob_size(hash)` → `SELECT total_size …` (for the buffer pre-allocation), then
+- `read_blob_stream(hash)` → `SELECT chunk_hashes …` (to stream the chunks).
+
+On the thumbnail cold path that's **2N manifest queries** for an N-image gallery
+load. The change folds both into one query and shares the chunk-stream builder:
+
+```sql
+SELECT chunk_hashes, total_size FROM storage.chunk_manifests WHERE file_hash = $1
+```
+
+(`dedup_service.rs` — `read_blob_bytes` + the extracted `stream_chunks` helper.)
+The legacy (no-manifest) path is unchanged.
+
+## Reproduce
+
+```bash
+cargo run --release --features bench --example bench_blob_manifest
+```
+
+Needs the dev Postgres up (reads `DATABASE_URL` from `.env`). The bench isolates
+exactly what changed — the manifest lookup(s) per blob read, not the unchanged
+chunk streaming — running OLD (2 queries/op) vs NEW (1 query/op) against the real
+pool, at low contention (raw per-op cost) and high contention (concurrency > pool,
+where holding a connection ~2× longer inflates the tail).
+
+## Results (pool=20, 4 s/run, 64-chunk manifest)
+
+| contention | mode | ops/s | p50 ms | p95 ms | p99 ms |
+|---|---|---:|---:|---:|---:|
+| conc 4 (no pool pressure) | OLD (2q) | 4 646 | 0.850 | 1.028 | 1.213 |
+| | **NEW (1q)** | **8 931** | **0.442** | **0.546** | **0.644** |
+| conc 64 (> pool 20) | OLD (2q) | 7 490 | 8.538 | 9.201 | 10.117 |
+| | **NEW (1q)** | **14 330** | **4.396** | **5.229** | **5.869** |
+
+- **~1.9× throughput** on the manifest-read step, **p50 and p99 roughly halved**.
+- Under pool pressure the absolute latency saved is larger (p50 8.5 → 4.4 ms),
+  because each OLD read occupies a connection for two round-trips instead of one —
+  exactly the tail-latency-under-contention win this targeted.
+
+The end-to-end gallery-load impact is smaller than 1.9× (chunk reads and decode
+dominate the full `read_blob_bytes`), but this removes one DB round-trip from
+*every* full-blob read, which is the part that queues under load.
diff --git a/examples/bench_blob_manifest.rs b/examples/bench_blob_manifest.rs
@@ -0,0 +1,230 @@
+//! Blob-manifest read benchmark — `read_blob_bytes` manifest round-trips.
+//!
+//! `read_blob_bytes` used to read the same `storage.chunk_manifests` PK row
+//! TWICE per full-blob read: once via `blob_size` (`SELECT total_size`) and once
+//! via `read_blob_stream` (`SELECT chunk_hashes`). On the thumbnail cold path
+//! that is 2N manifest queries for an N-image gallery load. The change folds both
+//! into ONE query (`SELECT chunk_hashes, total_size`).
+//!
+//! This isolates exactly that change — the two manifest lookups vs the one — and
+//! leaves the (unchanged) chunk streaming out, so the signal is the DB
+//! round-trip(s) per blob read. Runs OLD (2 queries) vs NEW (1 query) against the
+//! real dev Postgres, at low contention (raw per-op latency) and high contention
+//! (concurrency > pool, where holding a connection ~2× longer inflates the tail).
+//!
+//! Run (needs the dev Postgres up; reads DATABASE_URL from .env):
+//!   cargo run --release --features bench --example bench_blob_manifest
+//! Tunables (env): BENCH_POOL (20), BENCH_SECONDS (4), BENCH_CHUNKS (64),
+//!   BENCH_CONCURRENCIES ("4,64").
+
+use std::env;
+use std::sync::Arc;
+use std::time::{Duration, Instant};
+
+use sqlx::PgPool;
+use sqlx::postgres::PgPoolOptions;
+
+/// Synthetic manifest key: exactly 64 chars (the VARCHAR(64) PK width), and the
+/// non-hex letters ('n','h') guarantee it can never collide with a real
+/// BLAKE3 blob hash (always lowercase hex). 8 × "bench000".
+const FILE_HASH: &str = "bench000bench000bench000bench000bench000bench000bench000bench000";
+
+fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
+    env::var(key)
+        .ok()
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(default)
+}
+
+#[derive(Clone, Copy)]
+enum Mode {
+    /// Two separate manifest lookups (the old blob_size + read_blob_stream).
+    Old,
+    /// One combined manifest lookup (the new read_blob_bytes).
+    New,
+}
+
+async fn seed(pool: &PgPool, n_chunks: usize) {
+    let chunk_hashes: Vec<String> = (0..n_chunks).map(|i| format!("{i:064x}")).collect();
+    let chunk_sizes: Vec<i64> = vec![65_536; n_chunks];
+    let total: i64 = chunk_sizes.iter().sum();
+    sqlx::query(
+        "INSERT INTO storage.chunk_manifests
+             (file_hash, chunk_hashes, chunk_sizes, total_size, chunk_count)
+         VALUES ($1, $2, $3, $4, $5)
+         ON CONFLICT (file_hash) DO UPDATE
+             SET chunk_hashes = $2, chunk_sizes = $3, total_size = $4, chunk_count = $5",
+    )
+    .bind(FILE_HASH)
+    .bind(&chunk_hashes)
+    .bind(&chunk_sizes)
+    .bind(total)
+    .bind(n_chunks as i32)
+    .execute(pool)
+    .await
+    .expect("seed chunk_manifests row");
+}
+
+async fn cleanup(pool: &PgPool) {
+    let _ = sqlx::query("DELETE FROM storage.chunk_manifests WHERE file_hash = $1")
+        .bind(FILE_HASH)
+        .execute(pool)
+        .await;
+}
+
+/// One blob "manifest read" — exactly the queries the production code issues.
+async fn one_op(pool: &PgPool, mode: Mode) {
+    match mode {
+        Mode::Old => {
+            let _total: i64 = sqlx::query_scalar(
+                "SELECT total_size FROM storage.chunk_manifests WHERE file_hash = $1",
+            )
+            .bind(FILE_HASH)
+            .fetch_one(pool)
+            .await
+            .expect("old total_size query");
+            let _chunks: Vec<String> = sqlx::query_scalar(
+                "SELECT chunk_hashes FROM storage.chunk_manifests WHERE file_hash = $1",
+            )
+            .bind(FILE_HASH)
+            .fetch_one(pool)
+            .await
+            .expect("old chunk_hashes query");
+        }
+        Mode::New => {
+            let _row: (Vec<String>, i64) = sqlx::query_as(
+                "SELECT chunk_hashes, total_size FROM storage.chunk_manifests WHERE file_hash = $1",
+            )
+            .bind(FILE_HASH)
+            .fetch_one(pool)
+            .await
+            .expect("new combined query");
+        }
+    }
+}
+
+struct Stats {
+    count: usize,
+    rps: f64,
+    p50: f64,
+    p95: f64,
+    p99: f64,
+    max: f64,
+}
+
+fn summarize(mut lats: Vec<f64>, secs: u64) -> Stats {
+    lats.sort_by(|a, b| a.total_cmp(b));
+    let n = lats.len();
+    let pct = |p: f64| {
+        if n == 0 {
+            0.0
+        } else {
+            lats[((n as f64 * p) as usize).min(n - 1)]
+        }
+    };
+    Stats {
+        count: n,
+        rps: n as f64 / secs as f64,
+        p50: pct(0.50),
+        p95: pct(0.95),
+        p99: pct(0.99),
+        max: lats.last().copied().unwrap_or(0.0),
+    }
+}
+
+async fn run_window(pool: Arc<PgPool>, concurrency: usize, secs: u64, mode: Mode) -> Stats {
+    let deadline = Instant::now() + Duration::from_secs(secs);
+    let mut handles = Vec::with_capacity(concurrency);
+    for _ in 0..concurrency {
+        let pool = pool.clone();
+        handles.push(tokio::spawn(async move {
+            let mut lats = Vec::new();
+            while Instant::now() < deadline {
+                let t = Instant::now();
+                one_op(&pool, mode).await;
+                lats.push(t.elapsed().as_secs_f64() * 1000.0);
+            }
+            lats
+        }));
+    }
+    let mut all = Vec::new();
+    for h in handles {
+        all.extend(h.await.unwrap());
+    }
+    summarize(all, secs)
+}
+
+#[tokio::main(flavor = "multi_thread")]
+async fn main() {
+    dotenvy::dotenv().ok();
+    let url = env::var("DATABASE_URL")
+        .or_else(|_| env::var("OXICLOUD_DB_CONNECTION_STRING"))
+        .expect("set DATABASE_URL (or OXICLOUD_DB_CONNECTION_STRING) — the dev Postgres URL");
+
+    let pool_size: u32 = env_or("BENCH_POOL", 20);
+    let secs: u64 = env_or("BENCH_SECONDS", 4);
+    let n_chunks: usize = env_or("BENCH_CHUNKS", 64);
+    let concurrencies: Vec<usize> = env::var("BENCH_CONCURRENCIES")
+        .ok()
+        .map(|s| s.split(',').filter_map(|x| x.trim().parse().ok()).collect())
+        .unwrap_or_else(|| vec![4, 64]);
+
+    let pool = Arc::new(
+        PgPoolOptions::new()
+            .max_connections(pool_size)
+            .min_connections(pool_size) // pre-warm: don't time connection setup
+            .acquire_timeout(Duration::from_secs(10))
+            .connect(&url)
+            .await
+            .expect("connect dev Postgres"),
+    );
+
+    seed(&pool, n_chunks).await;
+
+    println!("\n###########################################################");
+    println!("# read_blob_bytes manifest round-trips: OLD (2 queries) vs NEW (1)");
+    println!("# pool={pool_size}  window={secs}s/run  chunks/manifest={n_chunks}");
+    println!("# latency = acquire-wait + manifest query/queries per blob read");
+    println!("###########################################################\n");
+    println!(
+        "| {:>5} | {:<4} | {:>9} | {:>9} | {:>7} | {:>7} | {:>7} | {:>7} |",
+        "conc", "mode", "ops", "ops/s", "p50 ms", "p95 ms", "p99 ms", "max ms"
+    );
+    println!(
+        "|{:-<7}|{:-<6}|{:-<11}|{:-<11}|{:-<9}|{:-<9}|{:-<9}|{:-<9}|",
+        "", "", "", "", "", "", "", ""
+    );
+
+    for &conc in &concurrencies {
+        // Warm-up (discarded) so the first real window isn't skewed.
+        let _ = run_window(pool.clone(), conc, 1, Mode::New).await;
+
+        let old = run_window(pool.clone(), conc, secs, Mode::Old).await;
+        let new = run_window(pool.clone(), conc, secs, Mode::New).await;
+        let row = |label: &str, s: &Stats| {
+            println!(
+                "| {:>5} | {:<4} | {:>9} | {:>9.0} | {:>7.3} | {:>7.3} | {:>7.3} | {:>7.3} |",
+                conc, label, s.count, s.rps, s.p50, s.p95, s.p99, s.max
+            );
+        };
+        row("OLD", &old);
+        row("NEW", &new);
+        let thr = if old.rps > 0.0 {
+            new.rps / old.rps
+        } else {
+            0.0
+        };
+        let p99 = if new.p99 > 0.0 {
+            old.p99 / new.p99
+        } else {
+            0.0
+        };
+        println!(
+            "|       | →    | {:>9} | {:>7.2}× | {:>7} | {:>7} | {:>6.2}× | {:>7} |",
+            "throughput", thr, "", "", p99, ""
+        );
+    }
+
+    cleanup(&pool).await;
+    println!("\n(ops = blob-manifest reads completed; NEW issues 1 query/op, OLD issues 2.)");
+}
diff --git a/src/infrastructure/services/dedup_service.rs b/src/infrastructure/services/dedup_service.rs
@@ -1475,6 +1475,33 @@ impl DedupService {
 
     // ── Read operations ──────────────────────────────────────────
 
+    /// Build an in-order, prefetched byte stream over a CDC file's chunks.
+    ///
+    /// Read-ahead depth is the backend's hint (1 for local disk, higher for
+    /// remote object stores where overlapping fetches hide per-chunk latency).
+    /// Shared by [`Self::read_blob_stream`] and [`Self::read_blob_bytes`] so both
+    /// build the chunk stream identically from a manifest's `chunk_hashes`.
+    fn stream_chunks(
+        &self,
+        chunk_hashes: Vec<String>,
+    ) -> Pin<Box<dyn Stream<Item = Result<Bytes, std::io::Error>> + Send>> {
+        let prefetch = self.backend.read_prefetch().max(1);
+        let backend = self.backend.clone();
+        let chunk_stream = stream::iter(chunk_hashes)
+            .map(move |chunk_hash| {
+                let backend = backend.clone();
+                async move {
+                    backend
+                        .get_blob_stream(&chunk_hash)
+                        .await
+                        .map_err(|e| std::io::Error::other(e.to_string()))
+                }
+            })
+            .buffered(prefetch)
+            .try_flatten();
+        Box::pin(chunk_stream)
+    }
+
     /// Stream blob content — CDC-aware with legacy fallback.
     ///
     /// For CDC files: looks up the manifest, then streams chunks in order,
@@ -1494,29 +1521,10 @@ impl DedupService {
         .await
         .map_err(|e| DomainError::internal_error("Dedup", format!("Manifest lookup: {}", e)))?;
 
-        if let Some(chunk_hashes) = manifest {
-            // CDC file: stream chunks in order. Read-ahead depth is the
-            // backend's hint (1 for local disk, higher for remote object
-            // stores where overlapping fetches hide per-chunk latency).
-            let prefetch = self.backend.read_prefetch().max(1);
-            let backend = self.backend.clone();
-            let chunk_stream = stream::iter(chunk_hashes)
-                .map(move |chunk_hash| {
-                    let backend = backend.clone();
-                    async move {
-                        backend
-                            .get_blob_stream(&chunk_hash)
-                            .await
-                            .map_err(|e| std::io::Error::other(e.to_string()))
-                    }
-                })
-                .buffered(prefetch)
-                .try_flatten();
-
-            Ok(Box::pin(chunk_stream))
-        } else {
+        match manifest {
+            Some(chunk_hashes) => Ok(self.stream_chunks(chunk_hashes)),
             // Legacy whole-file blob
-            self.backend.get_blob_stream(hash).await
+            None => self.backend.get_blob_stream(hash).await,
         }
     }
 
@@ -1525,11 +1533,33 @@ impl DedupService {
     /// This is intended for image-oriented workflows such as thumbnail
     /// generation where the downstream library already requires the full
     /// payload in memory to decode the image.
+    ///
+    /// A single manifest query fetches BOTH the size hint (for the buffer
+    /// pre-allocation) and the chunk list — they live in the same
+    /// `chunk_manifests` PK row, so reading them separately (the old
+    /// `blob_size` + `read_blob_stream`) doubled the manifest round-trips on
+    /// every full-blob read (e.g. 2N queries for an N-image gallery cold load).
     pub async fn read_blob_bytes(&self, hash: &str) -> Result<Bytes, DomainError> {
-        let expected_size = self.blob_size(hash).await? as usize;
-        let mut data = Vec::with_capacity(expected_size);
-        let mut stream = self.read_blob_stream(hash).await?;
+        let manifest = sqlx::query_as::<_, (Vec<String>, i64)>(
+            "SELECT chunk_hashes, total_size FROM storage.chunk_manifests WHERE file_hash = $1",
+        )
+        .bind(hash)
+        .fetch_optional(self.pool.as_ref())
+        .await
+        .map_err(|e| DomainError::internal_error("Dedup", format!("Manifest lookup: {}", e)))?;
+
+        let (mut stream, expected_size) = match manifest {
+            Some((chunk_hashes, total_size)) => {
+                (self.stream_chunks(chunk_hashes), total_size.max(0) as usize)
+            }
+            None => {
+                // Legacy whole-file blob: size + stream straight from the backend.
+                let size = self.backend.blob_size(hash).await? as usize;
+                (self.backend.get_blob_stream(hash).await?, size)
+            }
+        };
 
+        let mut data = Vec::with_capacity(expected_size);
         while let Some(chunk) = stream.next().await {
             let chunk = chunk.map_err(|e| {
                 DomainError::internal_error("Dedup", format!("Failed to read blob chunk: {}", e))