🤖 Filed from investigation of the 2026-04-16 log-store outage (see `/tmp/iris-outage/log_store_audit.md`).
Context
`DuckDBLogStore` segments hold logs for every user on the cluster. `_LocalSegment.min_key` / `max_key` are the Parquet row-group min/max of the `key` column, which currently span roughly `/alice/…` → `/zoe/…` for every file. As a result, `_segment_overlaps_key` and `_segment_overlaps_prefix` rarely narrow anything: almost every segment is included in almost every read, regardless of whose logs the user asked for.
That forces the per-read working-set cap (`_MAX_PARQUETS_PER_READ` / `_MAX_PARQUET_BYTES_PER_READ`) to do the narrowing, which is what produced the outage: keys whose rows happened to live outside the newest N segments were invisible.
#4820 (this change) lifts the cap to 25 files / 2.5 GB and adds a newest-first early-stop loop. Benchmark results at 4.5 GB / 46 segments: p95 now sits at ~280-500 ms for realistic tail queries and drops to ~30 ms when matches cluster in the newest segment. That's a workable ceiling, not a permanent answer — the underlying issue is that segment-level stats don't actually prune.
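The newest-first early-stop loop can be sketched roughly as below; the constant values match this change, but the function shape, the `(path, size_bytes, mtime)` segment tuples, and `lines_in_segment` are illustrative assumptions, not the real read path:

```python
# Hedged sketch of a newest-first early-stop read plan: walk segments from
# newest to oldest, stop once enough lines are found, and never exceed the
# lifted file/byte caps.
_MAX_PARQUETS_PER_READ = 25
_MAX_PARQUET_BYTES_PER_READ = 2_500_000_000  # 2.5 GB

def plan_read(segments, max_lines_needed, lines_in_segment):
    """segments: (path, size_bytes, mtime) tuples; returns paths to scan."""
    chosen, total_bytes, lines = [], 0, 0
    for path, size, _mtime in sorted(segments, key=lambda s: s[2], reverse=True):
        if (len(chosen) == _MAX_PARQUETS_PER_READ
                or total_bytes + size > _MAX_PARQUET_BYTES_PER_READ):
            break  # working-set cap reached
        chosen.append(path)
        total_bytes += size
        lines += lines_in_segment(path)
        if lines >= max_lines_needed:
            break  # early stop: newest segments already satisfied the tail
    return chosen
```

This is why matches clustered in the newest segment come back in ~30 ms: the loop stops after one file.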
Proposal: make `min_key` / `max_key` tight enough to prune
Options, roughly in order of implementation cost:
1. Sort each flushed parquet by `(user_prefix, key, seq)`. Segments already sort by `(key, seq)` at flush time (`_flush_sealed_buffer`, line 653), but the row-group min/max on `key` stays wide because a single file spans many users. If we instead wrote one parquet per user-prefix bucket, or at least partitioned row groups by user inside the file, row-group pruning on `key` would eliminate most segments for any exact-key or prefix query.
2. Per-user sub-segments within one file. Keep the single-file-per-flush layout but control row-group boundaries so each row group covers a contiguous user range. Parquet predicate pushdown already uses per-row-group min/max; this would cut segments scanned without changing the on-disk file count.
3. Maintain an auxiliary per-segment bloom/roaring index on `user_prefix`: a cheap in-memory side structure keyed by segment path. `_segment_overlaps_key` / `_segment_overlaps_prefix` consult it instead of `min_key`/`max_key`; update it on flush/consolidation. Avoids changing the parquet layout.
4. Route writes to per-user segments from the start. Largest change: maintain separate rolling segments per user (or per bucket). Cleanest pruning semantics, biggest refactor; would also change GC, consolidation, and GCS layout.
Option 1 or 2 seems like the right first step — both preserve the single-store design and make segment-level pruning actually do its job, at which point the `_MAX_PARQUETS_PER_READ` cap becomes a safety net rather than load-bearing.
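For comparison, option 3's side index can be as small as a per-segment set of user prefixes; a plain `set` stands in here for a bloom/roaring structure, and the function names are illustrative, not the store's real API:

```python
# Minimal sketch of option 3: a per-segment prefix index consulted by the
# overlap checks instead of the file-wide key range. Updated on flush and
# consolidation; avoids touching the parquet layout.
from collections import defaultdict

_prefix_index: dict[str, set[str]] = defaultdict(set)  # segment path -> prefixes

def record_flush(segment_path: str, keys: list[str]) -> None:
    # Called when a segment is flushed or consolidated.
    for key in keys:
        _prefix_index[segment_path].add(key.split("/")[1])

def segment_overlaps_key(segment_path: str, key: str) -> bool:
    # Like a bloom filter, "no" is definitive; a real bloom filter would
    # also allow rare false "yes" answers, which is safe for pruning.
    return key.split("/")[1] in _prefix_index[segment_path]

record_flush("seg-0001.parquet", ["/alice/job-1", "/bob/job-2"])
assert segment_overlaps_key("seg-0001.parquet", "/alice/job-9")
assert not segment_overlaps_key("seg-0001.parquet", "/zoe/job-1")
```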
Acceptance criteria
- `_segment_overlaps_key` rejects ≥90% of segments for a realistic exact-key query on the production corpus (~45 files).
- Tail p95 for `get_logs(exact_key, max_lines=1000)` stays <150 ms when no matches are in the newest segment but exist in older ones (currently ~300-700 ms with the cap-lifted early-stop path).
- No regression on mixed-write throughput or GCS offload volume.
References
- `lib/iris/src/iris/cluster/log_store/duckdb_store.py` — `_segment_overlaps_key` (lines 190-194), `_flush_sealed_buffer` sort (lines 653, 666), `_cap_segments` (lines 1011-1027).
- Outage audit: `/tmp/iris-outage/log_store_audit.md` (local).
- Benchmark: `/tmp/iris-outage/bench.py` (local).