🤖 Filed from investigation of the 2026-04-16 log-store outage (see `/tmp/iris-outage/log_store_audit.md`).
Context
`DuckDBLogStore` segments hold logs for every user on the cluster. `_LocalSegment.min_key` / `max_key` are the Parquet row-group min/max of the `key` column, which currently span roughly `/alice/…` → `/zoe/…` for every file. As a result, `_segment_overlaps_key` and `_segment_overlaps_prefix` rarely narrow anything: almost every segment is included in almost every read, regardless of whose logs the user asked for.
That forces the per-read working-set cap (`_MAX_PARQUETS_PER_READ` / `_MAX_PARQUET_BYTES_PER_READ`) to do the narrowing, which is what produced the outage: keys whose rows happened to live outside the newest N segments were invisible.
#4820 (this change) lifts the cap to 25 files / 2.5 GB and adds a newest-first early-stop loop. Benchmark results at 4.5 GB / 46 segments: p95 now sits at ~280-500 ms for realistic tail queries and drops to ~30 ms when matches cluster in the newest segment. That's a workable ceiling, not a permanent answer — the underlying issue is that segment-level stats don't actually prune.
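The newest-first early-stop loop can be sketched roughly as below; the constant values match this change, but the function shape, the `(path, size_bytes, mtime)` segment tuples, and `lines_in_segment` are illustrative assumptions, not the real read path:

```python
# Hedged sketch of a newest-first early-stop read plan: walk segments from
# newest to oldest, stop once enough lines are found, and never exceed the
# lifted file/byte caps.
_MAX_PARQUETS_PER_READ = 25
_MAX_PARQUET_BYTES_PER_READ = 2_500_000_000  # 2.5 GB

def plan_read(segments, max_lines_needed, lines_in_segment):
    """segments: (path, size_bytes, mtime) tuples; returns paths to scan."""
    chosen, total_bytes, lines = [], 0, 0
    for path, size, _mtime in sorted(segments, key=lambda s: s[2], reverse=True):
        if (len(chosen) == _MAX_PARQUETS_PER_READ
                or total_bytes + size > _MAX_PARQUET_BYTES_PER_READ):
            break  # working-set cap reached
        chosen.append(path)
        total_bytes += size
        lines += lines_in_segment(path)
        if lines >= max_lines_needed:
            break  # early stop: newest segments already satisfied the tail
    return chosen
```

This is why matches clustered in the newest segment come back in ~30 ms: the loop stops after one file.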
Proposal: make `min_key` / `max_key` tight enough to prune
Options, roughly in order of implementation cost:
1. Sort each flushed parquet by `(user_prefix, key, seq)`. Segments already sort by `(key, seq)` at flush time (`_flush_sealed_buffer`, line 653), but the row-group min/max on `key` stays wide because a single file spans many users. If we instead wrote one parquet per user-prefix bucket, or at least partitioned row groups by user inside the file, row-group pruning on `key` would eliminate most segments for any exact-key or prefix query.
2. Per-user sub-segments within one file. Keep the single-file-per-flush layout but control row-group boundaries so each row group covers a contiguous user range. Parquet predicate pushdown already uses per-row-group min/max; this would cut segments scanned without changing the on-disk file count.
3. Maintain an auxiliary per-segment bloom/roaring index on `user_prefix`: a cheap in-memory side structure keyed by segment path. `_segment_overlaps_key` / `_segment_overlaps_prefix` consult it instead of `min_key`/`max_key`; update it on flush/consolidation. Avoids changing the parquet layout.
4. Route writes to per-user segments from the start. Largest change: maintain separate rolling segments per user (or per bucket). Cleanest pruning semantics, biggest refactor; would also change GC, consolidation, and GCS layout.
Option 1 or 2 seems like the right first step — both preserve the single-store design and make segment-level pruning actually do its job, at which point the `_MAX_PARQUETS_PER_READ` cap becomes a safety net rather than load-bearing.
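For comparison, option 3's side index can be as small as a per-segment set of user prefixes; a plain `set` stands in here for a bloom/roaring structure, and the function names are illustrative, not the store's real API:

```python
# Minimal sketch of option 3: a per-segment prefix index consulted by the
# overlap checks instead of the file-wide key range. Updated on flush and
# consolidation; avoids touching the parquet layout.
from collections import defaultdict

_prefix_index: dict[str, set[str]] = defaultdict(set)  # segment path -> prefixes

def record_flush(segment_path: str, keys: list[str]) -> None:
    # Called when a segment is flushed or consolidated.
    for key in keys:
        _prefix_index[segment_path].add(key.split("/")[1])

def segment_overlaps_key(segment_path: str, key: str) -> bool:
    # Like a bloom filter, "no" is definitive; a real bloom filter would
    # also allow rare false "yes" answers, which is safe for pruning.
    return key.split("/")[1] in _prefix_index[segment_path]

record_flush("seg-0001.parquet", ["/alice/job-1", "/bob/job-2"])
assert segment_overlaps_key("seg-0001.parquet", "/alice/job-9")
assert not segment_overlaps_key("seg-0001.parquet", "/zoe/job-1")
```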
Acceptance criteria
- `_segment_overlaps_key` rejects ≥90% of segments for a realistic exact-key query on the production corpus (~45 files).
- Tail p95 for `get_logs(exact_key, max_lines=1000)` stays <150 ms when no matches are in the newest segment but exist in older ones (currently ~300-700 ms with the cap-lifted early-stop path).
- No regression on mixed-write throughput or GCS offload volume.
References
- `lib/iris/src/iris/cluster/log_store/duckdb_store.py` — `_segment_overlaps_key` (lines 190-194), `_flush_sealed_buffer` sort (lines 653, 666), `_cap_segments` (lines 1011-1027).
- Outage audit: `/tmp/iris-outage/log_store_audit.md` (local).
- Benchmark: `/tmp/iris-outage/bench.py` (local).