Skip to content

Commit fa13377

Browse files
docs(adr): ADR-014 — Disaggregated Compute/Storage (Proposed)
Pin die Architektur-Entscheidungen fuer G21 vor Code-Zeile 1: - Zwei parallele storage.mode-Werte: disaggregated-wal (AutoMQ-Style, sub-10ms-Latency, WAL+S3-Offload) und disaggregated-stateless (WarpStream-Style, ~500ms-Latency, RAM-Buffer + S3-direct). Default bleibt replicated. Per-Topic opt-in, nicht cluster-wide. - Per-Partition-Manifest StreamObjectRef[] persistiert in der bestehenden Cluster-Metadata (KRaft-Backed). Kein separater Metadata-Service in v1. - Wire-Protocol asymmetrisch: Native-Clients bekommen produce-strategy Hint in MetadataResponse (3 Strategien), Kafka-wire-Clients reden weiter normal mit dem Broker, der als Relay-Agent fungiert. - replication.factor = 1 fuer disaggregated-Topics erzwungen (S3 ist die Durability-Quelle, ISR-Replikation wuerde Kosten verdoppeln). - Out of v1: Compaction in disaggregated-Mode, Transaktionen ueber disaggregated-Topics, mid-life storage.mode-Wechsel. P0 abgeschlossen, P1 (Topic-Config-Plumbing) ist next.
1 parent a8ae9cc commit fa13377

2 files changed

Lines changed: 97 additions & 0 deletions

File tree

Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,96 @@
1+
# ADR-014: Disaggregated Compute/Storage — Two Modes
2+
3+
| Status | Date |
4+
|----------|------------|
5+
| Proposed | 2026-06-11 |
6+
7+
## Context
8+
9+
Today every Surgewave topic uses the same write path: producer → broker leader → ISR replication to N-1 followers → local segment files → optional cold-tier offload to S3 via `Storage.Tiering`. The `Storage.Engine.S3` exists but is used as a primary-storage *backend* — the broker still owns the hot path, the data still flows broker-to-broker over the network for replication, and the durability story is "trust the replicas".
10+
11+
That layout is the right default for low-latency in-cluster workloads. It is the wrong default for two newer workload shapes that are now part of Surgewave's positioning:
12+
13+
1. **Cloud-cost-sensitive workloads.** Cross-AZ replication traffic dominates the bill on AWS / Azure. A 3× replication-factor topic costs roughly 3× the egress + 3× the storage of the same topic stored once on S3. Operators running analytical or audit-log pipelines do not need the latency that justifies that cost.
14+
2. **Disaggregated cloud-native deployments.** The wider broker market (WarpStream, AutoMQ, Confluent Kora, Bufstream) has converged on a pattern where the broker becomes a coordination/metadata layer and the bytes live on object storage. The durability story shifts from "ISR replicas" to "S3 already replicates internally (11×9 durability)".
15+
16+
Surgewave needs to support these workloads without losing the sub-10 ms produce-latency story it already has on the classical path. A single "disaggregated" mode cannot meet both targets — there are two distinct sweet spots and they have incompatible latency profiles:
17+
18+
- **WAL + S3-offload (AutoMQ-style).** Broker keeps a local WAL on EBS/NVMe so produce-ack stays sub-10 ms; a background job compacts the WAL into S3 stream objects on a short interval (seconds). No ISR. S3 is the durable store after offload. Embedded-friendly because the WAL works locally even without S3 configured.
19+
- **Stateless agents + S3-direct (WarpStream-style).** Producer hits a stateless agent process. Agent buffers the batch in RAM, periodically PUTs to S3, then commits the offset range to a metadata cell. No WAL, no per-partition leader, no embedded mode. Produce-P99 is dominated by the S3 PUT (~400-600 ms). Horizontal scaling is trivial because agents are interchangeable.
20+
21+
Both modes share the same two coordination requirements: a per-partition manifest mapping `[offset_lo, offset_hi]` to object-store keys, and an offset-commit protocol that admits new manifest entries atomically.
22+
23+
## Decision
24+
25+
Surgewave introduces **two parallel disaggregated storage modes**, both selected via a per-topic config property `storage.mode`. The existing replicated path stays the default and is unchanged for topics that do not opt in.
26+
27+
| `storage.mode` | Engine | Latency target | Durability source |
28+
|----------------------------|--------------------------------------------------------|----------------|-------------------|
29+
| `replicated` *(default)* | Existing local-segment + ISR path | Sub-10 ms P99 | ISR replicas |
30+
| `disaggregated-wal` | Local WAL → background flush to S3 stream objects | Sub-10 ms P99 | S3 after offload |
31+
| `disaggregated-stateless` | Stateless agent → RAM buffer → direct S3 PUT | ~400-600 ms P99 | S3 directly |
32+
33+
### Concrete consequences
34+
35+
1. **Topic-level config.** `storage.mode` is added as a topic property in `TopicConfig`. Validation rejects unknown values and combinations the broker cannot satisfy in the current cluster (e.g. `disaggregated-*` without a configured object-store endpoint). Default for new topics is `replicated`. The property is *not* alterable mid-life in v1: a topic chooses its mode at create time. (Alter-config across modes ships in a later iteration once segment-migration is built.)
36+
37+
2. **Per-partition manifest.** Both disaggregated modes maintain a manifest per `TopicPartition` consisting of an ordered list of `StreamObjectRef { ObjectKey, FirstOffset, LastOffset, BytesOnDisk, CreatedAt }`. The manifest persists in the existing cluster-metadata store (`Clustering` package, KRaft-backed today). No separate metadata service in v1 — that is a scale follow-up.
38+
39+
3. **No ISR for disaggregated topics.** Topics in `disaggregated-*` mode have `replication.factor = 1` enforced at create-time. The broker still tracks a single owning broker for coordination (offset assignment, consumer-group coordination), but partition-level replicas are explicitly forbidden — durability comes from S3, replicating again wastes money. Consumer-group state stays on the broker the same way it does today.
40+
41+
4. **Wire-protocol behaviour.** The decision is asymmetric by client capability:
42+
- **Native protocol clients** receive a per-topic "produce strategy" hint inside the `MetadataResponse`. Three strategies: `replicated` (existing path, ack via leader), `wal-via-broker` (acks-via-leader, broker writes WAL + schedules flush), `stateless-direct` (broker hands out a presigned PUT URL per batch, client uploads, commits via a follow-up `CommitWriteRequest`).
43+
- **Kafka wire clients** always look like normal Kafka producers from the outside — they cannot consume presigned URLs and we cannot extend the wire protocol. The broker accepts Produce requests for `disaggregated-stateless` topics, buffers them in-process (the broker becomes a relay agent for these clients), and follows the same S3-PUT → commit path. Cost-win is preserved (no ISR); the latency profile of `disaggregated-stateless` over Kafka-wire is identical to the native-direct path because both end up doing one S3 PUT per buffered batch.
44+
45+
5. **Failure modes.** Documented in the user-facing operations guide:
46+
- `disaggregated-wal`: a WAL-mode broker crashing before its scheduled S3 flush loses no data, because WAL recovery on restart re-flushes any pending stream object. A WAL-mode broker losing its *disk* loses the un-flushed window (typically seconds). Operators who cannot accept that window must keep `replicated` topics.
47+
- `disaggregated-stateless`: an agent crashing with un-PUT batches loses those batches. `acks=all` semantics means the client only sees the ack after the S3 PUT returned 200; so a crashed agent never reports false durability. Producers retry as normal.
48+
- Object store unreachable: produce blocks for `disaggregated-*` topics, succeeds normally for `replicated` topics. The broker surfaces this as `STORAGE_UNAVAILABLE` instead of stalling silently.
49+
50+
6. **Embedded-mode story.**
51+
- `replicated`: unchanged, in-process broker.
52+
- `disaggregated-wal`: supported. WAL is the embedded story by default; S3 offload becomes a no-op when no object store is configured. Embedded topic can be promoted later by simply attaching a bucket.
53+
- `disaggregated-stateless`: **not supported in embedded mode** — an in-process broker with no S3 reachable cannot satisfy the contract. Embedded `Build()` rejects this mode at startup with a clear error message.
54+
55+
7. **Object-store abstraction.** `Storage.Tiering.IRemoteStorageProvider` is the contract both modes use for S3-equivalent backends. The S3-AWS provider already exists; an Azure-Blob + a GCS provider land as separate ADRs/PRs (out of G21's scope but consume the same interface).
56+
57+
### What stays out of v1
58+
59+
- **Compacted disaggregated topics.** Log compaction in disaggregated mode requires rewriting stream objects, which is doable but adds significant complexity. v1 supports retention-based deletion only; producers using compaction get a config-validation error if they try to combine it with `disaggregated-*`.
60+
- **Cross-topic transactions on disaggregated topics.** Producer transactions stay restricted to `replicated` topics in v1. A transaction touching a `disaggregated-*` topic is rejected at the coordinator with `INVALID_TXN_STATE`. Lifting this requires a two-phase-commit story over S3 PUTs that is its own ADR.
61+
- **Mid-life alter-config across modes.** Once a topic is created with a mode, that mode is fixed for v1. Switching means recreate-with-mirror.
62+
63+
## Alternatives Considered
64+
65+
- **Single disaggregated mode (one or the other).**
66+
- WAL-only: keeps sub-10 ms latency, but operators with cost-first batch workloads pay for EBS volumes they do not need. Surrenders the "WarpStream-cheap" use case.
67+
- Stateless-only: maximum cost-win, but Surgewave's sub-10 ms latency story dies — embedded mode becomes impossible, and the broker positioning shifts to a slower batch-broker. Rejected.
68+
- **`storage.mode` as a cluster-wide setting instead of per-topic.** Simpler in code but defeats the actual operator need: different workloads on the same cluster want different tradeoffs. Per-topic is the conventional path (Confluent, AutoMQ, WarpStream all gate at topic-level).
69+
- **External metadata service from day one (WarpStream "cells").** Would let many stateless agents share a horizontally-scaled metadata store. We don't need that scale yet — the existing broker metadata handles it for ≥100k partitions per broker in our benchmarks. Postponed; the manifest design is forward-compatible.
70+
- **Reuse `Storage.Engine.S3` as the disaggregated backend directly.** That engine writes one S3 object per *segment-flush*, designed for the broker-owned hot path. The disaggregated modes need a different write rhythm (WAL-flush batches, stateless agent buffer-flushes) and a different read path (manifest-driven, not segment-driven). Sharing the low-level S3 client + buffer pool is appropriate; sharing the `ISurgewaveStorageEngine` implementation would force-fit two different write contracts. Disaggregated gets a new `StreamObjectStore` abstraction layered on top of the existing remote-provider.
71+
72+
## Implementation Plan
73+
74+
Each phase ships its own commits and is independently testable. The order matters: configuration plumbing must land before any write-path change.
75+
76+
- **P0 — ADR (this document).** Pin the architectural decisions for review.
77+
- **P1 — Topic-config + metadata-layer skeleton.** `storage.mode` as a topic property, schema-validated, persisted in cluster metadata, surfaced in Control UI. `StreamObjectRef` + `PartitionManifest` types under a new `Kuestenlogik.Surgewave.Storage.Disaggregated` project. No write-path change yet.
78+
- **P2 — `disaggregated-wal` mode.** WAL writer reuses the existing `FileLogSegment`; a `WalFlusher` periodically reads sealed segments, packs them into a `StreamObject` (a concatenated, indexed S3 PUT payload), uploads via `IRemoteStorageProvider`, and appends the resulting `StreamObjectRef` to the partition manifest. Read path becomes: serve from WAL until the requested offset is flushed, then serve from the manifest's stream objects.
79+
- **P3 — `disaggregated-stateless` mode.** Agent-style write path: incoming Produce batches buffer in a `StatelessAgentBuffer`, the buffer flushes on size/time threshold by PUTting to S3 and atomically appending to the manifest. No WAL. Reads are manifest-driven only.
80+
- **P4 — Native-client integration.** `SurgewaveNativeClient` learns the three produce strategies via `MetadataResponse`; the client picks the right one per partition. Pre-existing `replicated` clients see no change.
81+
- **P5 — Kafka-wire relay fallback.** Broker accepts `Produce` for disaggregated topics from classical Kafka clients, runs the same buffer→PUT→commit flow internally. From the client's POV the API is unchanged.
82+
- **P6 — Tests + cost-bench + docs.** Integration tests for each mode (failure injection: agent crash, S3 timeout, mid-flush kill), a cost-comparison run in the public benchmark suite (`disaggregated-*` vs. `replicated` over identical workload, with $/GB-month + cross-AZ-bytes numbers), `docs/storage/disaggregated.md` operator guide.
83+
84+
## Consequences
85+
86+
- Operators get an explicit cost/latency knob at topic granularity. The default behaviour (sub-10 ms latency, full ISR durability) does not change for any existing deployment.
87+
- The broker grows two new internal write paths. The classical path stays the simplest and fastest; disaggregated paths add complexity but are isolated behind the `storage.mode` switch.
88+
- Surgewave's positioning explicitly covers three workload archetypes after this lands: real-time (replicated), real-time-cheap (disaggregated-wal), batch-cheap (disaggregated-stateless). Documentation must guide operators to the right knob, not present three equivalent options.
89+
- Future enterprise extensions (replication geo-fence, schema-aware compaction) need to grow disaggregated-aware variants. v1 enforces "disaggregated topics participate in fewer enterprise features"; that boundary should narrow over later releases, not widen.
90+
91+
## See also
92+
93+
- [G21 — Disaggregated compute/storage mode](https://github.com/Kuestenlogik/Surgewave/issues/11) — the roadmap entry this ADR implements.
94+
- ADR-001 — Pluggable Storage Engine Abstraction (the `ISurgewaveStorageEngine` contract this builds *alongside*, not *on top of*).
95+
- ADR-012 — Zero-Copy & High-Performance Patterns (the WAL flusher's stream-object packer must keep the existing zero-copy invariants on the read path).
96+
- `Kuestenlogik.Surgewave.Storage.Tiering.IRemoteStorageProvider` — object-store contract reused by both new modes.

docs/adr/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,4 @@ ADRs document important architectural decisions, their context, and consequences
2121
| [ADR-011](011-community-enterprise-split.md) | Community/Enterprise Repository Split | Accepted | 2026-04 |
2222
| [ADR-012](012-zero-copy-high-performance.md) | Zero-Copy & High-Performance Patterns | Accepted | 2026-04 |
2323
| [ADR-013](013-control-ui-plugin-first.md) | Control UI Plugin-First Architecture | Accepted | 2026-05 |
24+
| [ADR-014](014-disaggregated-storage.md) | Disaggregated Compute/Storage — Two Modes | Proposed | 2026-06 |

0 commit comments

Comments
 (0)