Description
Problem / Motivation
We currently lack a reliable way to measure replication + availability latency across canonical nodes: specifically, the time between when a write is confirmed by one node and when that same payload becomes observable via stream on the rest of the canonical node set.
This latency is important because it directly impacts:
- perceived end-to-end UX (a message appears “sent” but is not yet visible elsewhere),
- cross-node consistency expectations,
- debugging incidents where some nodes lag or fail to replicate,
- regression detection when node releases or infra changes affect propagation.
Right now, we can infer pieces of this latency indirectly (logs, manual testing, tracing), but we cannot continuously measure it in a standardized and automated way across environments, nor do we have a tool that can run as a containerized probe and export time series metrics to Grafana/Prometheus.
Proposed Utility: “Write-to-Stream Propagation Latency Probe”
We need a utility tool (either part of the CLI or standalone) that can run as a test container and produce automated measurements for Grafana.
Core Measurement Definition
Write-to-Stream Propagation Latency (per canonical node)
The time between the write being acknowledged/confirmed by the writer node and the message payload being observable on each canonical node’s stream for the topic.
This must be measured per canonical node, since replication/availability is not uniform and node-specific regressions matter.
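As a minimal sketch of this definition, the per-node latency is simply the difference between two timestamps taken on the probe. The function name and the convention of using `None` for nodes that never showed the payload are assumptions made for illustration, not part of any existing tool. Both timestamps are assumed to come from the same monotonic clock on the probe (for example `time.monotonic()`).

```python
from typing import Dict, Optional


def propagation_latencies(
    write_confirmed_at: float,
    received_at_by_node: Dict[str, Optional[float]],
) -> Dict[str, Optional[float]]:
    """Return write-to-stream latency in seconds for each canonical node.

    A value of None means the payload was never observed on that node's
    stream (for example, the probe timed out waiting for it).
    """
    latencies: Dict[str, Optional[float]] = {}
    for node_id, received_at in received_at_by_node.items():
        if received_at is None:
            # Report a missing observation explicitly instead of as zero.
            latencies[node_id] = None
        else:
            latencies[node_id] = received_at - write_confirmed_at
    return latencies
```

A value can come out negative if a stream delivers the payload before the write acknowledgement reaches the probe. It is probably better to keep such values than to clamp them, since they indicate how much the acknowledgement itself lags.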
High-Level Flow
The tool should run the following steps either repeatedly (in a loop or at a configured interval) or once to completion, after which it exits:
- Fetch canonical nodes
- Pick a deterministic topic
  - Must be predictable and consistent across runs for easy correlation, indexing, and SQL lookup.
  - Example: latency_probe/// or a fixed topic with a unique subtopic key.
  - The goal is repeatable behavior and stable observability.
- Open stream subscriptions on all canonical nodes
  - For each canonical node, open a stream for the same topic.
  - Ensure the stream is “ready” before proceeding (so we don’t measure subscription startup latency).
  - We want to isolate propagation latency, not client setup latency.
- Write a message to any node on the topic
  - The writer node can be selected deterministically or randomly.
  - The write must return a confirmation we can treat as the “write confirmed” timestamp boundary.
- Record time-to-observe per node
  - For each stream, record the time delta between write_confirmed_at and payload_received_at_on_node_X.
  - Some useful things to store:
    - per-node latency,
    - writer node identity,
    - topic,
    - message id or unique marker.
- Expose measurements
  - Export metrics in a Prometheus-friendly format (and/or pushgateway).
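The steps above can be sketched roughly as follows. Everything here is hypothetical: the `NodeClient` interface, the `FakeNode` in-memory stand-in, and the metric name `probe_propagation_latency_seconds` are invented for illustration and do not correspond to an existing SDK or CLI. The sketch polls for the payload, which limits timing resolution; a real probe would more likely take the timestamp in the stream callback.

```python
import time
from typing import Callable, Dict, List, Optional, Protocol


class NodeClient(Protocol):
    """Hypothetical client interface (an assumption, not a real API)."""
    node_id: str
    def subscribe(self, topic: str) -> None: ...   # returns once the stream is ready
    def publish(self, topic: str, payload: bytes) -> None: ...  # returns on confirmation
    def poll(self, topic: str) -> List[bytes]: ...  # payloads seen on the stream so far


class FakeNode:
    """In-memory stand-in so the sketch runs without a network."""
    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.peers: List["FakeNode"] = []
        self.inbox: Dict[str, List[bytes]] = {}

    def subscribe(self, topic: str) -> None:
        self.inbox.setdefault(topic, [])

    def publish(self, topic: str, payload: bytes) -> None:
        for peer in self.peers:  # instant "replication" to subscribed peers
            if topic in peer.inbox:
                peer.inbox[topic].append(payload)

    def poll(self, topic: str) -> List[bytes]:
        return list(self.inbox.get(topic, []))


def run_once(nodes: List[NodeClient], writer: NodeClient, topic: str, marker: bytes,
             timeout_s: float = 5.0,
             clock: Callable[[], float] = time.monotonic) -> Dict[str, Optional[float]]:
    """Run one probe iteration and return per-node latency in seconds."""
    for node in nodes:
        node.subscribe(topic)  # all streams are ready before the write
    writer.publish(topic, marker)
    confirmed_at = clock()  # the "write confirmed" boundary
    latencies: Dict[str, Optional[float]] = {n.node_id: None for n in nodes}
    deadline = confirmed_at + timeout_s
    while True:
        for node in nodes:
            if latencies[node.node_id] is None and marker in node.poll(topic):
                latencies[node.node_id] = clock() - confirmed_at
        if None not in latencies.values() or clock() >= deadline:
            return latencies  # nodes still None never showed the payload
        time.sleep(0.001)


def render_prometheus(latencies: Dict[str, Optional[float]],
                      writer_id: str, topic: str) -> str:
    """Render results in the Prometheus text exposition format.

    Assumes label values contain no quotes, backslashes, or newlines.
    """
    name = "probe_propagation_latency_seconds"  # hypothetical metric name
    lines = [f"# TYPE {name} gauge"]
    for node_id, value in sorted(latencies.items()):
        if value is None:
            continue  # unobserved nodes are omitted from this gauge
        labels = f'node="{node_id}",writer="{writer_id}",topic="{topic}"'
        lines.append(f"{name}{{{labels}}} {value:.6f}")
    return "\n".join(lines) + "\n"
```

One possible usage, with two fake nodes that replicate to each other instantly:

```python
a, b = FakeNode("a"), FakeNode("b")
a.peers = b.peers = [a, b]
result = run_once([a, b], writer=a, topic="example-topic", marker=b"probe-1")
print(render_prometheus(result, writer_id="a", topic="example-topic"))
```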
Why the Probe Must Use Its Own Payer
The tool should operate as its own payer (and not share payer identity or payer logic with other systems) to avoid measurement instability and false alarms.
Network latency corrections
It might be useful to automatically correct the results by removing the probe<>node network latency from each measurement, since we are predominantly interested in estimating the replication lag between nodes.
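One way such a correction could work, sketched under stated assumptions: suppose the probe also measures a round-trip time to each node and treats half of it as the one-way latency (this assumes a symmetric path, which may not hold). If the writer confirms at time t_c and node X has the payload at time t_x, the probe sees the acknowledgement at t_c + d_w and the payload at t_x + d_x, where d_w and d_x are the one-way latencies to the writer and to node X. The observed delta is therefore (t_x - t_c) + d_x - d_w, so the replication lag can be estimated as observed - d_x + d_w. The function names below are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional


def one_way_estimate(rtt_samples_s: List[float]) -> float:
    """Estimate one-way probe<>node latency as half the median RTT.

    Assumption: the network path is symmetric.
    """
    return median(rtt_samples_s) / 2.0


def corrected_lag(
    observed_s: Dict[str, Optional[float]],
    one_way_s: Dict[str, float],
    writer_id: str,
) -> Dict[str, Optional[float]]:
    """Estimate node-to-node replication lag from probe-side observations.

    corrected = observed - d_node + d_writer
    """
    d_writer = one_way_s[writer_id]
    corrected: Dict[str, Optional[float]] = {}
    for node_id, observed in observed_s.items():
        if observed is None:
            corrected[node_id] = None  # nothing to correct for unobserved nodes
        else:
            corrected[node_id] = observed - one_way_s[node_id] + d_writer
    return corrected
```

For the writer node itself the two terms cancel, so its value is unchanged. It would likely be safest to export both the raw and the corrected values, since the correction is only as good as the RTT estimate.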