
Commit 3581839

Phase 5: Observability stack — spanmetrics, dashboards, runbook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6e75849 commit 3581839

20 files changed: +1983 -7 lines changed

OpenTelemetryPlan/Phase2_taskList.md

Lines changed: 33 additions & 0 deletions
@@ -185,3 +185,36 @@
| 2.6 | Build verification and performance baseline | 0 | 0 | 2.1-2.5 |

**Parallel work**: Tasks 2.1, 2.2, 2.3 can run in parallel. Task 2.4 depends on 2.3. Task 2.5 can run in parallel with 2.4. Task 2.6 depends on all others.

---

## Known Issues / Future Work

### Thread safety of TelemetryImpl::stop() vs startSpan()

`TelemetryImpl::stop()` resets `sdkProvider_` (a `std::shared_ptr`) without
synchronization. `getTracer()` reads the same member from RPC handler threads.
This is a data race if any thread calls `startSpan()` concurrently with `stop()`.

**Current mitigation**: `Application::stop()` shuts down `serverHandler_`,
`overlay_`, and `jobQueue_` before calling `telemetry_->stop()`, so no callers
remain. See comments in `Telemetry.cpp:stop()` and `Application.cpp`.

**TODO**: Add an `std::atomic<bool> stopped_` flag checked in `getTracer()` to
make this robust against future shutdown order changes.
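
A minimal sketch of the TODO above, not the actual `TelemetryImpl` code. `TelemetrySketch`, `FakeProvider`, and `getProvider()` are hypothetical stand-ins for the real class, SDK provider, and `getTracer()`. Note that a `stopped_` flag on its own still leaves a window between the check and the read of the `std::shared_ptr`, so this sketch also guards the pointer with a `std::mutex`. The mutex is an assumption added here and is not part of the TODO.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the SDK tracer provider.
struct FakeProvider
{
    std::string name = "sdk";
};

// Hypothetical stand-in for TelemetryImpl; only the shutdown path is modeled.
class TelemetrySketch
{
    mutable std::mutex mutex_;
    std::shared_ptr<FakeProvider> sdkProvider_ =
        std::make_shared<FakeProvider>();
    std::atomic<bool> stopped_{false};

public:
    // Returns nullptr once stop() has begun, so callers skip span creation.
    std::shared_ptr<FakeProvider>
    getProvider() const
    {
        // Fast path: after stop(), callers never touch sdkProvider_.
        if (stopped_.load(std::memory_order_acquire))
            return nullptr;
        // Slow path: copy the pointer under the lock. The copy may be null
        // if stop() ran between the flag check and the lock.
        std::lock_guard<std::mutex> lock(mutex_);
        return sdkProvider_;
    }

    void
    stop()
    {
        stopped_.store(true, std::memory_order_release);
        std::lock_guard<std::mutex> lock(mutex_);
        sdkProvider_.reset();
    }
};
```

Callers would treat a null result as "tracing disabled" and return a no-op span. A caller that copied the pointer before `stop()` keeps the provider alive until its copy goes out of scope.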

### Macro incompatibility: XRPL_TRACE_SPAN vs XRPL_TRACE_SET_ATTR

`XRPL_TRACE_SPAN` and `XRPL_TRACE_SPAN_KIND` declare `_xrpl_guard_` as a bare
`SpanGuard`, but `XRPL_TRACE_SET_ATTR` and `XRPL_TRACE_EXCEPTION` call
`_xrpl_guard_.has_value()` which requires `std::optional<SpanGuard>`. Using
`XRPL_TRACE_SPAN` followed by `XRPL_TRACE_SET_ATTR` in the same scope would
fail to compile.

**Current mitigation**: No call site currently uses `XRPL_TRACE_SPAN` — all
production code uses the conditional macros (`XRPL_TRACE_RPC`, `XRPL_TRACE_TX`,
etc.) which correctly wrap the guard in `std::optional`.

**TODO**: Either make `XRPL_TRACE_SPAN`/`XRPL_TRACE_SPAN_KIND` also wrap in
`std::optional`, or document that `XRPL_TRACE_SET_ATTR` is only compatible with
the conditional macros.
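
The first TODO option could look roughly like the sketch below. All `SKETCH_*` macros and `SpanGuardSketch` are hypothetical stand-ins, since the real macro bodies and `SpanGuard` interface are not shown here. The point is only that once the unconditional macro also declares `_xrpl_guard_` as a `std::optional`, the attribute macro compiles after either form.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real SpanGuard.
struct SpanGuardSketch
{
    std::string name;
    std::vector<std::pair<std::string, std::string>> attrs;

    void
    setAttribute(std::string key, std::string value)
    {
        attrs.emplace_back(std::move(key), std::move(value));
    }
};

// Unconditional span: always engaged, but declared as an optional so the
// attribute macro below can call has_value() on it.
#define SKETCH_TRACE_SPAN(name) \
    std::optional<SpanGuardSketch> _xrpl_guard_{SpanGuardSketch{name, {}}}

// Conditional span: engaged only when tracing is enabled.
#define SKETCH_TRACE_SPAN_IF(enabled, name) \
    std::optional<SpanGuardSketch> _xrpl_guard_; \
    if (enabled) \
    _xrpl_guard_.emplace(SpanGuardSketch{name, {}})

// Works after either macro above because both declare an optional.
#define SKETCH_TRACE_SET_ATTR(key, value) \
    do \
    { \
        if (_xrpl_guard_.has_value()) \
            _xrpl_guard_->setAttribute(key, value); \
    } while (false)

inline std::size_t
attrCountUnconditional()
{
    SKETCH_TRACE_SPAN("rpc.request");
    SKETCH_TRACE_SET_ATTR("xrpl.rpc.command", "server_info");
    return _xrpl_guard_->attrs.size();
}

inline std::size_t
attrCountConditional(bool enabled)
{
    SKETCH_TRACE_SPAN_IF(enabled, "rpc.request");
    SKETCH_TRACE_SET_ATTR("xrpl.rpc.command", "server_info");
    return _xrpl_guard_ ? _xrpl_guard_->attrs.size() : 0;
}
```

The trade-off is one extra `bool` per unconditional guard in exchange for a single attribute macro that is safe in every scope.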

OpenTelemetryPlan/Phase3_taskList.md

Lines changed: 25 additions & 0 deletions
@@ -236,3 +236,28 @@
- [ ] Trace context in Protocol Buffer messages
- [ ] HashRouter deduplication visible in traces
- [ ] <5% overhead on transaction throughput

---

## Known Issues / Future Work

### Propagation utilities not yet wired into P2P flow

`extractFromProtobuf()` and `injectToProtobuf()` in `TraceContextPropagator.h`
are implemented and tested but not called from production code. To enable
cross-node distributed traces:

- Call `injectToProtobuf()` in `PeerImp` when sending `TMTransaction` /
  `TMProposeSet` messages
- Call `extractFromProtobuf()` in the corresponding message handlers to
  reconstruct the parent span context, then pass it to `startSpan()` as the
  parent

This was deferred to validate single-node tracing performance first.
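
The inject/extract round trip described above can be sketched as follows. This is an illustration under assumptions, not the code in `TraceContextPropagator.h`: `TraceContextMsg`, `SpanContextSketch`, `injectSketch()`, and `extractSketch()` are hypothetical, and the message field names other than the existence of a trace context are guesses. The 16-byte trace ID and 8-byte span ID sizes follow the W3C Trace Context format.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the protobuf trace context message.
struct TraceContextMsg
{
    std::string trace_id;  // 16 raw bytes when set
    std::string span_id;   // 8 raw bytes when set
    std::uint32_t trace_flags = 0;
};

// Hypothetical in-process span context.
struct SpanContextSketch
{
    std::array<std::uint8_t, 16> traceId{};
    std::array<std::uint8_t, 8> spanId{};
    std::uint8_t flags = 0;
};

// Sender side: copy the active span context into the outgoing message.
inline void
injectSketch(SpanContextSketch const& ctx, TraceContextMsg& msg)
{
    msg.trace_id.assign(ctx.traceId.begin(), ctx.traceId.end());
    msg.span_id.assign(ctx.spanId.begin(), ctx.spanId.end());
    msg.trace_flags = ctx.flags;
}

// Receiver side: rebuild the parent context, or return nullopt when the
// peer sent no (or a malformed) trace context.
inline std::optional<SpanContextSketch>
extractSketch(TraceContextMsg const& msg)
{
    if (msg.trace_id.size() != 16 || msg.span_id.size() != 8)
        return std::nullopt;
    SpanContextSketch ctx;
    std::copy(msg.trace_id.begin(), msg.trace_id.end(), ctx.traceId.begin());
    std::copy(msg.span_id.begin(), msg.span_id.end(), ctx.spanId.begin());
    ctx.flags = static_cast<std::uint8_t>(msg.trace_flags);
    return ctx;
}
```

On the receiving node, a successfully extracted context would be passed to `startSpan()` as the parent. A `nullopt` result would start a new root span, so messages from peers without telemetry still trace locally.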
257+
258+
### Unused trace_state proto field
259+
260+
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for
261+
W3C `tracestate` vendor-specific key-value pairs but is not read or written by
262+
`TraceContextPropagator`. Wire it when cross-vendor trace propagation is needed.
263+
No wire cost since proto `optional` fields are zero-cost when absent.
Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
# Phase 5: Integration Test Task List

> **Goal**: End-to-end verification of the complete telemetry pipeline using a
> 6-node consensus network. Proves that RPC, transaction, and consensus spans
> flow through the observability stack (otel-collector, Jaeger, Prometheus,
> Grafana) under realistic conditions.
>
> **Scope**: Integration test script, manual testing plan, 6-node local network
> setup, Jaeger/Prometheus/Grafana verification.
>
> **Branch**: `pratik/otel-phase5-docs-deployment`

### Related Plan Documents

| Document                                                         | Relevance                                  |
| ---------------------------------------------------------------- | ------------------------------------------ |
| [07-observability-backends.md](./07-observability-backends.md)   | Jaeger, Grafana, Prometheus setup          |
| [05-configuration-reference.md](./05-configuration-reference.md) | Collector config, Docker Compose           |
| [06-implementation-phases.md](./06-implementation-phases.md)     | Phase 5 tasks, definition of done          |
| [Phase5_taskList.md](./Phase5_taskList.md)                       | Phase 5 main task list (5.6 = integration) |

---

## Task IT.1: Create Integration Test Script

**Objective**: Automated bash script that stands up a 6-node xrpld network
with telemetry, exercises all span categories, and verifies data in
Jaeger/Prometheus.

**What to do**:

- Create `docker/telemetry/integration-test.sh`:
  - Prerequisites check (docker, xrpld binary, curl, jq)
  - Start observability stack via `docker compose`
  - Generate 6 validator key pairs via temp standalone xrpld
  - Generate 6 node configs + shared `validators.txt`
  - Start 6 xrpld nodes in consensus mode (`--start`, no `-a`)
  - Wait for all nodes to reach `"proposing"` state (120s timeout)

**Key new file**: `docker/telemetry/integration-test.sh`

**Verification**:

- [ ] Script starts without errors
- [ ] All 6 nodes reach "proposing" state
- [ ] Observability stack is healthy (otel-collector, Jaeger, Prometheus, Grafana)

---

## Task IT.2: RPC Span Verification (Phase 2)

**Objective**: Verify RPC spans flow through the telemetry pipeline.

**What to do**:

- Send `server_info`, `server_state`, `ledger` RPCs to node1 (port 5005)
- Wait for batch export (5s)
- Query Jaeger API for:
  - `rpc.request` spans (ServerHandler::onRequest)
  - `rpc.process` spans (ServerHandler::processRequest)
  - `rpc.command.server_info` spans (callMethod)
  - `rpc.command.server_state` spans (callMethod)
  - `rpc.command.ledger` spans (callMethod)
- Verify `xrpl.rpc.command` attribute present on `rpc.command.*` spans

**Verification**:

- [ ] Jaeger shows `rpc.request` traces
- [ ] Jaeger shows `rpc.process` traces
- [ ] Jaeger shows `rpc.command.*` traces with correct attributes

---

## Task IT.3: Transaction Span Verification (Phase 3)

**Objective**: Verify transaction spans flow through the telemetry pipeline.

**What to do**:

- Get genesis account sequence via `account_info` RPC
- Submit Payment transaction using genesis seed (`snoPBrXtMeMyMHUVTgbuqAfg1SUTb`)
- Wait for consensus inclusion (10s)
- Query Jaeger API for:
  - `tx.process` spans (NetworkOPsImp::processTransaction) on submitting node
  - `tx.receive` spans (PeerImp::handleTransaction) on peer nodes
- Verify `xrpl.tx.hash` attribute on `tx.process` spans
- Verify `xrpl.peer.id` attribute on `tx.receive` spans

**Verification**:

- [ ] Jaeger shows `tx.process` traces with `xrpl.tx.hash`
- [ ] Jaeger shows `tx.receive` traces with `xrpl.peer.id`

---

## Task IT.4: Consensus Span Verification (Phase 4)

**Objective**: Verify consensus spans flow through the telemetry pipeline.

**What to do**:

- Consensus runs automatically in 6-node network
- Query Jaeger API for:
  - `consensus.proposal.send` (Adaptor::propose)
  - `consensus.ledger_close` (Adaptor::onClose)
  - `consensus.accept` (Adaptor::onAccept)
  - `consensus.validation.send` (Adaptor::validate)
- Verify attributes:
  - `xrpl.consensus.mode` on `consensus.ledger_close`
  - `xrpl.consensus.proposers` on `consensus.accept`
  - `xrpl.consensus.ledger.seq` on `consensus.validation.send`

**Verification**:

- [ ] Jaeger shows `consensus.ledger_close` traces with `xrpl.consensus.mode`
- [ ] Jaeger shows `consensus.accept` traces with `xrpl.consensus.proposers`
- [ ] Jaeger shows `consensus.proposal.send` traces
- [ ] Jaeger shows `consensus.validation.send` traces

---

## Task IT.5: Spanmetrics Verification (Phase 5)

**Objective**: Verify spanmetrics connector derives RED metrics from spans.

**What to do**:

- Query Prometheus for `traces_span_metrics_calls_total`
- Query Prometheus for `traces_span_metrics_duration_milliseconds_count`
- Verify Grafana loads at `http://localhost:3000`

**Verification**:

- [ ] Prometheus returns non-empty results for `traces_span_metrics_calls_total`
- [ ] Prometheus returns non-empty results for duration histogram
- [ ] Grafana UI accessible with dashboards visible

---

## Task IT.6: Manual Testing Plan

**Objective**: Document how to run tests manually for future reference.

**What to do**:

- Create `docker/telemetry/TESTING.md` with:
  - Prerequisites section
  - Single-node standalone test (quick verification)
  - 6-node consensus test (full verification)
  - Expected span catalog (all 11 span names with attributes)
  - Verification queries (Jaeger API, Prometheus API)
  - Troubleshooting guide

**Key new file**: `docker/telemetry/TESTING.md`

**Verification**:

- [ ] Document covers both single-node and multi-node testing
- [ ] All 11 span names documented with source file and attributes
- [ ] Troubleshooting section covers common failure modes

---

## Task IT.7: Run and Verify

**Objective**: Execute the integration test and validate results.

**What to do**:

- Run `docker/telemetry/integration-test.sh` locally
- Debug any failures
- Leave stack running for manual verification
- Share URLs:
  - Jaeger: `http://localhost:16686`
  - Grafana: `http://localhost:3000`
  - Prometheus: `http://localhost:9090`

**Verification**:

- [ ] Script completes with all checks passing
- [ ] Jaeger UI shows rippled service with all expected span names
- [ ] Grafana dashboards load and show data

---

## Task IT.8: Commit

**Objective**: Commit all new files to Phase 5 branch.

**What to do**:

- Run `pcc` (pre-commit checks)
- Commit 3 new files to `pratik/otel-phase5-docs-deployment`

**Verification**:

- [ ] `pcc` passes
- [ ] Commit created on Phase 5 branch

---

## Summary

| Task | Description                   | New Files | Depends On |
| ---- | ----------------------------- | --------- | ---------- |
| IT.1 | Integration test script       | 1         | Phase 5    |
| IT.2 | RPC span verification         | 0         | IT.1       |
| IT.3 | Transaction span verification | 0         | IT.1       |
| IT.4 | Consensus span verification   | 0         | IT.1       |
| IT.5 | Spanmetrics verification      | 0         | IT.1       |
| IT.6 | Manual testing plan           | 1         | --         |
| IT.7 | Run and verify                | 0         | IT.1-IT.6  |
| IT.8 | Commit                        | 0         | IT.7       |

**Exit Criteria**:

- [ ] All 6 xrpld nodes reach "proposing" state
- [ ] All 11 expected span names visible in Jaeger
- [ ] Spanmetrics available in Prometheus
- [ ] Grafana dashboards show data
- [ ] Manual testing plan document complete

cspell.config.yaml

Lines changed: 8 additions & 4 deletions
@@ -99,6 +99,7 @@ words:
  - doxyfile
  - dxrpl
  - endmacro
+ - EOCFG
  - exceptioned
  - Falco
  - finalizers
@@ -193,6 +194,8 @@ words:
  - permdex
  - perminute
  - permissioned
+ - pgrep
+ - pkill
  - pointee
  - populator
  - pratik
@@ -210,6 +213,7 @@ words:
  - queuable
  - Raphson
  - replayer
+ - reqps
  - rerere
  - retriable
  - RIPD
@@ -304,6 +308,10 @@ words:
  - xchain
  - ximinez
  - EXPECT_STREQ
+ - Gantt
+ - gantt
+ - otelc
+ - traceql
  - XMACRO
  - xrpkuwait
  - xrpl
@@ -312,8 +320,4 @@ words:
  - xxhash
  - xxhasher
  - xychart
- - otelc
  - zpages
- - traceql
- - Gantt
- - gantt
