Skip to content

Commit 2391149

Browse files
Add TODO comments for known issues, update task lists + cspell
Document known issues with inline TODO comments: - Thread safety of TelemetryImpl::stop() vs startSpan() in Telemetry.cpp and Application.cpp shutdown sequence - Unwired propagation utilities in TraceContextPropagator.h - Unused trace_state proto field in xrpl.proto Add "Known Issues / Future Work" sections to Phase 2 and Phase 3 task lists covering: thread safety, macro incompatibility between XRPL_TRACE_SPAN and XRPL_TRACE_SET_ATTR, unwired P2P propagation, and unused trace_state field. Add "reqps" to cspell dictionary and sort trailing entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 84f38b6 commit 2391149

File tree

7 files changed

+85
-4
lines changed

7 files changed

+85
-4
lines changed

OpenTelemetryPlan/Phase2_taskList.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -185,3 +185,36 @@
185185
| 2.6 | Build verification and performance baseline | 0 | 0 | 2.1-2.5 |
186186

187187
**Parallel work**: Tasks 2.1, 2.2, 2.3 can run in parallel. Task 2.4 depends on 2.3. Task 2.5 can run in parallel with 2.4. Task 2.6 depends on all others.
188+
189+
---
190+
191+
## Known Issues / Future Work
192+
193+
### Thread safety of TelemetryImpl::stop() vs startSpan()
194+
195+
`TelemetryImpl::stop()` resets `sdkProvider_` (a `std::shared_ptr`) without
196+
synchronization. `getTracer()` reads the same member from RPC handler threads.
197+
This is a data race if any thread calls `startSpan()` concurrently with `stop()`.
198+
199+
**Current mitigation**: `Application::stop()` shuts down `serverHandler_`,
200+
`overlay_`, and `jobQueue_` before calling `telemetry_->stop()`, so no callers
201+
remain. See comments in `Telemetry.cpp:stop()` and `Application.cpp`.
202+
203+
**TODO**: Add an `std::atomic<bool> stopped_` flag checked in `getTracer()` to
204+
make this robust against future shutdown order changes.
205+
206+
### Macro incompatibility: XRPL_TRACE_SPAN vs XRPL_TRACE_SET_ATTR
207+
208+
`XRPL_TRACE_SPAN` and `XRPL_TRACE_SPAN_KIND` declare `_xrpl_guard_` as a bare
209+
`SpanGuard`, but `XRPL_TRACE_SET_ATTR` and `XRPL_TRACE_EXCEPTION` call
210+
`_xrpl_guard_.has_value()` which requires `std::optional<SpanGuard>`. Using
211+
`XRPL_TRACE_SPAN` followed by `XRPL_TRACE_SET_ATTR` in the same scope would
212+
fail to compile.
213+
214+
**Current mitigation**: No call site currently uses `XRPL_TRACE_SPAN` — all
215+
production code uses the conditional macros (`XRPL_TRACE_RPC`, `XRPL_TRACE_TX`,
216+
etc.) which correctly wrap the guard in `std::optional`.
217+
218+
**TODO**: Either make `XRPL_TRACE_SPAN`/`XRPL_TRACE_SPAN_KIND` also wrap in
219+
`std::optional`, or document that `XRPL_TRACE_SET_ATTR` is only compatible with
220+
the conditional macros.

OpenTelemetryPlan/Phase3_taskList.md

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -236,3 +236,28 @@
236236
- [ ] Trace context in Protocol Buffer messages
237237
- [ ] HashRouter deduplication visible in traces
238238
- [ ] <5% overhead on transaction throughput
239+
240+
---
241+
242+
## Known Issues / Future Work
243+
244+
### Propagation utilities not yet wired into P2P flow
245+
246+
`extractFromProtobuf()` and `injectToProtobuf()` in `TraceContextPropagator.h`
247+
are implemented and tested but not called from production code. To enable
248+
cross-node distributed traces:
249+
250+
- Call `injectToProtobuf()` in `PeerImp` when sending `TMTransaction` /
251+
`TMProposeSet` messages
252+
- Call `extractFromProtobuf()` in the corresponding message handlers to
253+
reconstruct the parent span context, then pass it to `startSpan()` as the
254+
parent
255+
256+
This was deferred to validate single-node tracing performance first.
257+
258+
### Unused trace_state proto field
259+
260+
The `TraceContext.trace_state` field (field 4) in `xrpl.proto` is reserved for
261+
W3C `tracestate` vendor-specific key-value pairs but is not read or written by
262+
`TraceContextPropagator`. Wire it when cross-vendor trace propagation is needed.
263+
No wire cost since proto `optional` fields are zero-cost when absent.

cspell.config.yaml

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -210,6 +210,7 @@ words:
210210
- queuable
211211
- Raphson
212212
- replayer
213+
- reqps
213214
- rerere
214215
- retriable
215216
- RIPD
@@ -304,6 +305,10 @@ words:
304305
- xchain
305306
- ximinez
306307
- EXPECT_STREQ
308+
- Gantt
309+
- gantt
310+
- otelc
311+
- traceql
307312
- XMACRO
308313
- xrpkuwait
309314
- xrpl
@@ -312,8 +317,4 @@ words:
312317
- xxhash
313318
- xxhasher
314319
- xychart
315-
- otelc
316320
- zpages
317-
- traceql
318-
- Gantt
319-
- gantt

include/xrpl/proto/xrpl.proto

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,10 @@ message TraceContext {
9191
optional bytes trace_id = 1; // 16-byte trace identifier
9292
optional bytes span_id = 2; // 8-byte parent span identifier
9393
optional uint32 trace_flags = 3; // bit 0 = sampled
94+
// TODO: trace_state is reserved for W3C tracestate vendor-specific
95+
// key-value pairs but is not yet read or written by
96+
// TraceContextPropagator. Wire it when cross-vendor trace
97+
// propagation is needed.
9498
optional string trace_state = 4; // W3C tracestate header value
9599
}
96100

include/xrpl/telemetry/TraceContextPropagator.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,13 @@
66
Protocol Buffer TraceContext messages (P2P cross-node propagation).
77
88
Only compiled when XRPL_ENABLE_TELEMETRY is defined.
9+
10+
TODO: These utilities are not yet wired into the P2P message flow.
11+
To enable cross-node distributed traces, call injectToProtobuf() in
12+
PeerImp when sending TMTransaction/TMProposeSet messages, and call
13+
extractFromProtobuf() in the corresponding message handlers to
14+
reconstruct the parent span context before starting a child span.
15+
This was deferred to validate single-node tracing performance first.
916
*/
1017

1118
#ifdef XRPL_ENABLE_TELEMETRY

src/libxrpl/telemetry/Telemetry.cpp

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -200,6 +200,13 @@ class TelemetryImpl : public Telemetry
200200
{
201201
// Force flush before shutdown
202202
sdkProvider_->ForceFlush();
203+
// TODO: sdkProvider_ is not thread-safe. This reset() races with
204+
// getTracer() if any thread is still calling startSpan().
205+
// Currently safe because Application::stop() shuts down
206+
// serverHandler_, overlay_, and jobQueue_ before calling
207+
// telemetry_->stop() — so no callers should remain. If the
208+
// shutdown order ever changes, add an std::atomic<bool> stopped_
209+
// flag checked in getTracer() to make this robust.
203210
sdkProvider_.reset();
204211
trace_api::Provider::SetTracerProvider(
205212
opentelemetry::nostd::shared_ptr<trace_api::TracerProvider>(

src/xrpld/app/main/Application.cpp

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1570,6 +1570,10 @@ ApplicationImp::run()
15701570
ledgerCleaner_->stop();
15711571
m_nodeStore->stop();
15721572
perfLog_->stop();
1573+
// Telemetry must stop last among trace-producing components.
1574+
// serverHandler_, overlay_, and jobQueue_ are already stopped above,
1575+
// so no threads should be calling startSpan() at this point.
1576+
// See TODO in TelemetryImpl::stop() re: thread-safety of sdkProvider_.
15731577
telemetry_->stop();
15741578

15751579
JLOG(m_journal.info()) << "Done.";

0 commit comments

Comments
 (0)