
Commit 2c0b65a

Phase 6: Integrate beast::insight StatsD metrics into telemetry pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3581839 commit 2c0b65a

17 files changed: +2542 −73 lines

OpenTelemetryPlan/06-implementation-phases.md

Lines changed: 91 additions & 18 deletions
@@ -182,7 +182,80 @@ gantt

---

-## 6.7 Risk Assessment
+## 6.7 Phase 6: StatsD Metrics Integration (Week 10)

**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.

### Background

rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.
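The StatsD line protocol mentioned above is simple enough to sketch by hand. A minimal shell illustration of the datagram shape — the metric name is real, but the value and the exact dotted form `beast::insight` emits are illustrative assumptions:

```shell
# Each StatsD UDP datagram is "<prefix>.<name>:<value>|<type>",
# where type is g (gauge), c (counter), or ms (timer).
m="rippled.LedgerMaster.Validated_Ledger_Age:3|g"

# Split the datagram back into its parts with POSIX parameter expansion.
name=${m%%:*}       # everything before the first ':'
rest=${m#*:}
value=${rest%%|*}   # between ':' and '|'
type=${rest#*|}     # after '|'

printf '%s %s %s\n' "$name" "$value" "$type"
```

Sending one such datagram to the collector is a one-liner in bash (`printf '%s' "$m" > /dev/udp/127.0.0.1/8125`), which can be handy for smoke-testing the receiver before wiring up rippled itself.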

### Metric Inventory

| Category | Group | Type | Count | Key Metrics |
| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
| Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode |
| Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` |
| Ledger Fetch | | Counter | 1 | `ledger_fetches` |
| Ledger History | `ledger.history` | Counter | 1 | `mismatch` |
| RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) |
| Job Queue | | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` |
| Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` |
| Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` |
| Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
| Pathfinding | | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) |
| I/O | | Event | 1 | `ios_latency` (histogram) |
| Resource Mgr | | Meter | 2 | `warn`, `drop` (rate counters) |
| Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` |

**Total**: ~255 unique metrics (plus dynamic job-type and cache metrics)
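The Overlay Traffic row dominates the count because each traffic category expands into four metrics (Bytes/Messages × In/Out). A sketch of that expansion, using three illustrative category names (the real 57-entry list lives in rippled's traffic-count code):

```shell
# Expand a few sample categories into their four per-category metric names.
names=""
for cat in transaction proposal validation; do
  for dir in In Out; do
    names="$names rippled.${cat}.Bytes_${dir} rippled.${cat}.Messages_${dir}"
  done
done

# 3 categories x 4 metrics each = 12 names; 57 categories yield 228.
echo "$names" | wc -w
```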

### Tasks

| Task | Description | Effort | Risk |
| ---- | ----------------------------------------------------------------------------------------------------------------- | ------ | ---- |
| 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | 0.5d | Low |
| 6.2 | Add `statsd` receiver to OTel Collector config | 0.5d | Low |
| 6.3 | Expose UDP port 8125 in docker-compose.yml | 0.1d | Low |
| 6.4 | Add `[insight]` config to integration test node configs | 0.5d | Low |
| 6.5 | Create "Node Health" Grafana dashboard (8 panels) | 1d | Low |
| 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) | 1d | Low |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) | 1d | Low |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus | 0.5d | Low |
| 6.9 | Update TESTING.md and telemetry-runbook.md | 0.5d | Low |

**Total Effort**: 5.6 days (5.1 days excluding the deferred Task 6.1)
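For Task 6.2, the Collector-side change is a small receiver stanza. A hypothetical sketch, written as a shell heredoc so it can be dropped into the collector config — the keys follow the opentelemetry-collector-contrib `statsd` receiver, but confirm them against the collector version actually deployed:

```shell
# Write an illustrative statsd receiver snippet for otel-collector-config.yaml.
# The endpoint matches the 8125/udp port mapping added in docker-compose.yml.
cat > /tmp/statsd-receiver-sketch.yaml <<'EOF'
receivers:
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 10s
EOF

cat /tmp/statsd-receiver-sketch.yaml
```

The receiver must also be added to a metrics pipeline's `receivers:` list for its data to reach the Prometheus exporter.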

### Wire Format Fix (Task 6.1) — DEFERRED

The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with a `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn` and `drop` in the Resource Manager).

**Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
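On the wire, the deferred fix amounts to a one-character rewrite of the two Meter datagrams. Demonstrated here on hand-written sample datagrams (values illustrative):

```shell
# The two affected metrics currently leave with the custom |m suffix;
# rewriting it to |c produces standard StatsD counters the receiver accepts.
printf 'rippled.warn:1|m\nrippled.drop:2|m\n' | sed 's/|m$/|c/'
```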

### New Grafana Dashboards

**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):

- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches

**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):

- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories

**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):

- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap

### Exit Criteria

- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least the core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after the `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately)

---

## 6.9 Risk Assessment

```mermaid
quadrantChart
@@ -213,7 +286,7 @@ quadrantChart

---

-## 6.8 Success Metrics
+## 6.10 Success Metrics

| Metric | Target | Measurement |
| ------------------------ | ------------------------------ | --------------------- |
@@ -226,7 +299,7 @@ quadrantChart

---

-## 6.9 Effort Summary
+## 6.11 Effort Summary

<div align="center">

@@ -257,11 +330,11 @@ pie showData

---

-## 6.10 Quick Wins and Crawl-Walk-Run Strategy
+## 6.12 Quick Wins and Crawl-Walk-Run Strategy

This section outlines a prioritized approach to maximize ROI with minimal initial investment.

-### 6.10.1 Crawl-Walk-Run Overview
+### 6.12.1 Crawl-Walk-Run Overview

<div align="center">

@@ -300,7 +373,7 @@ flowchart TB

</div>

-### 6.10.2 Quick Wins (Immediate Value)
+### 6.12.2 Quick Wins (Immediate Value)

| Quick Win | Effort | Value | When to Deploy |
| ------------------------------ | -------- | ------ | -------------- |
@@ -310,7 +383,7 @@ flowchart TB
| **Transaction Submit Tracing** | 1 day | High | Week 3 |
| **Consensus Round Duration** | 1 day | Medium | Week 6 |

-### 6.10.3 CRAWL Phase (Weeks 1-2)
+### 6.12.3 CRAWL Phase (Weeks 1-2)

**Goal**: Get basic tracing working with minimal code changes.

@@ -330,7 +403,7 @@ flowchart TB
- No cross-node complexity
- Single file modification to existing code

-### 6.10.4 WALK Phase (Weeks 3-5)
+### 6.12.4 WALK Phase (Weeks 3-5)

**Goal**: Add transaction lifecycle tracing across nodes.

@@ -349,7 +422,7 @@ flowchart TB
- Moderate complexity (requires context propagation)
- High value for debugging transaction issues

-### 6.10.5 RUN Phase (Weeks 6-9)
+### 6.12.5 RUN Phase (Weeks 6-9)

**Goal**: Full observability including consensus.

@@ -368,7 +441,7 @@ flowchart TB
- Requires thorough testing
- Lower relative value (consensus issues are rarer)

-### 6.10.6 ROI Prioritization Matrix
+### 6.12.6 ROI Prioritization Matrix

```mermaid
quadrantChart
@@ -390,11 +463,11 @@ quadrantChart

---

-## 6.11 Definition of Done
+## 6.13 Definition of Done

Clear, measurable criteria for each phase.

-### 6.11.1 Phase 1: Core Infrastructure
+### 6.13.1 Phase 1: Core Infrastructure

| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -406,7 +479,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: All criteria met, PR merged, no regressions in CI.

-### 6.11.2 Phase 2: RPC Tracing
+### 6.13.2 Phase 2: RPC Tracing

| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |
@@ -418,7 +491,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.

-### 6.11.3 Phase 3: Transaction Tracing
+### 6.13.3 Phase 3: Transaction Tracing

| Criterion | Measurement | Target |
| ---------------- | ------------------------------- | ---------------------------------- |
@@ -430,7 +503,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.

-### 6.11.4 Phase 4: Consensus Tracing
+### 6.13.4 Phase 4: Consensus Tracing

| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |
@@ -442,7 +515,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

-### 6.11.5 Phase 5: Production Deployment
+### 6.13.5 Phase 5: Production Deployment

| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |
@@ -455,7 +528,7 @@ Clear, measurable criteria for each phase.

**Definition of Done**: Telemetry running in production, operators trained, alerts active.

-### 6.11.6 Success Metrics Summary
+### 6.13.6 Success Metrics Summary

| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------- | --------------------------- | ------------- |
@@ -467,7 +540,7 @@ Clear, measurable criteria for each phase.

---

-## 6.12 Recommended Implementation Order
+## 6.14 Recommended Implementation Order

Based on ROI analysis, implement in this exact order:

docker/telemetry/TESTING.md

Lines changed: 81 additions & 7 deletions
@@ -370,7 +370,7 @@ See the "Verification Queries" section below.

## Expected Span Catalog

-All 12 production span names instrumented across Phases 2-4:
+All 16 production span names instrumented across Phases 2-5:

| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
@@ -380,10 +380,16 @@ All 12 production span names instrumented across Phases 2-4:
| `rpc.command.<name>` | RPCHandler.cpp:161 | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
| `tx.process` | NetworkOPs.cpp:1227 | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
| `tx.receive` | PeerImp.cpp:1273 | 3 | `xrpl.peer.id` | Peer relays transaction |
+| `tx.apply` | BuildLedger.cpp:88 | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | 4 | `xrpl.consensus.round` | Consensus proposing phase |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
| `consensus.validation.send` | RCLConsensus.cpp:753 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
+| `ledger.build` | BuildLedger.cpp:31 | 5 | `xrpl.ledger.seq` | Ledger build |
+| `ledger.validate` | LedgerMaster.cpp:915 | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
+| `ledger.store` | LedgerMaster.cpp:409 | 5 | `xrpl.ledger.seq` | Ledger stored |
+| `peer.proposal.receive` | PeerImp.cpp:1667 | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
+| `peer.validation.receive` | PeerImp.cpp:2264 | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |

---

@@ -405,9 +411,11 @@ curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
# Query traces by operation
for op in "rpc.request" "rpc.process" \
  "rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
-  "tx.process" "tx.receive" \
+  "tx.process" "tx.receive" "tx.apply" \
  "consensus.proposal.send" "consensus.ledger_close" \
-  "consensus.accept" "consensus.validation.send"; do
+  "consensus.accept" "consensus.validation.send" \
+  "ledger.build" "ledger.validate" "ledger.store" \
+  "peer.proposal.receive" "peer.validation.receive"; do
  count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
    | jq '.data | length')
  printf "%-35s %s traces\n" "$op" "$count"
@@ -434,15 +442,81 @@ curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"r
  | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```

### StatsD Metrics (beast::insight)

rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector on port 8125. These appear in Prometheus alongside spanmetrics.

Requires `[insight]` config in `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```

Verify StatsD metrics in Prometheus:

```bash
# Ledger age gauge
curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result'

# Peer counts
curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result'

# RPC request counter
curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result'

# State accounting
curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result'

# Overlay traffic
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'
```

Key StatsD metrics (prefix `rippled_`):

| Metric | Type | Source |
| ------------------------------------- | --------- | ----------------------------------------- |
| `LedgerMaster_Validated_Ledger_Age` | gauge | LedgerMaster.h:373 |
| `LedgerMaster_Published_Ledger_Age` | gauge | LedgerMaster.h:374 |
| `State_Accounting_{Mode}_duration` | gauge | NetworkOPs.cpp:774 |
| `State_Accounting_{Mode}_transitions` | gauge | NetworkOPs.cpp:780 |
| `Peer_Finder_Active_Inbound_Peers` | gauge | PeerfinderManager.cpp:214 |
| `Peer_Finder_Active_Outbound_Peers` | gauge | PeerfinderManager.cpp:215 |
| `Overlay_Peer_Disconnects` | gauge | OverlayImpl.h:557 |
| `job_count` | gauge | JobQueue.cpp:26 |
| `rpc_requests` | counter | ServerHandler.cpp:108 |
| `rpc_time` | histogram | ServerHandler.cpp:110 |
| `rpc_size` | histogram | ServerHandler.cpp:109 |
| `ios_latency` | histogram | Application.cpp:438 |
| `pathfind_fast` | histogram | PathRequests.h:23 |
| `pathfind_full` | histogram | PathRequests.h:24 |
| `ledger_fetches` | counter | InboundLedgers.cpp:44 |
| `ledger_history_mismatch` | counter | LedgerHistory.cpp:16 |
| `warn` | counter | Logic.h:33 |
| `drop` | counter | Logic.h:34 |
| `{category}_Bytes_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |
| `{category}_Messages_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |

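The Prometheus names in the table follow from a simple mapping: the collector flattens the dotted StatsD names to underscores under the `rippled` prefix. A simplified sketch of that mapping (the collector's full name sanitization has more rules than this):

```shell
# StatsD name as emitted over UDP vs. the name queried in Prometheus.
statsd_name="rippled.LedgerMaster.Validated_Ledger_Age"
prom_name=$(printf '%s' "$statsd_name" | tr '.' '_')
echo "$prom_name"   # rippled_LedgerMaster_Validated_Ledger_Age
```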
### Grafana

Open http://localhost:3000 (anonymous admin access enabled).

-Pre-configured dashboards:
+Pre-configured dashboards (span-derived):

- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

+Pre-configured dashboards (StatsD):

-- **RPC Performance**: Request rates, latency percentiles by command
-- **Transaction Overview**: Transaction processing rates and paths
-- **Consensus Health**: Consensus round duration and proposer counts
+- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue
+- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category
+- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

docker/telemetry/docker-compose.yml

Lines changed: 2 additions & 1 deletion
@@ -23,7 +23,8 @@ services:
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
-     - "8889:8889" # Prometheus metrics (spanmetrics)
+     - "8125:8125/udp" # StatsD UDP (beast::insight metrics)
+     - "8889:8889" # Prometheus metrics (spanmetrics + statsd)
      - "13133:13133" # Health check
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
