@@ -182,7 +182,80 @@ gantt
182182
183183---
184184
185- ## 6.7 Risk Assessment
185+ ## 6.7 Phase 6: StatsD Metrics Integration (Week 10)
186+
187+ **Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.
188+
189+ ### Background
190+
191+ rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic. This data **does not** overlap with the span-based instrumentation from Phases 1-5, so adding a StatsD receiver to the OTel Collector lets both metric sources converge in Prometheus.
192+
193+ ### Metric Inventory
194+
195+ | Category        | Group              | Type          | Count      | Key Metrics                                            |
196+ | --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
197+ | Node State      | `State_Accounting` | Gauge         | 10         | `*_duration`, `*_transitions` per operating mode       |
198+ | Ledger          | `LedgerMaster`     | Gauge         | 2          | `Validated_Ledger_Age`, `Published_Ledger_Age`         |
199+ | Ledger Fetch    | —                  | Counter       | 1          | `ledger_fetches`                                       |
200+ | Ledger History  | `ledger.history`   | Counter       | 1          | `mismatch`                                             |
201+ | RPC             | `rpc`              | Counter+Event | 3          | `requests`, `time` (histogram), `size` (histogram)     |
202+ | Job Queue       | —                  | Gauge+Event   | 1 + 2×N    | `job_count`, per-job `{name}` and `{name}_q`           |
203+ | Peer Finder     | `Peer_Finder`      | Gauge         | 2          | `Active_Inbound_Peers`, `Active_Outbound_Peers`        |
204+ | Overlay         | `Overlay`          | Gauge         | 1          | `Peer_Disconnects`                                     |
205+ | Overlay Traffic | per-category       | Gauge         | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
206+ | Pathfinding     | —                  | Event         | 2          | `pathfind_fast`, `pathfind_full` (histograms)          |
207+ | I/O             | —                  | Event         | 1          | `ios_latency` (histogram)                              |
208+ | Resource Mgr    | —                  | Meter         | 2          | `warn`, `drop` (rate counters)                         |
209+ | Caches          | per-cache          | Gauge         | 2×N        | `{cache}.size`, `{cache}.hit_rate`                     |
210+
211+ **Total**: ~255+ unique metrics (plus dynamic job-type and cache metrics)
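
As a cross-check on the total, the fixed rows of the inventory table can be tallied directly. The sketch below is only illustrative: the dictionary keys are labels chosen for this example, and `job_types` and `caches` are hypothetical parameters standing in for the unspecified N in the Job Queue and Caches rows.

```python
# Fixed per-row counts copied from the inventory table above.
# Keys are illustrative labels, not necessarily real metric names.
FIXED_COUNTS = {
    "node_state": 10,
    "ledger": 2,
    "ledger_fetch": 1,
    "ledger_history": 1,
    "rpc": 3,
    "job_queue_base": 1,       # job_count only; per-job metrics are dynamic
    "peer_finder": 2,
    "overlay": 1,
    "overlay_traffic": 4 * 57,
    "pathfinding": 2,
    "io": 1,
    "resource_manager": 2,
}


def total_metrics(job_types: int = 0, caches: int = 0) -> int:
    """Fixed metrics plus 2 per job type and 2 per cache (both hypothetical inputs)."""
    return sum(FIXED_COUNTS.values()) + 2 * job_types + 2 * caches


if __name__ == "__main__":
    print(total_metrics())  # fixed rows only
```

With both dynamic inputs at zero the fixed rows sum to 254, which is consistent with the "~255+" figure once any dynamic job-type or cache metrics are included.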
212+
213+ ### Tasks
214+
215+ | Task | Description                                                                                                     | Effort | Risk |
216+ | ---- | --------------------------------------------------------------------------------------------------------------- | ------ | ---- |
217+ | 6.1  | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp (breaking change, tracked separately) | 0.5d   | Low  |
218+ | 6.2  | Add `statsd` receiver to OTel Collector config                                                                  | 0.5d   | Low  |
219+ | 6.3  | Expose UDP port 8125 in docker-compose.yml                                                                      | 0.1d   | Low  |
220+ | 6.4  | Add `[insight]` config to integration test node configs                                                         | 0.5d   | Low  |
221+ | 6.5  | Create "Node Health" Grafana dashboard (8 panels)                                                               | 1d     | Low  |
222+ | 6.6  | Create "Network Traffic" Grafana dashboard (8 panels)                                                           | 1d     | Low  |
223+ | 6.7  | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels)                                                | 1d     | Low  |
224+ | 6.8  | Update integration test to verify StatsD metrics in Prometheus                                                  | 0.5d   | Low  |
225+ | 6.9  | Update TESTING.md and telemetry-runbook.md                                                                      | 0.5d   | Low  |
226+
227+ **Total Effort**: 5.6 days
228+
229+ ### Wire Format Fix (Task 6.1) — DEFERRED
230+
231+ The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with a `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).
232+
233+ **Status**: Deferred as a separate change, because it is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
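
To illustrate why a `|m` line is lost, here is a minimal sketch of a strict StatsD line parser. It is not the OTel receiver's code. The accepted type set is an assumption based on commonly documented StatsD types, and the metric name `rippled.warn` is hypothetical.

```python
from typing import Optional, Tuple

# Assumption: commonly documented StatsD type suffixes
# (counter, gauge, timer, histogram, set). "m" is not among them.
STANDARD_TYPES = {"c", "g", "ms", "h", "s"}


def parse_statsd_line(line: str) -> Optional[Tuple[str, float, str]]:
    """Return (name, value, type), or None if the line is malformed
    or carries a type suffix outside STANDARD_TYPES."""
    name, sep, rest = line.partition(":")
    if not sep or not name:
        return None
    fields = rest.split("|")
    if len(fields) < 2:
        return None
    value, metric_type = fields[0], fields[1]
    if metric_type not in STANDARD_TYPES:
        return None  # an unknown type such as "m" is dropped here
    try:
        return name, float(value), metric_type
    except ValueError:
        return None


if __name__ == "__main__":
    print(parse_statsd_line("rippled.warn:1|m"))  # rejected: custom meter type
    print(parse_statsd_line("rippled.warn:1|c"))  # accepted as a counter
```

A parser of this shape returns nothing for the `|m` line and a counter sample for the `|c` line, which mirrors the behaviour described above for the two Resource Manager meters.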
234+
235+ ### New Grafana Dashboards
236+
237+ **Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):
238+
239+ - Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches
240+
241+ **Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):
242+
243+ - Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories
244+
245+ **RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):
246+
247+ - RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap
248+
249+ ### Exit Criteria
250+
251+ - [ ] StatsD metrics visible in Prometheus (`curl 'localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age'`)
252+ - [ ] All 3 new Grafana dashboards load without errors
253+ - [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
254+ - [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ (DEFERRED: breaking change, tracked separately)
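
The first exit criterion can also be checked programmatically. The sketch below assumes Prometheus is reachable at `http://localhost:9090`, as in the `curl` example, and relies on the Prometheus HTTP API returning `status` and `data.result` fields for instant queries. The function names are hypothetical helpers, not part of the existing integration test.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def has_samples(response_body: str) -> bool:
    """True if the query succeeded and returned at least one series."""
    payload = json.loads(response_body)
    result = payload.get("data", {}).get("result")
    return payload.get("status") == "success" and bool(result)


def metric_is_visible(base_url: str, promql: str, timeout: float = 5.0) -> bool:
    """Query Prometheus and report whether the metric has any samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return has_samples(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running Prometheus; the address is an assumption.
    print(metric_is_visible("http://localhost:9090",
                            "rippled_LedgerMaster_Validated_Ledger_Age"))
```

Keeping URL construction and response parsing separate from the network call lets the parsing logic be exercised without a live Prometheus instance.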
255+
256+ ---
257+
258+ ## 6.9 Risk Assessment
186259
187260```mermaid
188261quadrantChart
@@ -213,7 +286,7 @@ quadrantChart
213286
214287---
215288
216- ## 6.8 Success Metrics
289+ ## 6.10 Success Metrics
217290
218291| Metric | Target | Measurement |
219292| ------------------------ | ------------------------------ | --------------------- |
@@ -226,7 +299,7 @@ quadrantChart
226299
227300---
228301
229- ## 6.9 Effort Summary
302+ ## 6.11 Effort Summary
230303
231304<div align="center">
232305
@@ -257,11 +330,11 @@ pie showData
257330
258331---
259332
260- ## 6.10 Quick Wins and Crawl-Walk-Run Strategy
333+ ## 6.12 Quick Wins and Crawl-Walk-Run Strategy
261334
262335This section outlines a prioritized approach to maximize ROI with minimal initial investment.
263336
264- ### 6.10.1 Crawl-Walk-Run Overview
337+ ### 6.12.1 Crawl-Walk-Run Overview
265338
266339<div align="center">
267340
@@ -300,7 +373,7 @@ flowchart TB
300373
301374</div>
302375
303- ### 6.10.2 Quick Wins (Immediate Value)
376+ ### 6.12.2 Quick Wins (Immediate Value)
304377
305378| Quick Win | Effort | Value | When to Deploy |
306379| ------------------------------ | -------- | ------ | -------------- |
@@ -310,7 +383,7 @@ flowchart TB
310383| **Transaction Submit Tracing** | 1 day | High | Week 3 |
311384| **Consensus Round Duration** | 1 day | Medium | Week 6 |
312385
313- ### 6.10.3 CRAWL Phase (Weeks 1-2)
386+ ### 6.12.3 CRAWL Phase (Weeks 1-2)
314387
315388**Goal**: Get basic tracing working with minimal code changes.
316389
@@ -330,7 +403,7 @@ flowchart TB
330403- No cross-node complexity
331404- Single file modification to existing code
332405
333- ### 6.10.4 WALK Phase (Weeks 3-5)
406+ ### 6.12.4 WALK Phase (Weeks 3-5)
334407
335408**Goal**: Add transaction lifecycle tracing across nodes.
336409
@@ -349,7 +422,7 @@ flowchart TB
349422- Moderate complexity (requires context propagation)
350423- High value for debugging transaction issues
351424
352- ### 6.10.5 RUN Phase (Weeks 6-9)
425+ ### 6.12.5 RUN Phase (Weeks 6-9)
353426
354427**Goal**: Full observability including consensus.
355428
@@ -368,7 +441,7 @@ flowchart TB
368441- Requires thorough testing
369442- Lower relative value (consensus issues are rarer)
370443
371- ### 6.10.6 ROI Prioritization Matrix
444+ ### 6.12.6 ROI Prioritization Matrix
372445
373446```mermaid
374447quadrantChart
@@ -390,11 +463,11 @@ quadrantChart
390463
391464---
392465
393- ## 6.11 Definition of Done
466+ ## 6.13 Definition of Done
394467
395468Clear, measurable criteria for each phase.
396469
397- ### 6.11.1 Phase 1: Core Infrastructure
470+ ### 6.13.1 Phase 1: Core Infrastructure
398471
399472| Criterion | Measurement | Target |
400473| --------------- | ---------------------------------------------------------- | ---------------------------- |
@@ -406,7 +479,7 @@ Clear, measurable criteria for each phase.
406479
407480**Definition of Done**: All criteria met, PR merged, no regressions in CI.
408481
409- ### 6.11.2 Phase 2: RPC Tracing
482+ ### 6.13.2 Phase 2: RPC Tracing
410483
411484| Criterion | Measurement | Target |
412485| ------------------ | ---------------------------------- | -------------------------- |
@@ -418,7 +491,7 @@ Clear, measurable criteria for each phase.
418491
419492**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.
420493
421- ### 6.11.3 Phase 3: Transaction Tracing
494+ ### 6.13.3 Phase 3: Transaction Tracing
422495
423496| Criterion | Measurement | Target |
424497| ---------------- | ------------------------------- | ---------------------------------- |
@@ -430,7 +503,7 @@ Clear, measurable criteria for each phase.
430503
431504**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.
432505
433- ### 6.11.4 Phase 4: Consensus Tracing
506+ ### 6.13.4 Phase 4: Consensus Tracing
434507
435508| Criterion | Measurement | Target |
436509| -------------------- | ----------------------------- | ------------------------- |
@@ -442,7 +515,7 @@ Clear, measurable criteria for each phase.
442515
443516**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.
444517
445- ### 6.11.5 Phase 5: Production Deployment
518+ ### 6.13.5 Phase 5: Production Deployment
446519
447520| Criterion | Measurement | Target |
448521| ------------ | ---------------------------- | -------------------------- |
@@ -455,7 +528,7 @@ Clear, measurable criteria for each phase.
455528
456529**Definition of Done**: Telemetry running in production, operators trained, alerts active.
457530
458- ### 6.11.6 Success Metrics Summary
531+ ### 6.13.6 Success Metrics Summary
459532
460533| Phase | Primary Metric | Secondary Metric | Deadline |
461534| ------- | ---------------------- | --------------------------- | ------------- |
@@ -467,7 +540,7 @@ Clear, measurable criteria for each phase.
467540
468541---
469542
470- ## 6.12 Recommended Implementation Order
543+ ## 6.14 Recommended Implementation Order
471544
472545Based on ROI analysis, implement in this exact order:
473546