109 changes: 91 additions & 18 deletions OpenTelemetryPlan/06-implementation-phases.md

---

## 6.7 Phase 6: StatsD Metrics Integration (Week 10)

**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.

### Background

rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.

### Metric Inventory

| Category | Group | Type | Count | Key Metrics |
| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
| Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode |
| Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` |
| Ledger Fetch | — | Counter | 1 | `ledger_fetches` |
| Ledger History | `ledger.history` | Counter | 1 | `mismatch` |
| RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) |
| Job Queue | — | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` |
| Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` |
| Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` |
| Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
| Pathfinding | — | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) |
| I/O | — | Event | 1 | `ios_latency` (histogram) |
| Resource Mgr | — | Meter | 2 | `warn`, `drop` (rate counters) |
| Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` |

**Total**: ~255 fixed metrics (plus dynamic job-type and cache metrics)
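A quick sanity check on the fixed rows of the Count column (shell arithmetic only; the dynamic job-type and cache entries are excluded):

```shell
# Overlay traffic dominates: 4 series (Bytes/Messages x In/Out) per category.
overlay=$((4 * 57))
echo "$overlay"

# Sum of the fixed Count entries from the inventory table above.
total=$((10 + 2 + 1 + 1 + 3 + 1 + 2 + 1 + overlay + 2 + 1 + 2))
echo "$total"
```

This prints 228 and 254; the dynamic per-job and per-cache gauges push the grand total past the ~255 figure.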

### Tasks

| Task | Description | Effort | Risk |
| ---- | --------------------------------------------------------------------------------------------------------------- | ------ | ---- |
| 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | 0.5d | Low |
| 6.2 | Add `statsd` receiver to OTel Collector config | 0.5d | Low |
| 6.3 | Expose UDP port 8125 in docker-compose.yml | 0.1d | Low |
| 6.4 | Add `[insight]` config to integration test node configs | 0.5d | Low |
| 6.5 | Create "Node Health" Grafana dashboard (8 panels) | 1d | Low |
| 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) | 1d | Low |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) | 1d | Low |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus | 0.5d | Low |
| 6.9 | Update TESTING.md and telemetry-runbook.md | 0.5d | Low |

**Total Effort**: 5.6 days (5.1 days excluding the deferred Task 6.1)

### Wire Format Fix (Task 6.1) — DEFERRED

The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).

**Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
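The difference is a single byte in the datagram's type suffix. A throwaway sketch of the two wire formats (the `statsd_line` helper is invented here for illustration, not rippled code):

```shell
# Build a StatsD datagram: "name:value|type".
statsd_line() {
  printf '%s:%s|%s' "$1" "$2" "$3"
}

# Current emission for the Resource Manager meters (non-standard "m" type,
# silently dropped by the OTel statsd receiver):
statsd_line rippled.warn 1 m; echo

# After the deferred fix (standard counter type "c"):
statsd_line rippled.warn 1 c; echo
```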

### New Grafana Dashboards

**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):

- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches

**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):

- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories

**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):

- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap

### Exit Criteria

- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately)
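For the first criterion, a passing query returns a non-empty `result` array. A minimal sketch of pulling the sample value out of such a response with POSIX `sed` (the JSON literal below is hand-written to show the expected shape, not captured output):

```shell
# Hand-written example of a Prometheus instant-query response (vector type).
resp='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"rippled_LedgerMaster_Validated_Ledger_Age"},"value":[1700000000,"3"]}]}}'

# Pull out the sample value (second element of "value"); an empty result
# means the metric has not reached Prometheus yet.
value=$(printf '%s' "$resp" | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p')
echo "$value"   # the ledger age, in seconds, from the example above
```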

---

## 6.9 Risk Assessment


---

## 6.8 Success Metrics
## 6.10 Success Metrics

| Metric | Target | Measurement |
| ------------------------ | ------------------------------ | --------------------- |

---

## 6.9 Effort Summary
## 6.11 Effort Summary


---

## 6.10 Quick Wins and Crawl-Walk-Run Strategy
## 6.12 Quick Wins and Crawl-Walk-Run Strategy

This section outlines a prioritized approach to maximize ROI with minimal initial investment.

### 6.10.1 Crawl-Walk-Run Overview
### 6.12.1 Crawl-Walk-Run Overview


### 6.10.2 Quick Wins (Immediate Value)
### 6.12.2 Quick Wins (Immediate Value)

| Quick Win | Effort | Value | When to Deploy |
| ------------------------------ | -------- | ------ | -------------- |
| **Transaction Submit Tracing** | 1 day | High | Week 3 |
| **Consensus Round Duration** | 1 day | Medium | Week 6 |

### 6.10.3 CRAWL Phase (Weeks 1-2)
### 6.12.3 CRAWL Phase (Weeks 1-2)

**Goal**: Get basic tracing working with minimal code changes.

- No cross-node complexity
- Single file modification to existing code

### 6.10.4 WALK Phase (Weeks 3-5)
### 6.12.4 WALK Phase (Weeks 3-5)

**Goal**: Add transaction lifecycle tracing across nodes.

- Moderate complexity (requires context propagation)
- High value for debugging transaction issues

### 6.10.5 RUN Phase (Weeks 6-9)
### 6.12.5 RUN Phase (Weeks 6-9)

**Goal**: Full observability including consensus.

- Requires thorough testing
- Lower relative value (consensus issues are rarer)

### 6.10.6 ROI Prioritization Matrix
### 6.12.6 ROI Prioritization Matrix


---

## 6.11 Definition of Done
## 6.13 Definition of Done

Clear, measurable criteria for each phase.

### 6.11.1 Phase 1: Core Infrastructure
### 6.13.1 Phase 1: Core Infrastructure

| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |

**Definition of Done**: All criteria met, PR merged, no regressions in CI.

### 6.11.2 Phase 2: RPC Tracing
### 6.13.2 Phase 2: RPC Tracing

| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |

**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.

### 6.11.3 Phase 3: Transaction Tracing
### 6.13.3 Phase 3: Transaction Tracing

| Criterion | Measurement | Target |
| ---------------- | ------------------------------- | ---------------------------------- |

**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.

### 6.11.4 Phase 4: Consensus Tracing
### 6.13.4 Phase 4: Consensus Tracing

| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |

**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

### 6.11.5 Phase 5: Production Deployment
### 6.13.5 Phase 5: Production Deployment

| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |

**Definition of Done**: Telemetry running in production, operators trained, alerts active.

### 6.11.6 Success Metrics Summary
### 6.13.6 Success Metrics Summary

| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------- | --------------------------- | ------------- |

---

## 6.12 Recommended Implementation Order
## 6.14 Recommended Implementation Order

Based on ROI analysis, implement in this exact order:

88 changes: 81 additions & 7 deletions docker/telemetry/TESTING.md
See the "Verification Queries" section below.

## Expected Span Catalog

All 16 production span names instrumented across Phases 2-5:

| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
| `rpc.command.<name>` | RPCHandler.cpp:161 | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
| `tx.process` | NetworkOPs.cpp:1227 | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
| `tx.receive` | PeerImp.cpp:1273 | 3 | `xrpl.peer.id` | Peer relays transaction |
| `tx.apply` | BuildLedger.cpp:88 | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | 4 | `xrpl.consensus.round` | Consensus proposing phase |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
| `consensus.validation.send` | RCLConsensus.cpp:753 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
| `ledger.build` | BuildLedger.cpp:31 | 5 | `xrpl.ledger.seq` | Ledger build |
| `ledger.validate` | LedgerMaster.cpp:915 | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
| `ledger.store` | LedgerMaster.cpp:409 | 5 | `xrpl.ledger.seq` | Ledger stored |
| `peer.proposal.receive` | PeerImp.cpp:1667 | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
| `peer.validation.receive` | PeerImp.cpp:2264 | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |

---

curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
# Query traces by operation
for op in "rpc.request" "rpc.process" \
"rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
"tx.process" "tx.receive" \
"tx.process" "tx.receive" "tx.apply" \
"consensus.proposal.send" "consensus.ledger_close" \
"consensus.accept" "consensus.validation.send"; do
"consensus.accept" "consensus.validation.send" \
"ledger.build" "ledger.validate" "ledger.store" \
"peer.proposal.receive" "peer.validation.receive"; do
count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
| jq '.data | length')
  printf "%-35s %s traces\n" "$op" "$count"
done

# Span-derived RED metrics: per-command call counts from spanmetrics
curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}" \
  | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```

### StatsD Metrics (beast::insight)

rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector
on port 8125. These appear in Prometheus alongside spanmetrics.

Requires `[insight]` config in `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```
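Before wiring up rippled itself, the UDP path can be smoke-tested with a hand-built datagram in the same line format (`rippled.smoke.test` is a made-up metric name; the `nc` send is commented out because it needs the collector stack running):

```shell
# A StatsD counter datagram, as beast::insight would emit it.
datagram='rippled.smoke.test:1|c'
echo "$datagram"

# Send it to the collector's statsd receiver (uncomment with the stack up):
# printf '%s' "$datagram" | nc -u -w1 127.0.0.1 8125
```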

Verify StatsD metrics in Prometheus:

```bash
# Ledger age gauge
curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result'

# Peer counts
curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result'

# RPC request counter
curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result'

# State accounting
curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result'

# Overlay traffic
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'
```
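The plain-URL curls above work for bare metric names, but selectors with braces, quotes, or `[5m]` range vectors need URL encoding. A small helper that delegates the encoding to curl (the name `promq` is made up here):

```shell
PROM="${PROM:-http://localhost:9090}"

# Query Prometheus with the expression passed through --data-urlencode,
# so regex selectors and range vectors survive intact.
promq() {
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$1"
}

# Example (requires the stack running):
# promq 'rate(rippled_rpc_requests[5m])'
echo "promq defined"
```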

Key StatsD metrics (prefix `rippled_`):

| Metric | Type | Source |
| ------------------------------------- | --------- | ----------------------------------------- |
| `LedgerMaster_Validated_Ledger_Age` | gauge | LedgerMaster.h:373 |
| `LedgerMaster_Published_Ledger_Age` | gauge | LedgerMaster.h:374 |
| `State_Accounting_{Mode}_duration` | gauge | NetworkOPs.cpp:774 |
| `State_Accounting_{Mode}_transitions` | gauge | NetworkOPs.cpp:780 |
| `Peer_Finder_Active_Inbound_Peers` | gauge | PeerfinderManager.cpp:214 |
| `Peer_Finder_Active_Outbound_Peers` | gauge | PeerfinderManager.cpp:215 |
| `Overlay_Peer_Disconnects` | gauge | OverlayImpl.h:557 |
| `job_count` | gauge | JobQueue.cpp:26 |
| `rpc_requests` | counter | ServerHandler.cpp:108 |
| `rpc_time` | histogram | ServerHandler.cpp:110 |
| `rpc_size` | histogram | ServerHandler.cpp:109 |
| `ios_latency` | histogram | Application.cpp:438 |
| `pathfind_fast` | histogram | PathRequests.h:23 |
| `pathfind_full` | histogram | PathRequests.h:24 |
| `ledger_fetches` | counter | InboundLedgers.cpp:44 |
| `ledger_history_mismatch` | counter | LedgerHistory.cpp:16 |
| `warn` | counter | Logic.h:33 |
| `drop` | counter | Logic.h:34 |
| `{category}_Bytes_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |
| `{category}_Messages_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |

### Grafana

Open http://localhost:3000 (anonymous admin access enabled).

Pre-configured dashboards (span-derived):

- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

Pre-configured dashboards (StatsD):

- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue
- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category
- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

3 changes: 2 additions & 1 deletion docker/telemetry/docker-compose.yml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8125:8125/udp" # StatsD UDP (beast::insight metrics)
- "8889:8889" # Prometheus metrics (spanmetrics + statsd)
- "13133:13133" # Health check
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro