109 changes: 91 additions & 18 deletions OpenTelemetryPlan/06-implementation-phases.md

---

## 6.7 Phase 6: StatsD Metrics Integration (Week 10)

**Objective**: Bridge rippled's existing `beast::insight` StatsD metrics into the OpenTelemetry collection pipeline, exposing 300+ pre-existing metrics alongside span-derived RED metrics in Prometheus/Grafana.

### Background

rippled has a mature metrics framework (`beast::insight`) that emits StatsD-format metrics over UDP. These metrics cover node health, peer networking, RPC performance, job queue, and overlay traffic — data that **does not** overlap with the span-based instrumentation from Phases 1-5. By adding a StatsD receiver to the OTel Collector, both metric sources converge in Prometheus.

### Metric Inventory

| Category | Group | Type | Count | Key Metrics |
| --------------- | ------------------ | ------------- | ---------- | ------------------------------------------------------ |
| Node State | `State_Accounting` | Gauge | 10 | `*_duration`, `*_transitions` per operating mode |
| Ledger | `LedgerMaster` | Gauge | 2 | `Validated_Ledger_Age`, `Published_Ledger_Age` |
| Ledger Fetch | — | Counter | 1 | `ledger_fetches` |
| Ledger History | `ledger.history` | Counter | 1 | `mismatch` |
| RPC | `rpc` | Counter+Event | 3 | `requests`, `time` (histogram), `size` (histogram) |
| Job Queue | — | Gauge+Event | 1 + 2×N | `job_count`, per-job `{name}` and `{name}_q` |
| Peer Finder | `Peer_Finder` | Gauge | 2 | `Active_Inbound_Peers`, `Active_Outbound_Peers` |
| Overlay | `Overlay` | Gauge | 1 | `Peer_Disconnects` |
| Overlay Traffic | per-category | Gauge | 4×57 = 228 | `Bytes_In/Out`, `Messages_In/Out` per traffic category |
| Pathfinding | — | Event | 2 | `pathfind_fast`, `pathfind_full` (histograms) |
| I/O | — | Event | 1 | `ios_latency` (histogram) |
| Resource Mgr | — | Meter | 2 | `warn`, `drop` (rate counters) |
| Caches | per-cache | Gauge | 2×N | `{cache}.size`, `{cache}.hit_rate` |

**Total**: ~255 fixed metrics (plus dynamic job-type and cache metrics)
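A quick sanity check on the fixed rows of the Count column (shell arithmetic only; the dynamic job-type and cache entries are excluded):

```shell
# Overlay traffic dominates: 4 series (Bytes/Messages x In/Out) per category.
overlay=$((4 * 57))
echo "$overlay"

# Sum of the fixed Count entries from the inventory table above.
total=$((10 + 2 + 1 + 1 + 3 + 1 + 2 + 1 + overlay + 2 + 1 + 2))
echo "$total"
```

This prints 228 and 254; the dynamic per-job and per-cache gauges push the grand total past the ~255 figure.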

### Tasks

| Task | Description | Effort | Risk |
| ---- | --------------------------------------------------------------------------------------------------------------- | ------ | ---- |
| 6.1 | **DEFERRED** Fix Meter wire format (`\|m` → `\|c`) in StatsDCollector.cpp — breaking change, tracked separately | 0.5d | Low |
| 6.2 | Add `statsd` receiver to OTel Collector config | 0.5d | Low |
| 6.3 | Expose UDP port 8125 in docker-compose.yml | 0.1d | Low |
| 6.4 | Add `[insight]` config to integration test node configs | 0.5d | Low |
| 6.5 | Create "Node Health" Grafana dashboard (8 panels) | 1d | Low |
| 6.6 | Create "Network Traffic" Grafana dashboard (8 panels) | 1d | Low |
| 6.7 | Create "RPC & Pathfinding (StatsD)" Grafana dashboard (8 panels) | 1d | Low |
| 6.8 | Update integration test to verify StatsD metrics in Prometheus | 0.5d | Low |
| 6.9 | Update TESTING.md and telemetry-runbook.md | 0.5d | Low |

**Total Effort**: 5.6 days (5.1 days excluding the deferred Task 6.1)

### Wire Format Fix (Task 6.1) — DEFERRED

The `StatsDMeterImpl` in `StatsDCollector.cpp:706` sends metrics with `|m` suffix, which is non-standard StatsD. The OTel StatsD receiver silently drops these. Fix: change `|m` to `|c` (counter), which is semantically correct since meters are increment-only counters. Only 2 metrics are affected (`warn`, `drop` in Resource Manager).

**Status**: Deferred as a separate change — this is a breaking change for any StatsD backend that previously consumed the custom `|m` type. The Resource Warnings and Resource Drops dashboard panels will show no data until this fix is applied.
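The difference is a single byte in the datagram's type suffix. A throwaway sketch of the two wire formats (the `statsd_line` helper is invented here for illustration, not rippled code):

```shell
# Build a StatsD datagram: "name:value|type".
statsd_line() {
  printf '%s:%s|%s' "$1" "$2" "$3"
}

# Current emission for the Resource Manager meters (non-standard "m" type,
# silently dropped by the OTel statsd receiver):
statsd_line rippled.warn 1 m; echo

# After the deferred fix (standard counter type "c"):
statsd_line rippled.warn 1 c; echo
```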

### New Grafana Dashboards

**Node Health** (`statsd-node-health.json`, uid: `rippled-statsd-node-health`):

- Validated/Published Ledger Age, Operating Mode Duration/Transitions, I/O Latency, Job Queue Depth, Ledger Fetch Rate, Ledger History Mismatches

**Network Traffic** (`statsd-network-traffic.json`, uid: `rippled-statsd-network`):

- Active Inbound/Outbound Peers, Peer Disconnects, Total Bytes/Messages In/Out, Transaction/Proposal/Validation Traffic, Top Traffic Categories

**RPC & Pathfinding (StatsD)** (`statsd-rpc-pathfinding.json`, uid: `rippled-statsd-rpc`):

- RPC Request Rate, Response Time p95/p50, Response Size p95/p50, Pathfinding Fast/Full Duration, Resource Warnings/Drops, Response Time Heatmap

### Exit Criteria

- [ ] StatsD metrics visible in Prometheus (`curl localhost:9090/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age`)
- [ ] All 3 new Grafana dashboards load without errors
- [ ] Integration test verifies at least core StatsD metrics (ledger age, peer counts, RPC requests)
- [ ] ~~Meter metrics (`warn`, `drop`) flow correctly after `|m` → `|c` fix~~ — DEFERRED (breaking change, tracked separately)
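For the first criterion, a passing query returns a non-empty `result` array. A minimal sketch of pulling the sample value out of such a response with POSIX `sed` (the JSON literal below is hand-written to show the expected shape, not captured output):

```shell
# Hand-written example of a Prometheus instant-query response (vector type).
resp='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"rippled_LedgerMaster_Validated_Ledger_Age"},"value":[1700000000,"3"]}]}}'

# Pull out the sample value (second element of "value"); an empty result
# means the metric has not reached Prometheus yet.
value=$(printf '%s' "$resp" | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p')
echo "$value"   # the ledger age, in seconds, from the example above
```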

---

## 6.9 Risk Assessment


---

## 6.8 Success Metrics
## 6.10 Success Metrics

| Metric | Target | Measurement |
| ------------------------ | ------------------------------ | --------------------- |

---

## 6.9 Effort Summary
## 6.11 Effort Summary


---

## 6.10 Quick Wins and Crawl-Walk-Run Strategy
## 6.12 Quick Wins and Crawl-Walk-Run Strategy

This section outlines a prioritized approach to maximize ROI with minimal initial investment.

### 6.10.1 Crawl-Walk-Run Overview
### 6.12.1 Crawl-Walk-Run Overview


### 6.10.2 Quick Wins (Immediate Value)
### 6.12.2 Quick Wins (Immediate Value)

| Quick Win | Effort | Value | When to Deploy |
| ------------------------------ | -------- | ------ | -------------- |
| **Transaction Submit Tracing** | 1 day | High | Week 3 |
| **Consensus Round Duration** | 1 day | Medium | Week 6 |

### 6.10.3 CRAWL Phase (Weeks 1-2)
### 6.12.3 CRAWL Phase (Weeks 1-2)

**Goal**: Get basic tracing working with minimal code changes.

- No cross-node complexity
- Single file modification to existing code

### 6.10.4 WALK Phase (Weeks 3-5)
### 6.12.4 WALK Phase (Weeks 3-5)

**Goal**: Add transaction lifecycle tracing across nodes.

- Moderate complexity (requires context propagation)
- High value for debugging transaction issues

### 6.10.5 RUN Phase (Weeks 6-9)
### 6.12.5 RUN Phase (Weeks 6-9)

**Goal**: Full observability including consensus.

- Requires thorough testing
- Lower relative value (consensus issues are rarer)

### 6.10.6 ROI Prioritization Matrix
### 6.12.6 ROI Prioritization Matrix


---

## 6.11 Definition of Done
## 6.13 Definition of Done

Clear, measurable criteria for each phase.

### 6.11.1 Phase 1: Core Infrastructure
### 6.13.1 Phase 1: Core Infrastructure

| Criterion | Measurement | Target |
| --------------- | ---------------------------------------------------------- | ---------------------------- |

**Definition of Done**: All criteria met, PR merged, no regressions in CI.

### 6.11.2 Phase 2: RPC Tracing
### 6.13.2 Phase 2: RPC Tracing

| Criterion | Measurement | Target |
| ------------------ | ---------------------------------- | -------------------------- |

**Definition of Done**: RPC traces visible in Jaeger/Tempo for all commands, dashboard shows latency distribution.

### 6.11.3 Phase 3: Transaction Tracing
### 6.13.3 Phase 3: Transaction Tracing

| Criterion | Measurement | Target |
| ---------------- | ------------------------------- | ---------------------------------- |

**Definition of Done**: Transaction traces span 3+ nodes in test network, performance within bounds.

### 6.11.4 Phase 4: Consensus Tracing
### 6.13.4 Phase 4: Consensus Tracing

| Criterion | Measurement | Target |
| -------------------- | ----------------------------- | ------------------------- |

**Definition of Done**: Consensus rounds fully traceable, no impact on consensus timing.

### 6.11.5 Phase 5: Production Deployment
### 6.13.5 Phase 5: Production Deployment

| Criterion | Measurement | Target |
| ------------ | ---------------------------- | -------------------------- |

**Definition of Done**: Telemetry running in production, operators trained, alerts active.

### 6.11.6 Success Metrics Summary
### 6.13.6 Success Metrics Summary

| Phase | Primary Metric | Secondary Metric | Deadline |
| ------- | ---------------------- | --------------------------- | ------------- |

---

## 6.12 Recommended Implementation Order
## 6.14 Recommended Implementation Order

Based on ROI analysis, implement in this exact order:

88 changes: 81 additions & 7 deletions docker/telemetry/TESTING.md
See the "Verification Queries" section below.

## Expected Span Catalog

All 16 production span names instrumented across Phases 2-5:

| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
| `rpc.command.<name>` | RPCHandler.cpp:161 | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
| `tx.process` | NetworkOPs.cpp:1227 | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
| `tx.receive` | PeerImp.cpp:1273 | 3 | `xrpl.peer.id` | Peer relays transaction |
| `tx.apply` | BuildLedger.cpp:88 | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | 4 | `xrpl.consensus.round` | Consensus proposing phase |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
| `consensus.validation.send` | RCLConsensus.cpp:753 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
| `ledger.build` | BuildLedger.cpp:31 | 5 | `xrpl.ledger.seq` | Ledger build |
| `ledger.validate` | LedgerMaster.cpp:915 | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
| `ledger.store` | LedgerMaster.cpp:409 | 5 | `xrpl.ledger.seq` | Ledger stored |
| `peer.proposal.receive` | PeerImp.cpp:1667 | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
| `peer.validation.receive` | PeerImp.cpp:2264 | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |

---

curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
# Query traces by operation
for op in "rpc.request" "rpc.process" \
"rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
"tx.process" "tx.receive" \
"tx.process" "tx.receive" "tx.apply" \
"consensus.proposal.send" "consensus.ledger_close" \
"consensus.accept" "consensus.validation.send"; do
"consensus.accept" "consensus.validation.send" \
"ledger.build" "ledger.validate" "ledger.store" \
"peer.proposal.receive" "peer.validation.receive"; do
count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
| jq '.data | length')
  printf "%-35s %s traces\n" "$op" "$count"
done

# Span-derived RED metrics: per-command call counts from spanmetrics
curl -s "$PROM/api/v1/query?query=traces_span_metrics_calls_total{span_name=~\"rpc.command.*\"}" \
  | jq '.data.result[] | {command: .metric["xrpl.rpc.command"], count: .value[1]}'
```

### StatsD Metrics (beast::insight)

rippled's built-in `beast::insight` framework emits StatsD metrics over UDP to the OTel Collector
on port 8125. These appear in Prometheus alongside spanmetrics.

Requires `[insight]` config in `xrpld.cfg`:

```ini
[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled
```
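Before wiring up rippled itself, the UDP path can be smoke-tested with a hand-built datagram in the same line format (`rippled.smoke.test` is a made-up metric name; the `nc` send is commented out because it needs the collector stack running):

```shell
# A StatsD counter datagram, as beast::insight would emit it.
datagram='rippled.smoke.test:1|c'
echo "$datagram"

# Send it to the collector's statsd receiver (uncomment with the stack up):
# printf '%s' "$datagram" | nc -u -w1 127.0.0.1 8125
```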

Verify StatsD metrics in Prometheus:

```bash
# Ledger age gauge
curl -s "$PROM/api/v1/query?query=rippled_LedgerMaster_Validated_Ledger_Age" | jq '.data.result'

# Peer counts
curl -s "$PROM/api/v1/query?query=rippled_Peer_Finder_Active_Inbound_Peers" | jq '.data.result'

# RPC request counter
curl -s "$PROM/api/v1/query?query=rippled_rpc_requests" | jq '.data.result'

# State accounting
curl -s "$PROM/api/v1/query?query=rippled_State_Accounting_Full_duration" | jq '.data.result'

# Overlay traffic
curl -s "$PROM/api/v1/query?query=rippled_total_Bytes_In" | jq '.data.result'
```
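The plain-URL curls above work for bare metric names, but selectors with braces, quotes, or `[5m]` range vectors need URL encoding. A small helper that delegates the encoding to curl (the name `promq` is made up here):

```shell
PROM="${PROM:-http://localhost:9090}"

# Query Prometheus with the expression passed through --data-urlencode,
# so regex selectors and range vectors survive intact.
promq() {
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$1"
}

# Example (requires the stack running):
# promq 'rate(rippled_rpc_requests[5m])'
echo "promq defined"
```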

Key StatsD metrics (prefix `rippled_`):

| Metric | Type | Source |
| ------------------------------------- | --------- | ----------------------------------------- |
| `LedgerMaster_Validated_Ledger_Age` | gauge | LedgerMaster.h:373 |
| `LedgerMaster_Published_Ledger_Age` | gauge | LedgerMaster.h:374 |
| `State_Accounting_{Mode}_duration` | gauge | NetworkOPs.cpp:774 |
| `State_Accounting_{Mode}_transitions` | gauge | NetworkOPs.cpp:780 |
| `Peer_Finder_Active_Inbound_Peers` | gauge | PeerfinderManager.cpp:214 |
| `Peer_Finder_Active_Outbound_Peers` | gauge | PeerfinderManager.cpp:215 |
| `Overlay_Peer_Disconnects` | gauge | OverlayImpl.h:557 |
| `job_count` | gauge | JobQueue.cpp:26 |
| `rpc_requests` | counter | ServerHandler.cpp:108 |
| `rpc_time` | histogram | ServerHandler.cpp:110 |
| `rpc_size` | histogram | ServerHandler.cpp:109 |
| `ios_latency` | histogram | Application.cpp:438 |
| `pathfind_fast` | histogram | PathRequests.h:23 |
| `pathfind_full` | histogram | PathRequests.h:24 |
| `ledger_fetches` | counter | InboundLedgers.cpp:44 |
| `ledger_history_mismatch` | counter | LedgerHistory.cpp:16 |
| `warn` | counter | Logic.h:33 |
| `drop` | counter | Logic.h:34 |
| `{category}_Bytes_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |
| `{category}_Messages_In/Out` | gauge | OverlayImpl.h:535 (57 traffic categories) |

### Grafana

Open http://localhost:3000 (anonymous admin access enabled).

Pre-configured dashboards (span-derived):

- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

Pre-configured dashboards (StatsD):

- **Node Health (StatsD)**: Validated/published ledger age, operating mode, I/O latency, job queue
- **Network Traffic (StatsD)**: Peer counts, disconnects, overlay traffic by category
- **RPC & Pathfinding (StatsD)**: RPC request rate/time/size, pathfinding duration, resource warnings

Pre-configured datasources:

3 changes: 2 additions & 1 deletion docker/telemetry/docker-compose.yml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "8125:8125/udp" # StatsD UDP (beast::insight metrics)
- "8889:8889" # Prometheus metrics (spanmetrics + statsd)
- "13133:13133" # Health check
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro