Commit e2b2589

Phase 5b: Add ledger/peer/tx spans + expand Grafana dashboards
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 85f583f commit e2b2589

12 files changed: 940 additions and 51 deletions

docker/telemetry/TESTING.md

Lines changed: 16 additions & 6 deletions
@@ -370,7 +370,7 @@ See the "Verification Queries" section below.

 ## Expected Span Catalog

-All 12 production span names instrumented across Phases 2-4:
+All 16 production span names instrumented across Phases 2-5:

 | Span Name | Source File | Phase | Key Attributes | How to Trigger |
 | --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
@@ -380,10 +380,16 @@ All 12 production span names instrumented across Phases 2-4:
 | `rpc.command.<name>` | RPCHandler.cpp:161 | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
 | `tx.process` | NetworkOPs.cpp:1227 | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
 | `tx.receive` | PeerImp.cpp:1273 | 3 | `xrpl.peer.id` | Peer relays transaction |
+| `tx.apply` | BuildLedger.cpp:88 | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
 | `consensus.proposal.send` | RCLConsensus.cpp:177 | 4 | `xrpl.consensus.round` | Consensus proposing phase |
 | `consensus.ledger_close` | RCLConsensus.cpp:282 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
 | `consensus.accept` | RCLConsensus.cpp:395 | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
 | `consensus.validation.send` | RCLConsensus.cpp:753 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
+| `ledger.build` | BuildLedger.cpp:31 | 5 | `xrpl.ledger.seq` | Ledger build |
+| `ledger.validate` | LedgerMaster.cpp:915 | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
+| `ledger.store` | LedgerMaster.cpp:409 | 5 | `xrpl.ledger.seq` | Ledger stored |
+| `peer.proposal.receive` | PeerImp.cpp:1667 | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
+| `peer.validation.receive` | PeerImp.cpp:2264 | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |

 ---

@@ -405,9 +411,11 @@ curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
 # Query traces by operation
 for op in "rpc.request" "rpc.process" \
 "rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
-"tx.process" "tx.receive" \
+"tx.process" "tx.receive" "tx.apply" \
 "consensus.proposal.send" "consensus.ledger_close" \
-"consensus.accept" "consensus.validation.send"; do
+"consensus.accept" "consensus.validation.send" \
+"ledger.build" "ledger.validate" "ledger.store" \
+"peer.proposal.receive" "peer.validation.receive"; do
 count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
 | jq '.data | length')
 printf "%-35s %s traces\n" "$op" "$count"
@@ -440,9 +448,11 @@ Open http://localhost:3000 (anonymous admin access enabled).

 Pre-configured dashboards:

-- **RPC Performance**: Request rates, latency percentiles by command
-- **Transaction Overview**: Transaction processing rates and paths
-- **Consensus Health**: Consensus round duration and proposer counts
+- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
+- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
+- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
+- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
+- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

 Pre-configured datasources:

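The per-operation Jaeger check that the TESTING.md hunk extends can also be sketched in Python, which makes the URL construction easy to test without a running stack. This is a minimal sketch, not part of the commit: the function names are hypothetical, and the default base URL `http://localhost:16686` is an assumption, since the value of `$JAEGER` is not shown in the diff.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Span names copied from the updated shell loop in TESTING.md.
OPERATIONS = [
    "rpc.request", "rpc.process",
    "rpc.command.server_info", "rpc.command.server_state", "rpc.command.ledger",
    "tx.process", "tx.receive", "tx.apply",
    "consensus.proposal.send", "consensus.ledger_close",
    "consensus.accept", "consensus.validation.send",
    "ledger.build", "ledger.validate", "ledger.store",
    "peer.proposal.receive", "peer.validation.receive",
]


def trace_query_url(base, operation, service="rippled", limit=5, lookback="1h"):
    """Build the same /api/traces query the shell loop sends with curl."""
    query = urlencode({
        "service": service,
        "operation": operation,
        "limit": limit,
        "lookback": lookback,
    })
    return f"{base}/api/traces?{query}"


def count_traces(payload):
    """Equivalent of `jq '.data | length'`, treating a missing or null list as 0."""
    return len(payload.get("data") or [])


def fetch_trace_count(base, operation, timeout=10.0):
    """Query Jaeger for one operation; requires the telemetry stack to be running."""
    with urlopen(trace_query_url(base, operation), timeout=timeout) as resp:
        return count_traces(json.load(resp))


def report(base="http://localhost:16686"):  # assumed value of $JAEGER
    """Print one line per operation, mirroring the printf in the shell loop."""
    for op in OPERATIONS:
        print(f"{op:<35} {fetch_trace_count(base, op)} traces")
```

Calling `report()` against a running stack should print the same table as the shell loop. The peer spans may show zero traces unless `trace_peer=1` is set, as the dashboard list notes.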
docker/telemetry/grafana/dashboards/consensus-health.json

Lines changed: 160 additions & 8 deletions
@@ -8,72 +8,109 @@
   "panels": [
     {
       "title": "Consensus Round Duration",
+      "description": "p95 and p50 duration of consensus accept rounds. The consensus.accept span (RCLConsensus.cpp:395) measures the time to process an accepted ledger including transaction application and state finalization. The span carries xrpl.consensus.proposers and xrpl.consensus.round_time_ms attributes. Normal range is 3-6 seconds on mainnet.",
       "type": "timeseries",
       "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
       "targets": [
         {
           "datasource": { "type": "prometheus" },
           "expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
-          "legendFormat": "p95 round duration"
+          "legendFormat": "P95 Round Duration"
         },
         {
           "datasource": { "type": "prometheus" },
           "expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
-          "legendFormat": "p50 round duration"
+          "legendFormat": "P50 Round Duration"
         }
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "ms"
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Duration (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
         },
         "overrides": []
       }
     },
     {
       "title": "Consensus Proposals Sent Rate",
+      "description": "Rate at which this node sends consensus proposals to the network. Sourced from the consensus.proposal.send span (RCLConsensus.cpp:177) which fires each time the node proposes a transaction set. The span carries xrpl.consensus.round identifying the consensus round number. A healthy proposing node should show steady proposal output.",
       "type": "timeseries",
       "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
       "targets": [
         {
           "datasource": { "type": "prometheus" },
           "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.proposal.send\"}[5m]))",
-          "legendFormat": "proposals/sec"
+          "legendFormat": "Proposals / Sec"
         }
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "ops"
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Proposals / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
         },
         "overrides": []
       }
     },
     {
       "title": "Ledger Close Duration",
+      "description": "p95 duration of the ledger close event. The consensus.ledger_close span (RCLConsensus.cpp:282) measures the time from when consensus triggers a ledger close to completion. Carries xrpl.consensus.ledger.seq and xrpl.consensus.mode attributes. Compare with Consensus Round Duration to understand how close timing relates to overall round time.",
       "type": "timeseries",
       "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
       "targets": [
         {
           "datasource": { "type": "prometheus" },
           "expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.ledger_close\"}[5m])))",
-          "legendFormat": "p95 close duration"
+          "legendFormat": "P95 Close Duration"
         }
       ],
       "fieldConfig": {
         "defaults": {
-          "unit": "ms"
+          "unit": "ms",
+          "custom": {
+            "axisLabel": "Duration (ms)",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
         },
         "overrides": []
       }
     },
     {
       "title": "Validation Send Rate",
+      "description": "Rate at which this node sends ledger validations to the network. Sourced from the consensus.validation.send span (RCLConsensus.cpp:753). Each validation confirms the node has fully validated a ledger. The span carries xrpl.consensus.ledger.seq and xrpl.consensus.proposing. Should closely track the ledger close rate when the node is healthy.",
       "type": "stat",
       "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
       "targets": [
         {
           "datasource": { "type": "prometheus" },
           "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
-          "legendFormat": "validations/sec"
+          "legendFormat": "Validations / Sec"
         }
       ],
       "fieldConfig": {
@@ -82,6 +119,121 @@
         },
         "overrides": []
       }
+    },
+    {
+      "title": "Consensus Mode Over Time",
+      "description": "Breakdown of consensus ledger close events by the node's consensus mode (proposing, observing, wrongLedger, switchedLedger). Grouped by the xrpl.consensus.mode span attribute from consensus.ledger_close. A healthy validator should be predominantly in 'proposing' mode. Frequent 'wrongLedger' or 'switchedLedger' indicates sync issues.",
+      "type": "timeseries",
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum by (xrpl_consensus_mode) (rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
+          "legendFormat": "{{xrpl_consensus_mode}}"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Events / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Accept vs Close Rate",
+      "description": "Compares the rate of consensus.accept (ledger accepted after consensus) vs consensus.ledger_close (ledger close initiated). These should track closely in a healthy network. A divergence means some close events are not completing the accept phase, potentially indicating consensus failures or timeouts.",
+      "type": "timeseries",
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.accept\"}[5m]))",
+          "legendFormat": "Accepts / Sec"
+        },
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
+          "legendFormat": "Closes / Sec"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Events / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Validation vs Close Rate",
+      "description": "Compares the rate of consensus.validation.send vs consensus.ledger_close. Each validated ledger should produce one validation message. If validations lag behind closes, the node may be falling behind on validation or experiencing issues with the validation pipeline.",
+      "type": "timeseries",
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
+          "legendFormat": "Validations / Sec"
+        },
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
+          "legendFormat": "Closes / Sec"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "ops",
+          "custom": {
+            "axisLabel": "Events / Sec",
+            "spanNulls": true,
+            "insertNulls": false,
+            "showPoints": "auto",
+            "pointSize": 3
+          }
+        },
+        "overrides": []
+      }
+    },
+    {
+      "title": "Consensus Accept Duration Heatmap",
+      "description": "Heatmap showing the distribution of consensus.accept span durations across histogram buckets over time. Each cell represents how many accept events fell into that duration bucket in a 5m window. Useful for detecting outlier consensus rounds that take abnormally long.",
+      "type": "heatmap",
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
+      "options": {
+        "tooltip": { "mode": "multi", "sort": "desc" },
+        "yAxis": { "axisLabel": "Duration (ms)" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus" },
+          "expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])) by (le)",
+          "legendFormat": "{{le}}",
+          "format": "heatmap"
+        }
+      ]
     }
   ],
   "schemaVersion": 39,

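The `gridPos` values in consensus-health.json place the eight panels in two columns of four rows on Grafana's 24-column grid. A minimal sketch of a layout check follows, assuming the positions shown in the diff above. The helper names are hypothetical and the check is not part of the commit.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid

# Panel titles and gridPos values copied from the consensus-health.json diff.
PANELS = {
    "Consensus Round Duration": {"h": 8, "w": 12, "x": 0, "y": 0},
    "Consensus Proposals Sent Rate": {"h": 8, "w": 12, "x": 12, "y": 0},
    "Ledger Close Duration": {"h": 8, "w": 12, "x": 0, "y": 8},
    "Validation Send Rate": {"h": 8, "w": 12, "x": 12, "y": 8},
    "Consensus Mode Over Time": {"h": 8, "w": 12, "x": 0, "y": 16},
    "Accept vs Close Rate": {"h": 8, "w": 12, "x": 12, "y": 16},
    "Validation vs Close Rate": {"h": 8, "w": 12, "x": 0, "y": 24},
    "Consensus Accept Duration Heatmap": {"h": 8, "w": 12, "x": 12, "y": 24},
}


def overlaps(a, b):
    """Return True when two gridPos rectangles share any area."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
    )


def layout_problems(panels):
    """List panels that exceed the grid width or overlap another panel."""
    problems = []
    for title, pos in panels.items():
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{title}: extends past column {GRID_COLUMNS}")
    for (t1, p1), (t2, p2) in combinations(panels.items(), 2):
        if overlaps(p1, p2):
            problems.append(f"{t1} overlaps {t2}")
    return problems
```

A check of this kind can run over each dashboard file before committing, so that a panel pasted with a stale `y` offset is caught without opening Grafana.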
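The p50 and p95 panels rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified illustration of that idea, not the Prometheus implementation, and the bucket boundaries and counts are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Values are assumed to be spread evenly within each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical consensus.accept durations in milliseconds (illustrative only).
EXAMPLE_BUCKETS = [
    (1000.0, 10),
    (2000.0, 50),
    (5000.0, 90),
    (10000.0, 100),
    (math.inf, 100),
]
```

With these made-up buckets, the 95th-percentile rank lands halfway through the 5000 to 10000 ms bucket, so the estimate depends on where the bucket boundaries sit. A coarse top bucket makes the p95 line on the dashboard correspondingly coarse.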