Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 16 additions & 6 deletions docker/telemetry/TESTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -370,7 +370,7 @@ See the "Verification Queries" section below.

## Expected Span Catalog

All 12 production span names instrumented across Phases 2-4:
All 16 production span names instrumented across Phases 2-5:

| Span Name | Source File | Phase | Key Attributes | How to Trigger |
| --------------------------- | --------------------- | ----- | ---------------------------------------------------------- | ------------------------- |
Expand All @@ -380,10 +380,16 @@ All 12 production span names instrumented across Phases 2-4:
| `rpc.command.<name>` | RPCHandler.cpp:161 | 2 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Any RPC command |
| `tx.process` | NetworkOPs.cpp:1227 | 3 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Submit transaction |
| `tx.receive` | PeerImp.cpp:1273 | 3 | `xrpl.peer.id` | Peer relays transaction |
| `tx.apply` | BuildLedger.cpp:88 | 5 | `xrpl.ledger.tx_count`, `xrpl.ledger.tx_failed` | Ledger close (tx set) |
| `consensus.proposal.send` | RCLConsensus.cpp:177 | 4 | `xrpl.consensus.round` | Consensus proposing phase |
| `consensus.ledger_close` | RCLConsensus.cpp:282 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
| `consensus.accept` | RCLConsensus.cpp:395 | 4 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted |
| `consensus.validation.send` | RCLConsensus.cpp:753 | 4 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent |
| `ledger.build` | BuildLedger.cpp:31 | 5 | `xrpl.ledger.seq` | Ledger build |
| `ledger.validate` | LedgerMaster.cpp:915 | 5 | `xrpl.ledger.seq`, `xrpl.ledger.validations` | Ledger validated |
| `ledger.store` | LedgerMaster.cpp:409 | 5 | `xrpl.ledger.seq` | Ledger stored |
| `peer.proposal.receive` | PeerImp.cpp:1667 | 5 | `xrpl.peer.id`, `xrpl.peer.proposal.trusted` | Peer sends proposal |
| `peer.validation.receive` | PeerImp.cpp:2264 | 5 | `xrpl.peer.id`, `xrpl.peer.validation.trusted` | Peer sends validation |

---

Expand All @@ -405,9 +411,11 @@ curl -s "$JAEGER/api/services/rippled/operations" | jq '.data'
# Query traces by operation
for op in "rpc.request" "rpc.process" \
"rpc.command.server_info" "rpc.command.server_state" "rpc.command.ledger" \
"tx.process" "tx.receive" \
"tx.process" "tx.receive" "tx.apply" \
"consensus.proposal.send" "consensus.ledger_close" \
"consensus.accept" "consensus.validation.send"; do
"consensus.accept" "consensus.validation.send" \
"ledger.build" "ledger.validate" "ledger.store" \
"peer.proposal.receive" "peer.validation.receive"; do
count=$(curl -s "$JAEGER/api/traces?service=rippled&operation=$op&limit=5&lookback=1h" \
| jq '.data | length')
printf "%-35s %s traces\n" "$op" "$count"
Expand Down Expand Up @@ -440,9 +448,11 @@ Open http://localhost:3000 (anonymous admin access enabled).

Pre-configured dashboards:

- **RPC Performance**: Request rates, latency percentiles by command
- **Transaction Overview**: Transaction processing rates and paths
- **Consensus Health**: Consensus round duration and proposer counts
- **RPC Performance**: Request rates, latency percentiles by command, top commands, WebSocket rate
- **Transaction Overview**: Transaction processing rates, apply duration, peer relay, failed tx rate
- **Consensus Health**: Consensus round duration, proposer counts, mode tracking, accept heatmap
- **Ledger Operations**: Build/validate/store rates and durations, TX apply metrics
- **Peer Network**: Proposal/validation receive rates, trusted vs untrusted breakdown (requires `trace_peer=1`)

Pre-configured datasources:

Expand Down
168 changes: 160 additions & 8 deletions docker/telemetry/grafana/dashboards/consensus-health.json
Original file line number Diff line number Diff line change
Expand Up @@ -8,72 +8,109 @@
"panels": [
{
"title": "Consensus Round Duration",
"description": "p95 and p50 duration of consensus accept rounds. The consensus.accept span (RCLConsensus.cpp:395) measures the time to process an accepted ledger including transaction application and state finalization. The span carries xrpl.consensus.proposers and xrpl.consensus.round_time_ms attributes. Normal range is 3-6 seconds on mainnet.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "p95 round duration"
"legendFormat": "P95 Round Duration"
},
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.50, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])))",
"legendFormat": "p50 round duration"
"legendFormat": "P50 Round Duration"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Proposals Sent Rate",
"description": "Rate at which this node sends consensus proposals to the network. Sourced from the consensus.proposal.send span (RCLConsensus.cpp:177) which fires each time the node proposes a transaction set. The span carries xrpl.consensus.round identifying the consensus round number. A healthy proposing node should show steady proposal output.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.proposal.send\"}[5m]))",
"legendFormat": "proposals/sec"
"legendFormat": "Proposals / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops"
"unit": "ops",
"custom": {
"axisLabel": "Proposals / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Ledger Close Duration",
"description": "p95 duration of the ledger close event. The consensus.ledger_close span (RCLConsensus.cpp:282) measures the time from when consensus triggers a ledger close to completion. Carries xrpl.consensus.ledger.seq and xrpl.consensus.mode attributes. Compare with Consensus Round Duration to understand how close timing relates to overall round time.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "histogram_quantile(0.95, sum by (le) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.ledger_close\"}[5m])))",
"legendFormat": "p95 close duration"
"legendFormat": "P95 Close Duration"
}
],
"fieldConfig": {
"defaults": {
"unit": "ms"
"unit": "ms",
"custom": {
"axisLabel": "Duration (ms)",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation Send Rate",
"description": "Rate at which this node sends ledger validations to the network. Sourced from the consensus.validation.send span (RCLConsensus.cpp:753). Each validation confirms the node has fully validated a ledger. The span carries xrpl.consensus.ledger.seq and xrpl.consensus.proposing. Should closely track the ledger close rate when the node is healthy.",
"type": "stat",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "validations/sec"
"legendFormat": "Validations / Sec"
}
],
"fieldConfig": {
Expand All @@ -82,6 +119,121 @@
},
"overrides": []
}
},
{
"title": "Consensus Mode Over Time",
"description": "Breakdown of consensus ledger close events by the node's consensus mode (proposing, observing, wrongLedger, switchedLedger). Grouped by the xrpl.consensus.mode span attribute from consensus.ledger_close. A healthy validator should be predominantly in 'proposing' mode. Frequent 'wrongLedger' or 'switchedLedger' indicates sync issues.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum by (xrpl_consensus_mode) (rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "{{xrpl_consensus_mode}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Accept vs Close Rate",
"description": "Compares the rate of consensus.accept (ledger accepted after consensus) vs consensus.ledger_close (ledger close initiated). These should track closely in a healthy network. A divergence means some close events are not completing the accept phase, potentially indicating consensus failures or timeouts.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.accept\"}[5m]))",
"legendFormat": "Accepts / Sec"
},
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Validation vs Close Rate",
"description": "Compares the rate of consensus.validation.send vs consensus.ledger_close. Each validated ledger should produce one validation message. If validations lag behind closes, the node may be falling behind on validation or experiencing issues with the validation pipeline.",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.validation.send\"}[5m]))",
"legendFormat": "Validations / Sec"
},
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(traces_span_metrics_calls_total{span_name=\"consensus.ledger_close\"}[5m]))",
"legendFormat": "Closes / Sec"
}
],
"fieldConfig": {
"defaults": {
"unit": "ops",
"custom": {
"axisLabel": "Events / Sec",
"spanNulls": true,
"insertNulls": false,
"showPoints": "auto",
"pointSize": 3
}
},
"overrides": []
}
},
{
"title": "Consensus Accept Duration Heatmap",
"description": "Heatmap showing the distribution of consensus.accept span durations across histogram buckets over time. Each cell represents how many accept events fell into that duration bucket in a 5m window. Useful for detecting outlier consensus rounds that take abnormally long.",
"type": "heatmap",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
"options": {
"tooltip": { "mode": "multi", "sort": "desc" },
"yAxis": { "axisLabel": "Duration (ms)" }
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=\"consensus.accept\"}[5m])) by (le)",
"legendFormat": "{{le}}",
"format": "heatmap"
}
]
}
],
"schemaVersion": 39,
Expand Down
Loading
Loading