Commit fc1ed3c

Add span/metric/Grafana reference to telemetry runbook
Documents the complete telemetry pipeline: 10 span names with source files and attributes, 4 Prometheus metrics from the spanmetrics connector, dimension-to-label mappings, and per-panel PromQL queries for all 3 Grafana dashboards. Includes a summary table mapping each span to its Prometheus filter and Grafana dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent c700d2b commit fc1ed3c

1 file changed

+112
-5
lines changed

docs/telemetry-runbook.md

@@ -56,13 +56,120 @@ cmake --build --preset default
5656
| `use_tls` | `0` | Use TLS for exporter connection |
5757
| `tls_ca_cert` | (empty) | Path to CA certificate bundle |
5858

59-
## Grafana Dashboards
59+
## Span Reference
60+
61+
All spans instrumented in rippled, grouped by subsystem:
62+
63+
### RPC Spans (Phase 2)
64+
65+
| Span Name | Source File | Attributes | Description |
66+
| -------------------- | --------------------- | ------------------------------------------------------- | -------------------------------------------------- |
67+
| `rpc.request` | ServerHandler.cpp:271 | | Top-level HTTP RPC request |
68+
| `rpc.process` | ServerHandler.cpp:573 | | RPC processing (child of `rpc.request`) |
69+
| `rpc.ws_message` | ServerHandler.cpp:384 | | WebSocket RPC message |
70+
| `rpc.command.<name>` | RPCHandler.cpp:161 | `xrpl.rpc.command`, `xrpl.rpc.version`, `xrpl.rpc.role` | Per-command span (e.g., `rpc.command.server_info`) |
71+
72+
### Transaction Spans (Phase 3)
73+
74+
| Span Name | Source File | Attributes | Description |
75+
| ------------ | ------------------- | ----------------------------------------------- | ------------------------------------- |
76+
| `tx.process` | NetworkOPs.cpp:1227 | `xrpl.tx.hash`, `xrpl.tx.local`, `xrpl.tx.path` | Transaction submission and processing |
77+
| `tx.receive` | PeerImp.cpp:1273 | `xrpl.peer.id` | Transaction received from peer relay |
78+
79+
### Consensus Spans (Phase 4)
80+
81+
| Span Name | Source File | Attributes | Description |
82+
| --------------------------- | -------------------- | ---------------------------------------------------------- | ---------------------------- |
83+
| `consensus.proposal.send` | RCLConsensus.cpp:177 | `xrpl.consensus.round` | Consensus proposal broadcast |
84+
| `consensus.ledger_close` | RCLConsensus.cpp:282 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.mode` | Ledger close event |
85+
| `consensus.accept` | RCLConsensus.cpp:395 | `xrpl.consensus.proposers`, `xrpl.consensus.round_time_ms` | Ledger accepted by consensus |
86+
| `consensus.validation.send` | RCLConsensus.cpp:753 | `xrpl.consensus.ledger.seq`, `xrpl.consensus.proposing` | Validation sent after accept |
87+
88+
## Prometheus Metrics (Spanmetrics)
89+
90+
The OTel Collector's spanmetrics connector automatically derives RED (Rate, Errors, Duration) metrics from every span. No custom metrics code is needed in rippled.
91+
92+
### Generated Metric Names
93+
94+
| Prometheus Metric | Type | Description |
95+
| -------------------------------------------------- | --------- | ---------------------------- |
96+
| `traces_span_metrics_calls_total` | Counter | Total span invocations |
97+
| `traces_span_metrics_duration_milliseconds_bucket` | Histogram | Latency distribution buckets |
98+
| `traces_span_metrics_duration_milliseconds_count` | Histogram | Latency observation count |
99+
| `traces_span_metrics_duration_milliseconds_sum` | Histogram | Cumulative latency |
60100

61-
Three dashboards are pre-provisioned:
101+
### Metric Labels
102+
103+
Every metric carries these standard labels:
104+
105+
| Label | Source | Example |
106+
| -------------- | ------------------ | ---------------------------------------- |
107+
| `span_name` | Span name | `rpc.command.server_info` |
108+
| `status_code` | Span status | `STATUS_CODE_UNSET`, `STATUS_CODE_ERROR` |
109+
| `service_name` | Resource attribute | `rippled` |
110+
| `span_kind` | Span kind | `SPAN_KIND_INTERNAL` |
111+
112+
Additionally, span attributes configured as dimensions in the collector become metric labels (dots → underscores):
113+
114+
| Span Attribute | Metric Label | Applies To |
115+
| --------------------- | --------------------- | ------------------------------ |
116+
| `xrpl.rpc.command` | `xrpl_rpc_command` | `rpc.command.*` spans |
117+
| `xrpl.rpc.status` | `xrpl_rpc_status` | `rpc.command.*` spans |
118+
| `xrpl.consensus.mode` | `xrpl_consensus_mode` | `consensus.ledger_close` spans |
119+
| `xrpl.tx.local` | `xrpl_tx_local` | `tx.process` spans |
120+
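The attribute-to-label mapping in the table above can be sketched in a few lines of Python. This is an illustrative sketch only: `attribute_to_label` and `label_matcher` are hypothetical helper names, the value `"true"` is an assumed example value for `xrpl.tx.local`, and the code covers only the dot-to-underscore rule stated above, not any further sanitization the collector or exporter may apply.

```python
def attribute_to_label(attribute: str) -> str:
    """Convert a span attribute name to its metric label name (dots -> underscores)."""
    return attribute.replace(".", "_")


def label_matcher(attribute: str, value: str) -> str:
    """Build a PromQL equality matcher for a span attribute (hypothetical helper)."""
    return f'{attribute_to_label(attribute)}="{value}"'


# The four dimensions listed in the table above.
for attr in ("xrpl.rpc.command", "xrpl.rpc.status", "xrpl.consensus.mode", "xrpl.tx.local"):
    print(attr, "->", attribute_to_label(attr))

# Example matcher; "true" is an assumed value, not taken from the runbook.
print(label_matcher("xrpl.tx.local", "true"))
```

A helper like this is mainly useful when generating dashboard queries from a list of span attributes, so that the label names stay in step with the collector's dimension list.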
121+
### Histogram Buckets
122+
123+
Configured in `otel-collector-config.yaml`:
124+
125+
```
126+
1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s
127+
```
128+
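Because latencies are recorded only as bucket counts, the percentiles shown on the dashboards are estimates whose precision depends on the bucket boundaries above. The following Python sketch approximates the linear interpolation that PromQL's `histogram_quantile` performs within a bucket. It is a simplified illustration, not the Prometheus implementation: `estimate_quantile` is a hypothetical function, and the cumulative counts are made-up sample data.

```python
import math

# Bucket upper bounds in milliseconds from the list above, plus the implicit +Inf bucket.
BOUNDS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, math.inf]


def estimate_quantile(q: float, cumulative_counts: list) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts."""
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    upper = BOUNDS_MS[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return float(BOUNDS_MS[-2])
    lower = BOUNDS_MS[index - 1] if index > 0 else 0.0
    below = cumulative_counts[index - 1] if index > 0 else 0
    in_bucket = cumulative_counts[index] - below
    if in_bucket == 0:
        return float(lower)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative counts, one per bound in BOUNDS_MS.
sample = [0, 10, 40, 80, 95, 99, 100, 100, 100, 100, 100]
print(estimate_quantile(0.50, sample))
print(estimate_quantile(0.95, sample))
```

With this layout, any span slower than 5s lands in the +Inf bucket, so a reported percentile of 5000 ms should be read as "at least 5s" rather than as an exact value.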
129+
## Grafana Dashboards
62130

63-
1. **RPC Performance** — Request rate, latency percentiles, error rate by command
64-
2. **Transaction Overview** — Processing rate, latency, sync/async distribution
65-
3. **Consensus Health** — Round duration, proposal rate, validation rate
131+
Three dashboards are pre-provisioned in `docker/telemetry/grafana/dashboards/`:
132+
133+
### RPC Performance (`rippled-rpc-perf`)
134+
135+
| Panel | Type | PromQL | Labels Used |
136+
| --------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
137+
| RPC Request Rate by Command | timeseries | `sum by (xrpl_rpc_command) (rate(traces_span_metrics_calls_total{span_name=~"rpc.command.*"}[5m]))` | `xrpl_rpc_command` |
138+
| RPC Latency p95 by Command | timeseries | `histogram_quantile(0.95, sum by (le, xrpl_rpc_command) (rate(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])))` | `xrpl_rpc_command` |
139+
| RPC Error Rate | bargauge | Error spans / total spans × 100, grouped by `xrpl_rpc_command` | `xrpl_rpc_command`, `status_code` |
140+
| RPC Latency Heatmap | heatmap | `sum(increase(traces_span_metrics_duration_milliseconds_bucket{span_name=~"rpc.command.*"}[5m])) by (le)` | `le` (bucket boundaries) |
141+
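The RPC Error Rate panel is described above only as a ratio, so the arithmetic can be illustrated with a short Python sketch. It assumes per-series call counts have already been fetched; `error_rate_percent` is a hypothetical function, the sample values are invented, and the dashboard's actual PromQL is not reproduced here.

```python
from collections import defaultdict


def error_rate_percent(samples):
    """Return {command: error percentage} from (labels, calls) pairs.

    Each labels dict is expected to carry `xrpl_rpc_command` and `status_code`.
    """
    errors = defaultdict(float)
    totals = defaultdict(float)
    for labels, calls in samples:
        command = labels["xrpl_rpc_command"]
        totals[command] += calls
        if labels["status_code"] == "STATUS_CODE_ERROR":
            errors[command] += calls
    # Error spans / total spans x 100, grouped by command.
    return {cmd: 100.0 * errors[cmd] / total for cmd, total in totals.items() if total > 0}


# Invented sample data for illustration.
samples = [
    ({"xrpl_rpc_command": "server_info", "status_code": "STATUS_CODE_UNSET"}, 98),
    ({"xrpl_rpc_command": "server_info", "status_code": "STATUS_CODE_ERROR"}, 2),
    ({"xrpl_rpc_command": "account_info", "status_code": "STATUS_CODE_UNSET"}, 50),
]
print(error_rate_percent(samples))
```

Commands with no error series still appear in the result with a rate of 0, which matches what a reader would expect from a per-command bar gauge.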
142+
### Transaction Overview (`rippled-transactions`)
143+
144+
| Panel | Type | PromQL | Labels Used |
145+
| --------------------------------- | ---------- | -------------------------------------------------------------------------------------------- | --------------- |
146+
| Transaction Processing Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m])`, plus the same query with `span_name="tx.receive"` | `span_name` |
147+
| Transaction Processing Latency | timeseries | `histogram_quantile(0.95, ... {span_name="tx.process"})` and `histogram_quantile(0.50, ... {span_name="tx.process"})` | |
148+
| Transaction Path Distribution | piechart | `sum by (xrpl_tx_local) (rate(traces_span_metrics_calls_total{span_name="tx.process"}[5m]))` | `xrpl_tx_local` |
149+
| Transaction Receive vs Suppressed | timeseries | `rate(traces_span_metrics_calls_total{span_name="tx.receive"}[5m])` | |
150+
151+
### Consensus Health (`rippled-consensus`)
152+
153+
| Panel | Type | PromQL | Labels Used |
154+
| ----------------------------- | ---------- | ---------------------------------------------------------------------------------- | ----------- |
155+
| Consensus Round Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.accept"})` and `histogram_quantile(0.50, ... {span_name="consensus.accept"})` | |
156+
| Consensus Proposals Sent Rate | timeseries | `rate(traces_span_metrics_calls_total{span_name="consensus.proposal.send"}[5m])` | |
157+
| Ledger Close Duration | timeseries | `histogram_quantile(0.95, ... {span_name="consensus.ledger_close"})` | |
158+
| Validation Send Rate | stat | `rate(traces_span_metrics_calls_total{span_name="consensus.validation.send"}[5m])` | |
159+
160+
### Span → Metric → Dashboard Summary
161+
162+
| Span Name | Prometheus Metric Filter | Grafana Dashboard |
163+
| --------------------------- | ----------------------------------------- | ---------------------------------- |
164+
| `rpc.request` | `{span_name="rpc.request"}` | — (available but not paneled) |
165+
| `rpc.process` | `{span_name="rpc.process"}` | — (available but not paneled) |
166+
| `rpc.command.*` | `{span_name=~"rpc.command.*"}` | RPC Performance (all 4 panels) |
167+
| `tx.process` | `{span_name="tx.process"}` | Transaction Overview (3 panels) |
168+
| `tx.receive` | `{span_name="tx.receive"}` | Transaction Overview (2 panels) |
169+
| `consensus.accept` | `{span_name="consensus.accept"}` | Consensus Health (Round Duration) |
170+
| `consensus.proposal.send` | `{span_name="consensus.proposal.send"}` | Consensus Health (Proposals Rate) |
171+
| `consensus.ledger_close` | `{span_name="consensus.ledger_close"}` | Consensus Health (Close Duration) |
172+
| `consensus.validation.send` | `{span_name="consensus.validation.send"}` | Consensus Health (Validation Rate) |
66173

67174
## Troubleshooting
68175
