Summary
We propose to use a Grafana instance managed by the XRPL Foundation to unify telemetry sent by nodes whose operators opt in to participate.
Useless data in the telemetry can be filtered before it is sent, to reduce storage and egress costs for operators.
Sensitive data in the logs can be redacted before it is sent, to protect private information of operators, such as IP addresses.
Each operator can optionally deploy their own Grafana instance and query their own unfiltered and unredacted telemetry.
1 Introduction
Instrumentation is a vital component of every software system, because it enables proactive monitoring of its health and makes it possible to troubleshoot the system during and after incidents.
The rippled binary currently has rudimentary instrumentation in place, whereby only statsd metrics and unstructured logs are emitted. What to do with the captured telemetry is up to each individual node operator. Ripple, for instance, has a Grafana Cloud instance configured that collects the logs and system metrics for both of its mainnet nodes, allowing its teams to monitor and debug their behavior; the statsd metrics are currently not used.
Due to the inherently distributed nature of the XRPL, analyzing the telemetry of just one node yields only a partial picture when a problem affects the ledger. In recent incidents it has therefore been necessary to reach out to other node operators and request their logs to get additional perspectives. Although those logs were helpful, they lacked sufficient depth to be truly useful, as other operators typically do not capture them at debug level.
In this document we propose an opt-in solution to achieve unified telemetry. Each node operator is encouraged to make their telemetry available to the other operators, allowing joint troubleshooting on the shared data.
2 Proposed solution
We propose that node operators install a Grafana component named Alloy on a machine in their network, which would scrape logs and metrics from rippled and send them to a Grafana instance managed by the XRPL Foundation where the logs are ingested by Loki and the metrics by Prometheus. To protect sensitive data and to save costs, Alloy can be configured to filter out and/or redact specific content from the logs and metrics.
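As an illustration, a minimal Alloy configuration for the log pipeline could look roughly like the sketch below. The log path, node label, filter pattern, and push URL are placeholders, and the metrics side (e.g. bridging the statsd output into Prometheus) is omitted; the real configuration would be tailored per operator.

```
// Tail the rippled debug log from the shared volume (path and label are assumptions).
local.file_match "rippled" {
  path_targets = [{"__path__" = "/mnt/alloy/debug.log", "node" = "operator-a-validator"}]
}

loki.source.file "rippled" {
  targets    = local.file_match.rippled.targets
  forward_to = [loki.process.redact.receiver]
}

// Drop unneeded messages and redact IPv4 addresses before anything leaves the network.
loki.process "redact" {
  stage.drop {
    expression = "Heartbeat"   // example filter, to be tuned per operator
  }
  stage.replace {
    expression = "(?P<ip>\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})"
    replace    = "<redacted>"
  }
  forward_to = [loki.write.foundation.receiver]
}

// Push the filtered and redacted stream to the Foundation-managed instance (URL is hypothetical).
loki.write "foundation" {
  endpoint {
    url = "https://telemetry.xrplf.example/loki/api/v1/push"
  }
}
```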
Node operators who participate in the sharing of their telemetry are given a user account on the Grafana instance, from where they can query the telemetry of any and all node operators. Each node is given a unique identifier, which makes it possible to drill down into the logs and metrics of a specific node by issuing a query containing that identifier.
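For example, a log query in the shared Grafana instance could be scoped to a single node via its identifier label; the label name `node` and the value below are assumptions carried over from the Alloy sketch above.

```
{node="operator-a-validator"} |= "Consensus"
```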
A node operator can optionally deploy their own Grafana instance containing the Loki and Prometheus components on the same machine in their network to access the unfiltered and unredacted telemetry locally. This is also useful when an operator runs proxy nodes in their network whose telemetry they choose not to send to the Grafana instance managed by the Foundation.
An overview of the setup for a node operator is shown in Figure 1, and how multiple node operators work together is shown in Figure 2.
Figure 1. Overview of telemetry collection for an operator that manages a validator node (server A) and two proxy nodes (servers B and C), using a separate machine for processing the telemetry. The logs and metrics are scraped from all servers by the Alloy component, which can apply an initial filter to drop unnecessary logs and metrics. A node operator can optionally install Loki and Prometheus, as well as the UI to query their own telemetry. Before sending the data to the Grafana instance managed by the Foundation, additional filtering can be performed to save on egress costs, as well as redacting to protect sensitive information.
Figure 2. Overview of unifying the collected telemetry from multiple node operators on a machine managed by the Foundation. The Alloy component in the node operator pushes the filtered and redacted logs and metrics to the Loki and Prometheus components, respectively, in the Grafana instance managed by the Foundation. Each node operator receives a Grafana user account, so they can log into the instance and issue queries via the UI.
We further propose that the Foundation assists the node operators with how to deploy and configure the Grafana components, via documentation and tutorials. In principle, any node operator can be part of this effort, not only those on the UNL. This notwithstanding, the participation of UNL nodes is crucial given their outsized role in the functioning of the XRPL, while that of non-UNL nodes is less so. To stimulate collaboration between node operators we will require that anyone who wishes to query the logs and metrics of other operators also make their own data available to all other participants.
In Section 4 we will discuss in more detail how we will implement this solution.
2.1 Cost estimates
2.1.1 Storage
Logs
Node operators
Participating nodes should enable debug logging, which needs approximately 30 GB of hard disk space each day. However, as the logs will be directly transmitted to the Grafana instance managed by the Foundation and are no longer needed for the purposes of this proposal, the operator is free to purge them afterwards.
An operator would generally want to keep the logs around for longer, however, so they can debug their nodes themselves if needed. If an operator is interested in running their own Grafana instance for local debugging, we recommend keeping the logs for a minimum of 30 days, and ideally for a whole year, to enable historical analyses.
As Loki uses compression, the roughly 1 TB of raw logs per month (30 GB per day over 30 days) would shrink by 5-10x, requiring 200 GB or less of actual storage for its index and chunks on a rolling basis. This is an upper bound, since we expect unnecessary log messages to be aggressively filtered out by node operators before being stored.
Assuming the machine is hosted in the cloud, using an AWS EC2 instance as an example, an EBS volume would be needed, which costs $0.08 per GB-month, or about $16 per month for 200 GB. As node operators generally already log at info level, the incremental cost is only the difference in disk space between info and debug logging.
Foundation
Assuming one year of historical data with older data being purged daily, the logs of each operator would total some 200 GB after the first month, 400 GB after the second month, and so on until the year mark is reached, at which point the storage plateaus at about 2.4 TB per operator; each month a new 200 GB is added while the oldest 200 GB is deleted. At $0.08 per GB-month this would cost up to $192 per operator per month after a year of operation.
Metrics
Compared to logs, metrics take up negligible space on disk.
Traces
Traces are not currently captured, and we do not yet have a good idea of what capturing them would look like in practice. Presumably the trace ID would be the ledger ID, but how to keep track of messages being sent concurrently and sequentially needs deeper thought. We also do not yet have a good estimate of how much trace data would be generated and stored.
We will treat traces as being out-of-scope in the context of this proposal.
2.1.2 Compute
A separate machine or pod would likely need to be deployed to run the Grafana components, unless a suitable machine is already available that can be utilized. For a midrange machine, such as the t3.medium EC2 instance on AWS in us-east-2, this would cost $0.0416 per hour, or $30 per 30 days.
2.1.3 Network
Ingress typically is free, while egress is not. On AWS, egress from EC2 to the internet to a region such as us-east-2 costs $0.09 per GB. Like the data stored on disk, the data transmitted by Grafana is also compressed, which should yield a 5-10x reduction in size.
Node operators
The 200 GB of compressed data to be sent would cost about $18 per month at $0.09 per GB. The actual amount can be lower depending on how much filtering of logs and metrics is performed.
Foundation
The egress cost will be variable, as it depends on how many queries are issued and consequently how much data is returned. Assuming a monthly average of 1,000 heavy queries returning 100 MB of data each (e.g. actual debugging requests needing logs) and 100,000 light queries returning 1 MB each (e.g. status updates), roughly 200 GB of egress would be generated, costing approximately $20 per 30 days.
2.2 Advantages
2.2.1 Managed by the community
Each node operator can decide to participate in this effort. The XRPL Foundation will be tasked with coordinating with all participating node operators on how to configure their instances, as well as maintaining the Grafana instance.
2.2.2 Maintains privacy
UNL operators wish to preserve the privacy of their nodes as much as possible, e.g. they typically connect to the XRPL via proxies to keep their IP address hidden. This proposal does not meaningfully increase the risk, as Grafana can be hosted on a separate machine and its communications can also be routed through proxies. Moreover, sensitive data can be filtered or redacted before it is transmitted.
2.2.3 Increases monitoring
Even though most, if not all, operators have some monitoring in place for their node, they are currently left to their own devices. By providing instructions on how to set up and configure Grafana fully, we essentially establish an acceptable baseline of monitoring. As node operators grow more comfortable with Grafana and improve their queries and dashboards, they can share these with the community. In essence, our proposal can help make the XRPL more reliable.
2.3 Disadvantages
2.3.1 Data cost
The proposal relies on enabling debug-level instead of info-level logging, which increases the required storage. After some filtering, all telemetry is transmitted to the Grafana instance whether or not it is eventually used. The cost to node operators is approximately $64 per month ($16 storage, $30 compute, $18 network), while the cost to the Foundation is about $50 per month ($30 compute, $20 network) plus up to $192 per node operator for storage.
2.3.2 Data completeness
Volunteering node operators assume the responsibility for maintaining yet another system, and may not allocate sufficient resources to ensure it remains fully functional at all times. In that case, their data may not be transmitted to the Grafana instance and thus will not be available in times of need. However, this problem affects all alternative options: if the telemetry is not accessible, irrespective of where it is hosted, it cannot be queried.
3 Alternatives considered
In this section we present several alternatives to the solution proposed in Section 2. Table 1 shows a high-level summary of all options, with the individual sections below describing the alternatives in more detail.
Note: The advantages and disadvantages are presented in relation to each other. For instance, all options support filtering and redacting of data, so this aspect is not mentioned in the table.
Note: The descriptions focus on AWS purely as an example. Other clouds or on-prem solutions may be feasible.
Table 1. Summary of the various proposed and alternative options.
3.1 Cross-cluster query federation
Grafana's Enterprise edition includes extra features compared to the cheaper Cloud and free Open Source editions. In particular, it supports cross-cluster query federation via a federation-frontend component, not available in the other editions, which can reach out to multiple Grafana instances and seamlessly aggregate their results.
In this option, each node operator would need to install the Alloy, Loki, Prometheus, and Mimir components with multi-tenant support enabled. Mimir is needed because Prometheus does not support multi-tenancy.
User authentication would be performed via an identity provider, such as Okta or Keycloak, which would associate each node operator with a tenant identifier. The Foundation would be responsible for setting this up. Each operator would then be able to issue, from their own Grafana instance, a federated query naming the tenants they wish to consult; the query is transmitted to all nodes associated with those tenants and the returned results are combined. The Foundation would not need to deploy a Grafana instance.
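As a rough sketch of what such a query could look like against Loki's HTTP API, the tenants to consult might be listed in the X-Scope-OrgID header. The endpoint, tenant names, and exact federation mechanics are assumptions; the Enterprise federation-frontend may expose this differently.

```bash
# Hypothetical federated log query across three operators' tenants.
curl -G "https://grafana.operator-a.example/loki/api/v1/query_range" \
  -H "X-Scope-OrgID: operator-a|operator-b|operator-c" \
  --data-urlencode 'query={node=~".+"} |= "Consensus"' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z'
```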
In this option the Foundation would manage the contract with Grafana and request an Enterprise license on behalf of each participating node operator. Node operators would be expected to pay the license fee to the Foundation if they wish to participate.
Discussion
The advantage of cross-cluster query federation is that the data stays local to each node operator, substantially reducing egress costs, as data would only leave the network as the result of a federated query issued by another operator. However, node operators would need to retain their telemetry long enough for it to be useful; we suggest keeping a rolling window of one year of data, which would cost around $192 per month as calculated earlier.
A downside of this option is the additional latency to get query results, since data is returned from a multitude of nodes and then aggregated before it can be displayed. During a significant outage or an actively exploited vulnerability, time is of the essence, and this latency may affect the ability to respond efficiently.
Another disadvantage of this approach is that each node operator would need to open their firewall to grant access to all the machines hosting a Grafana instance of another operator. Each federated query would need to exhaustively list all tenant identifiers of the nodes to consult. If a node joins or leaves the effort, which may happen several times a year, the firewall and queries would need to be updated.
To obtain Enterprise licenses the Foundation would need to reach out to Grafana's sales department for a quote. Anecdotally, people have reported a cost of around $750 per license per year, with a minimum total commitment of $25k. Ultimately, this alternative is too expensive and requires too much ongoing work for both the operators and the Foundation.
3.2 Cross-cluster telemetry aggregation
A reasonable workaround exists for the cross-cluster query federation option described above, so that Enterprise licenses are not needed. Namely, Grafana supports defining a Mixed data source that can combine the results of multiple data sources, which can be followed by a Transformation that further processes this data. For instance, the logs of multiple operators can be aggregated, after which they are ordered by timestamp.
Each node operator would need to add the Loki and Prometheus endpoints of each other operator as data sources in their Grafana instance; Mimir is not needed in this option. The Foundation would not need to deploy a Grafana instance.
Discussion
The telemetry would also remain local to each node operator, while the free open source version of Grafana can be used. However, similar shortcomings as with the cross-cluster query federation option apply: increased query response latency and the need to make updates whenever operator participation changes.
It is also currently unclear whether mixed queries can produce the aggregate view we are looking for, as we have not used this feature before. While a single graph can show lines coming from multiple data sources, we do not know whether it is possible to interleave log data. That said, interleaving log data may not be that useful anyway, given that the wall clock time of nodes can differ, making it difficult to track the exact sequence of events. In this option it is still possible to query the logs of a specific node.
3.3 Hub telemetry aggregation
This option combines the proposed solution with the cross-cluster telemetry aggregation. The node operators install Alloy, Loki, and Prometheus on their machine, while the Foundation maintains the Grafana instance that all operators can access. The Foundation manages the Mixed data sources, which fetch the logs and metrics by querying the respective Loki and Prometheus endpoints hosted by the node operators, as well as applying any Transformations to combine the data. Operators receive a user account on the Foundation instance from where they can issue their queries.
Discussion
In this option the node operators still need to permit inbound access to their network, but only from the Grafana instance managed by the Foundation. Furthermore, the list of data sources only needs to be curated by the Foundation, since all querying takes place on the Grafana instance it manages.
This is an attractive option, as ownership of the logs and metrics remains with the node operators, with data leaving their networks only when a query specifically requests it, while minimizing the effort required from all parties. However, as queries still need to reach out to all operators, the response latency will be higher than if the telemetry were hosted locally in the Foundation instance.
3.4 Storage syncing
This option is a variant of the proposed solution. Here, the node operators install Alloy, Loki and Prometheus, as well as Mimir. Loki supports various storage backends, such as S3. Prometheus does not support them, but Mimir does.
Depending on preference, the node operator can manage the S3 storage backend and enable replication to another location under the control of the Foundation. Alternatively, the Foundation manages the storage backend, to which the data is written directly.
The Foundation-managed Grafana instance would install Loki and Mimir, and configure their storage backends to use the S3 buckets in which the logs and metrics of all participating node operators are stored. For each node operator the Foundation would need to configure a separate data source, and they would need to maintain the Mixed data source and any needed Transformations.
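A minimal sketch of what pointing Loki at such a bucket could look like is shown below, assuming a recent schema; the bucket name and region are placeholders, and a real configuration would need index and retention settings tuned for this workload.

```yaml
# Hypothetical Loki storage settings for a Foundation-managed S3 bucket.
storage_config:
  aws:
    region: us-east-2
    bucketnames: xrplf-telemetry-operator-a   # placeholder bucket name

schema_config:
  configs:
    - from: "2025-01-01"
      store: tsdb
      object_store: aws
      schema: v13
      index:
        prefix: index_
        period: 24h
```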
If the Foundation manages the S3 bucket then there would be no storage cost for node operators, only the egress costs to write the telemetry to it. As S3 costs $0.023 per GB-month, it would cost the Foundation about $5 per participating node operator for each month of data retained. When an S3 bucket is used, no other storage would be needed.
Discussion
This option is currently hypothetical, based on an analysis of the Grafana documentation. Filtering and redacting need to happen before data is replicated, which prevents a node operator who also installs Grafana locally from querying their own data in full, unless the data is duplicated into additional data sources without filtering and redaction.
Overall, this option offers no advantage over the proposed solution while costing more.
3.5 Log syncing
In another variant of the proposed solution, the node operators do not expose any metrics (as in the status quo) but replicate the debug logs to an S3 bucket managed by the Foundation, for instance by syncing them with the AWS CLI. Filtering and redacting need to happen before the data is replicated, which requires a separate shell script that performs operations such as regex replacement.
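A minimal sketch of such a script is shown below; the paths, bucket name, and filter patterns are placeholders, and a real version would need to handle log rotation and incremental uploads.

```bash
#!/usr/bin/env bash
# Hypothetical filter-redact-and-sync script for the log syncing option.
set -euo pipefail

SRC=/var/log/rippled/debug.log            # raw debug log written by rippled (placeholder path)
OUTDIR=/var/tmp/rippled-redacted          # sanitized copies staged for upload
BUCKET=s3://xrplf-telemetry/operator-a    # Foundation-managed bucket (placeholder)

mkdir -p "$OUTDIR"

# Drop noisy lines and mask IPv4 addresses before anything leaves the network.
grep -v 'Heartbeat' "$SRC" \
  | sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/<redacted>/g' \
  > "$OUTDIR/$(hostname)-$(date +%Y%m%dT%H%M).log"

# Replicate the sanitized logs to the Foundation-managed bucket.
aws s3 sync "$OUTDIR" "$BUCKET/$(hostname)/"
```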
The Foundation then periodically ingests the logs of each node operator into their Grafana instance, associating each log file with its node operator so that the source of each record remains identifiable. In this option no separate machine running Grafana components needs to be deployed in the network of a node operator, since the raw logs are transferred directly.
Discussion
This option reduces the effort required from a node operator, as they only need to set up a periodic sync of the logs to S3, while increasing the burden on the Foundation to ensure the logs are ingested into the central Grafana instance as soon as they arrive. Depending on the frequency of ingestion, there may be a long delay between when something happens on the ledger and when the corresponding logs become available for querying. The lack of metrics may further hinder debugging.
3.6 Non-Grafana
There are many other dashboarding, monitoring, and alerting solutions besides Grafana. We should not blindly use Grafana just because we are familiar with it.
Discussion
The reason the proposal focuses on Grafana is that many of the developers working on rippled have built up expertise with the product, and have less experience with other solutions, such as DataDog.
3.7 Do nothing
The XRPL has been running nearly uninterrupted for over 10 years, with only short halts that were easy to resolve or even resolved on their own. When there is a problem, the node operators have been coordinating among themselves, and if logs are needed, these are shared at a reasonable pace. This proposal will demand extra investments in time and money from node operators and the Foundation for only a small benefit. The operators are already busy enough without additionally having to focus on setting up Grafana.
Discussion
While the sentiment is understandable, participation is voluntary so a node operator does not have to configure a Grafana instance if they do not want to. We believe that having access to more telemetry can be beneficial, but we do not need all 35 UNL validators to participate - just a few would already be helpful.
Even though the XRPL has been running mostly fine, as transaction volumes have grown lately, we have also seen more issues with the ledger. The rippled code is complex, and as new features keep getting added, the chances of unexpected interactions cropping up also increase. We need to get ahead of these problems, and having as complete a picture as possible of what is happening will be crucial.
4 Implementation
4.1 Configuring
The rippled.cfg file can be trivially changed so the rippled binary will write debug logs and expose statsd metrics, e.g. by setting these sections as follows:
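A minimal sketch of what these sections could look like is shown below; the log path matches the /mnt/alloy volume discussed next, while the statsd address and metric prefix are placeholders to be adapted per node.

```
# Write debug-level logs to the shared volume tailed by Alloy.
[debug_logfile]
/mnt/alloy/debug.log

[rpc_startup]
{ "command": "log_level", "severity": "debug" }

# Emit statsd metrics (address and prefix are placeholders).
[insight]
server=statsd
address=127.0.0.1:8125
prefix=operator_a_validator
```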
The /mnt/alloy path can then be a volume mounted read+write by rippled and read-only by the Alloy component. There are also alternatives to mounting a volume, such as writing to a file and tailing it from another machine over SSH. As each node operator has a different network and machine setup, a tailored approach will be needed. The Foundation will coordinate with each operator and assist with configuring the nodes to instrument.
Node operators may need to make adjustments to their firewall rules to allow the logs and metrics to be scraped from their servers by Alloy, as well as possible changes to the rules to allow them to be pushed to the Grafana instance. The Foundation would similarly need to modify their firewall to expose their Grafana instance, so that the pushed telemetry can be received, and node operators can log into their accounts. Each time the list of participating operators changes the Foundation will need to add and/or remove inbound routes as appropriate, and update the user accounts.
A node operator can further choose to only make the logs and metrics of their principal node available (e.g. the validator), while keeping the telemetry of the other nodes (e.g. the proxies) internal. Each node should be assigned a unique name, so they are distinguishable from the nodes of other operators.
4.2 Deploying
To deploy the Grafana components, we propose to supply both a Docker image and a shell script, so that each node operator can use whichever they prefer. The installer can deploy the Grafana components on any machine, which can be the server where rippled runs or a different one. We will provide a similar image or script for the Foundation.
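For instance, a Docker-based deployment on the operator side could be sketched roughly as follows; image tags, ports, and the mounted configuration files are placeholders, and Loki, Prometheus, and the Grafana UI remain optional local components.

```yaml
# Hypothetical operator-side stack; the Foundation-side compose file would be
# similar minus Alloy, plus user accounts for the participating operators.
services:
  alloy:
    image: grafana/alloy:latest
    command: ["run", "/etc/alloy/config.alloy"]
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy:ro
      - /mnt/alloy:/mnt/alloy:ro        # rippled debug logs, mounted read-only
  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki/loki.yaml:/etc/loki/local-config.yaml:ro
      - loki-data:/loki
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3000:3000"
volumes:
  loki-data:
```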
The node operators and the Foundation should collaborate on best practices for what data to show in the default dashboard widgets and what queries should be offered by default, to make monitoring and troubleshooting as easy as possible. Any updates to the Grafana configurations should be shared with the other participants.
We suggest creating a new XRPLF repository where the Docker images, shell scripts, and Grafana configurations can be hosted.
5 Conclusion
The proposed solution leverages Grafana's free open source version to collect logs and metrics from participating node operators and to make them available for querying. The Foundation will be tasked with assisting the node operators with deploying and configuring the Grafana components in their networks, as well as with hosting the Grafana instance from where the unified telemetry can be queried. The estimated costs for a participating node operator are $64 per month, while for the Foundation the costs are $50 per month plus an additional $192 per node operator. Although alternative approaches exist, these are either more expensive, less functional, more complex, and/or require more effort to maintain.