Expose WebSocket connection count as Prometheus metrics and some refactors #337


Open

wants to merge 12 commits into main

Conversation

Patel-Raj11

High Level Overview of Change

This PR accomplishes the following two tasks:

  1. Deprecate the ws_url and connected columns of the crawls table and add a connection_health table to capture their true values.
  2. Expose the number of active WebSocket connections for each network in Prometheus exposition format (a sample of the format is shown below).
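
For illustration, the exposition output for a network would look something like the sample below (the metric and label names here are illustrative assumptions, not taken verbatim from this PR):

# HELP node_websocket_connections Number of active WebSocket connections for the network
# TYPE node_websocket_connections gauge
node_websocket_connections{network="main"} 42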

Context of Change

  1. The ws_url and connected columns of the crawls table were not capturing all the active WebSocket connections, since we also create additional WebSocket connections using the entry column of the networks table.
    This change adds a connection_health table that captures all WebSocket connections and their status update time. It also fixes the bug where the connected column of crawls remained stale even after losing a WebSocket connection.
  2. Exposes a /metrics/:network endpoint that can be queried by Grafana to capture the number of connected nodes for each network and configure alerts.
  3. Refactors the code to use the xrpl.js library to fetch the EnableAmendment transaction and the Amendments ledger entry, as opposed to listening for WebSocket events in setHandlers, in order to make setHandlers cleaner (see the sketch after this list).
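
A minimal sketch of that xrpl.js approach (the Amendments ledger entry lives at a well-known fixed index on XRPL networks; the endpoint URL and exact request shape here are assumptions, not the code from this PR):

import { Client } from 'xrpl'

// The Amendments singleton ledger object has a fixed, well-known index.
const AMENDMENTS_INDEX =
  '7DB0788C020F02780A673DC74757F23823FA3014C1866E72CC4CD8B226CD6EF4'

const client = new Client('wss://s1.ripple.com') // illustrative endpoint
await client.connect()
const amendments = await client.request({
  command: 'ledger_entry',
  index: AMENDMENTS_INDEX,
  ledger_index: 'validated',
})
await client.disconnect()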

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

Before / After

No change in the working of validator-history-service.

Test Plan

Added tests in the /tests directory to check the /metrics/:network endpoint, database updates, selects, and utils methods introduced by this change.
I will deploy this branch on staging to verify it works before releasing to production.

@mvadari
Collaborator

mvadari commented Apr 28, 2025

Should these metrics only be exposed with a token of some sort, so they're not public?

@Patel-Raj11
Author

Should these metrics only be exposed with a token of some sort, so they're not public?

These metrics are similar to what we had before, but now we are making them network-specific and more accurate. I think they can remain public as they are health-check metrics. Do you see a particular reason to make this endpoint authenticated?

@Patel-Raj11 Patel-Raj11 requested review from ckeshava and pdp2121 April 28, 2025 16:30
@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

I would prefer these metrics to be tracked via Slack rather than exposed publicly via an API, since this is being used for internal monitoring purposes.

@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

Would be something similar to what we do with VCR:
https://github.com/xpring-eng/validation-count-reporter

@Patel-Raj11
Author

Patel-Raj11 commented Apr 28, 2025

I would prefer these metrics to be tracked via Slack rather than exposed publicly via an API, since this is being used for internal monitoring purposes.

@pdp2121

These metrics are going to be consumed by Grafana, since they are exposed in Prometheus exposition format for health checks, and in turn we will receive Slack alerts from Grafana (as we do in the #ripplex-alerting-dge-non-prod Slack channel). In my opinion this is a better approach, since we decouple alerting from application logic and can move to a different alerting application like PagerDuty without modifying application logic at all. Just my thoughts, what do you think?

However, if exposing the number of connected nodes as a public API is a concern, then are we going to deprecate the /health endpoint as well? That endpoint has been exposing the same metric, but in JSON format and for all nodes.

I can create a separate PR if we opt for a push-based approach to send messages to Slack, since it would be cleaner to review separately.

@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

These metrics are going to be consumed by Grafana, since they are exposed in Prometheus exposition format for health checks, and in turn we will receive Slack alerts from Grafana (as we do in the #ripplex-alerting-dge-non-prod Slack channel). In my opinion this is a better approach, since we decouple alerting from application logic and can move to a different alerting application like PagerDuty without modifying application logic at all. Just my thoughts, what do you think?

If that's the case then I'm okay with exposing this data publicly, since we already list the number of connected nodes in the health endpoint and we don't show this endpoint on the info list. @mvadari please let us know if you have any concerns.

Comment on lines +61 to +71
if (await db().schema.hasColumn('crawls', 'ws_url')) {
await db().schema.alterTable('crawls', (table) => {
table.dropColumn('ws_url')
})
}

if (await db().schema.hasColumn('crawls', 'connected')) {
await db().schema.alterTable('crawls', (table) => {
table.dropColumn('connected')
})
}
Contributor

What are the implications of deleting this data? The crawls table has been populated over several iterations. I'd imagine that it would take at least a few days to re-populate the connection_health table with the equivalent set of records.

Author

We are removing the ws_url and connected columns from the crawls table. Those were populated when we created a connection to a node using the vhs-connections service. There are no implications to deleting this data. The corresponding columns in connection_health will be populated once the service starts.

Contributor

select count(distinct public_key) from public.crawls c where c.ws_url is not null; // returns 1089 in current database

I'm trying to understand how many days are required for the connection_health table to collect all the rows that are already present in the current crawls table. Since the crawling process happens in an iterative manner, I expect it will take at least a week(?) to achieve parity with the current data.

From this perspective, should we work on migrating the data from the old crawls table to the new connection_health table? Does anybody know how database schema changes were handled in prior versions of VHS?

Collaborator

It should not take more than a couple of minutes after redeployment. We already have the crawls table, so whenever WebSocket connections are established we should have data populated into this table.

I don't think we need to migrate data here. We just need to remove those columns from the crawls table on redeployment by calling a knex drop-column query. The connected column reflects the current WebSocket connection, so no data is being lost here.

Author

The removed columns get filled in instantly in the connection_health table when the service starts, for the rows that we care about. The purpose of connection_health is different from that of the crawls table, so I don't see a reason to replicate the two and bring unnecessary rows from crawls into connection_health.

Collaborator

Just FYI, the code in the setup file will run once whenever the service restarts.

if (!hasConnectionHealth) {
await db().schema.createTable('connection_health', (table) => {
table.string('ws_url').primary()
table.string('public_key').references('crawls.public_key')
Contributor

Many nodes (as identified by their public_keys) use the same ws_url endpoint for communication. I have observed this empirically in the databases.

select count(distinct ws_url) from public.crawls c ; // returns 538
select count(distinct public_key) from public.crawls c where c.ws_url is not NULL; // returns 1087

If we choose ws_url as the primary key of this table, then the public_key column will need to accommodate multiple values.

Author

Patel-Raj11 commented Apr 29, 2025

Existing design:

While creating a connection, we pick the records from the crawls table having start > now() - 10 minutes. So currently, if a node changes its public key (due to upgrading its rippled server version, purposefully changing it, or switching from one network to another) while still listening on the same ws_url, a new row is inserted into the crawls table with a recent start date, since we crawl every 2 minutes.

Now, since we pick only the rows having a recent start date (i.e. less than 10 minutes old), the other old rows for the same ws_url are no longer relevant and are stale.

Design after introducing connection_health:

If a node switches its public key and keeps serving from the same ws_url, we would update the existing row in the connection_health table to reflect the latest state (see the upsert sketch below), and since public_key is mapped as a foreign key to public_key in the crawls table, we can always extract details of that node.

I don't see a reason to have multiple public keys for a ws_url in the connection_health table, as its purpose is to show the current state of connections.

For ref:

select ws_url from crawls where ws_url is not null group by ws_url having count(crawls.public_key)> 1 ;
select * from crawls where ws_url = 'ws://135.181.212.21:51233/';
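
For illustration, the "update the existing row" behavior could be a knex upsert keyed on ws_url. A minimal sketch (column names other than ws_url and public_key are assumptions):

await db('connection_health')
  .insert({
    ws_url: wsUrl, // primary key: one row per live endpoint
    public_key: publicKey, // latest key observed at this endpoint
    connected: true,
    status_update_time: new Date(),
  })
  .onConflict('ws_url') // merge over any existing row for this ws_url
  .merge()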

Collaborator

Should we make both ws_url and pubkey primary? If 2 nodes share the same ws_url, we still open 2 WebSocket connections, which would create 2 rows in this table.

Author

@pdp2121 ws_url comprises IP and port information, so two nodes will be serving from two different ports even if they run on a single machine. So ws_url would still be unique.

Moreover, a composite primary key cannot include a nullable column, and public_key can be null for the entries fetched from the networks table.

Collaborator

Makes sense to me

Contributor

@Patel-Raj I'm still confused

I don't see a reason to have multiple public keys for a ws_url in connection_health

Can you explain the results of this query?

SELECT ws_url, COUNT(public_key) AS shared_public_key_count FROM crawls WHERE ws_url IS NOT null GROUP BY ws_url HAVING COUNT(public_key) > 1;
[screenshot of query results: 176 ws_urls shared by multiple public_keys]

This result indicates that each of these 176 ws_url entries is shared by multiple unique public_keys. Your proposed design will overwrite the existing public_key information. Aren't we losing useful data in this process?

Now, since we pick only the rows having a recent start date (i.e. less than 10 minutes old), the other old rows for the same ws_url are no longer relevant and are stale.

Am I understanding correctly that rows older than 10 minutes are not useful in the crawls table? I'm working on a different task for optimizing the space usage of the database. I think we should purge old data and keep the crawls table lean.

Author

The 176 ws_url entries have port information along with their protocol and IP, as seen in the screenshot. So one ws_url at a time can have only one public_key (and thus one node) actively listening for a WebSocket connection.

Also, the multiple public keys that we see are from different start times. You can pick one of the ws_urls and query the crawls table to see the start times; they are not in use at the same time.

I don't think we lose any information.

Yes, rows older than 10 minutes are stale and of no use. We can write a cleanup job to remove those.

@@ -227,6 +227,21 @@ interface AmendmentInfo {
deprecated: boolean
}

interface ConnectionHealth {
ws_url: string
public_key?: string
Contributor

Why is this parameter optional? If we do not have the public_key of a node, we will not be able to verify the correctness of its digital signatures. Am I missing any cases where we do not have a node's public key?

Author

const networksDb = await getNetworks()
networksDb.forEach((network) => {
  nodes.push({
    ip: network.entry,
    ws_url: '',
    networks: network.id,
  })
})

The rows that are picked by this query don't have a public key, and this was the primary reason to introduce the connection_health table: so that we can keep track of connections that don't have a public key.

At one point I thought about removing this query, and all the connections we create from it, altogether, but then we would be left with only one node for Xahau Mainnet, and that is not ideal (for now it's Xahau Mainnet, but it could occur with Testnet or other sidechains in the future). Also, the crawls table is not a good place to store connection health, so having an optional public_key helps us track the entire state of connections.

)

if (connections.get(ws.url)?.url === ws.url) {
connections.delete(ws.url)
Contributor

I'm trying to understand the use of the connections: Map<string, WebSocket> map. I understand that it is a cached value representing a membership-summary of the Connection-Health table.

Can't we simply read from the table instead of maintaining this cache? I believe databases are optimized to the hilt today. If we can defer this responsibility onto the db, we can simplify our code base.

Author

Patel-Raj11 commented Apr 29, 2025

This is a good point. Let me try that out. Previously when I tried, it turned out that we were bombarding the database with too many requests on startup and getting deadlocks, since most of the exhaustive connection attempts triggered a close/error callback before even an open callback, and all of those occurred close together.

Collaborator

Yes, I agree we don't need this anymore with the new monitoring added. Good catch.

Author

Removing the in-memory map and directly updating the database when a connection closes brings back this error:

Failed to update connection status for ws://72.44.58.240:6006/: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

I am increasing acquireConnectionTimeout to 2 minutes to see if it helps resolve this (see the config sketch below).
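
For reference, a sketch of where that knob lives in a knex config (the client, connection details, and pool sizes are illustrative assumptions, not this repo's actual setup):

import knex from 'knex'

const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    database: process.env.DB_DATABASE,
    password: process.env.DB_PASSWORD,
  },
  // Raised from knex's 60s default to ride out the burst of close/error
  // callbacks that compete for pool connections on startup.
  acquireConnectionTimeout: 120000,
  pool: { min: 2, max: 10 },
})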

Contributor

Thanks @Patel-Raj. Sorry, I didn't mean to dump this task onto you. I'm happy to take a look at this issue in a later PR. This was a tangential suggestion, not directly related to your goal.

Author

@ckeshava The suggested change and the acquireConnectionTimeout increase to 2 minutes are already in the PR, just in case you missed them in review.

*/
export async function getNodes(sinceStartDate: Date): Promise<WsNode[]> {
return query('crawls as c')
.leftJoin('connection_health as ch', 'c.public_key', 'ch.public_key')
Contributor

Since we are doing a join operation on the ch table, it becomes imperative to preserve all values of ch.public_key. I'm stating this because ch.ws_url is the PK of the ch table.

Author

I think, due to the reason mentioned in #337 (comment), we can have only ws_url as the primary key: at any one point in time there will be one connection per ws_url, since it contains IP and port information.

@@ -77,4 +83,26 @@ describe('Amendments', () => {
)
expect(amendmentStatus[0].eta).toBe(null)
})

Contributor

Can you add unit tests for the other new methods pertaining to handling the amendments?

Author

I have added the test "save Amendments ledger_entry", which tests fetchAmendmentsFromLedgerEntry to check that the data returned from ledger_entry is of the correct shape and gets saved properly.

For processEnableAmendmentTransaction, the data returned by the ledger command is the same as in the existing tests (e.g. ledgerResponseNoFlag, ledgerResponseGotMajority, etc.), so it already gets tested, as we are in turn calling the same method, handleWsMessageLedgerEnableAmendments.

There is no logic here in these methods. What else do you think can be tested here?

Contributor

I was referring to the checkAndHandleEnableAmendmentLedger method, which is newly added in this PR. Is this method indirectly tested via any other unit tests? (I don't believe that is the case)

Author

Patel-Raj11 commented May 2, 2025

checkAndHandleEnableAmendmentLedger is just a refactor of the previous if conditions to make them cleaner, and it's not exported, so I cannot test it in isolation. Also, the entire setHandlers is not tested and I don't have a clear idea of how to test it. We can take up the task of writing tests for setHandlers as a separate effort.

network: string,
hostName: string,
): Promise<void> {
const allUrls: string[] = []
Collaborator

Can we add logs at the beginning and end of this process?

Collaborator

Same for other processes as well. It would be something like starting ... and finished ..., as in the sketch below.
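
A hedged sketch of that suggestion (logger name and messages are illustrative, not code from this PR):

log.info(`Starting WebSocket connection setup for network ${network}`)
// ... establish connections for each URL in allUrls ...
log.info(`Finished WebSocket connection setup for network ${network}`)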

@Patel-Raj11 Patel-Raj11 requested review from ckeshava and pdp2121 April 30, 2025 13:21
@@ -17,6 +17,7 @@ DB_HOST=
DB_USER=
DB_DATABASE=
DB_PASSWORD=
ACQUIRE_CONNECTION_TIMEOUT=
Collaborator

Why do we need this to be an env variable instead of a constant?

Author

To remain consistent with the other database configs. Having it as an env variable lets us change it without changing the code, and lets us use different values for staging, dev, etc. (see the sketch below).
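
For instance (a minimal sketch; the 60-second fallback is an assumption matching knex's documented default):

const acquireConnectionTimeout = Number(
  process.env.ACQUIRE_CONNECTION_TIMEOUT ?? 60000,
)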

@@ -82,6 +82,17 @@ const info = {
example:
'https://data.xrpl.org/v1/network/amendments/vote/{network}/{identifier}',
},
{
Collaborator

Since this is for internal use, I don't think we need to expose it in info.

Author

@ckeshava If you don't feel strongly about having this and /health, I can remove it from here and ARCHITECTURE.md.

@pdp2121 Do we remove the /health endpoint from ARCHITECTURE.md as well? It's been there from the beginning.

Contributor

Even if it's for internal use, I feel good documentation is helpful for future development.

@pdp2121 Are there any concerns about privacy/security? Are there any disadvantages to exposing it?

example: 'https://data.xrpl.org/v1/health',
},
{
action:
Collaborator

ditto

`Websocket closed for ${ws.url} on ${
networks ?? 'unknown network'
} with code ${code} and reason ${reason.toString('utf-8')}.`,
`Websocket closed for ${
Collaborator

I'm not sure if we need this log now with your change. It seems to flood the logs atm.

Author

@pdp2121 This is only flooding the logs initially, when the service starts up. At any other point in time it will give valuable information about when a connection was terminated and for what reason. We can keep these two logs for now and remove them later if we don't see any value. What do you think?

`Websocket connection error for ${ws.url} on ${
networks ?? 'unknown network'
} - ${err.message}`,
`Websocket connection error for ${ws.url} on ${network} - ${err.message}`,
Collaborator

Same here. Since we try multiple ports for a URL, this will raise lots of errors that have no important implications.
