Expose WebSocket connection count as Prometheus metrics and some refactors #337


Open

wants to merge 12 commits into main

Conversation

Patel-Raj11

High Level Overview of Change

This PR accomplishes the following two tasks:

  1. Deprecate the ws_url and connected columns of the crawls table and add a connection_health table to capture their true values.
  2. Expose the number of active WebSocket connections for each network in Prometheus exposition format (a sample of the format is shown below).
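
For illustration, the exposition output for a network would look something like the sample below (the metric and label names here are illustrative assumptions, not taken verbatim from this PR):

# HELP node_websocket_connections Number of active WebSocket connections for the network
# TYPE node_websocket_connections gauge
node_websocket_connections{network="main"} 42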

Context of Change

  1. The ws_url and connected columns of the crawls table were not capturing all the active WebSocket connections, since we also create additional WebSocket connections using the entry column of the networks table.
    This change adds a connection_health table that captures all WebSocket connections and their status update time. It also fixes the bug where the connected column of crawls remained stale even after losing a WebSocket connection.
  2. Exposes a /metrics/:network endpoint that can be queried by Grafana to capture the number of connected nodes for each network and configure alerts.
  3. Refactors the code to use the xrpl.js library to fetch the EnableAmendment transaction and the Amendments ledger entry, as opposed to listening for WebSocket events in setHandlers, in order to make setHandlers cleaner (see the sketch after this list).
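
A minimal sketch of that xrpl.js approach (the Amendments ledger entry lives at a well-known fixed index on XRPL networks; the endpoint URL and exact request shape here are assumptions, not the code from this PR):

import { Client } from 'xrpl'

// The Amendments singleton ledger object has a fixed, well-known index.
const AMENDMENTS_INDEX =
  '7DB0788C020F02780A673DC74757F23823FA3014C1866E72CC4CD8B226CD6EF4'

const client = new Client('wss://s1.ripple.com') // illustrative endpoint
await client.connect()
const amendments = await client.request({
  command: 'ledger_entry',
  index: AMENDMENTS_INDEX,
  ledger_index: 'validated',
})
await client.disconnect()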

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

Before / After

No change in the working of validator-history-service.

Test Plan

Added tests in the /tests directory to check the /metrics/:network endpoint, database updates, selects, and utils methods introduced by this change.
I will deploy this branch on staging to verify it works before releasing to production.

@mvadari
Collaborator

mvadari commented Apr 28, 2025

Should these metrics only be exposed with a token of some sort, so they're not public?

@Patel-Raj11
Author

Should these metrics only be exposed with a token of some sort, so they're not public?

These metrics are similar to what we had before, but now we are making them network-specific and more accurate. I think they can remain public as they are health-check metrics. Do you see a particular reason to make this endpoint authenticated?

@Patel-Raj11 Patel-Raj11 requested review from ckeshava and pdp2121 April 28, 2025 16:30
@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

I would prefer these metrics to be tracked via Slack rather than exposed publicly via an API, since this is being used for internal monitoring purposes.

@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

Would be something similar to what we do with VCR:
https://github.com/xpring-eng/validation-count-reporter

@Patel-Raj11
Author

Patel-Raj11 commented Apr 28, 2025

I would prefer these metrics to be tracked via Slack rather than exposed publicly via an API, since this is being used for internal monitoring purposes.

@pdp2121

These metrics are going to be consumed by Grafana, since they are exposed in Prometheus exposition format for health checks, and in turn we will receive Slack alerts from Grafana (as we do in the #ripplex-alerting-dge-non-prod Slack channel). In my opinion this is a better approach, since we decouple alerting from application logic and can move to a different alerting application like PagerDuty without modifying application logic at all. Just my thoughts, what do you think?

However, if exposing the number of connected nodes as a public API is a concern, then are we going to deprecate the /health endpoint as well? That endpoint has been exposing the same metric, but in JSON format and for all nodes.

I can create a separate PR if we opt for a push-based approach to send messages to Slack, since it would be cleaner to review separately.

@pdp2121
Collaborator

pdp2121 commented Apr 28, 2025

These metrics are going to be consumed by Grafana, since they are exposed in Prometheus exposition format for health checks, and in turn we will receive Slack alerts from Grafana (as we do in the #ripplex-alerting-dge-non-prod Slack channel). In my opinion this is a better approach, since we decouple alerting from application logic and can move to a different alerting application like PagerDuty without modifying application logic at all. Just my thoughts, what do you think?

If that's the case then I'm okay with exposing this data publicly, since we already list the number of connected nodes in the health endpoint and we don't show this endpoint on the info list. @mvadari please let us know if you have any concerns.

Comment on lines +61 to +71
if (await db().schema.hasColumn('crawls', 'ws_url')) {
await db().schema.alterTable('crawls', (table) => {
table.dropColumn('ws_url')
})
}

if (await db().schema.hasColumn('crawls', 'connected')) {
await db().schema.alterTable('crawls', (table) => {
table.dropColumn('connected')
})
}
Contributor

What are the implications of deleting this data? The crawls table has been populated over several iterations. I'd imagine that it would take at least a few days to re-populate the connection_health table with the equivalent set of records.

Author

We are removing the ws_url and connected columns from the crawls table. Those were populated when we created a connection to a node using the vhs-connections service. There are no implications to deleting this data. The corresponding columns in connection_health will be populated once the service starts.

Contributor

select count(distinct public_key) from public.crawls c where c.ws_url is not null; // returns 1089 in current database

I'm trying to understand how many days are required for the connection_health table to collect all the rows that are already present in the current crawls table. Since the crawling process happens in an iterative manner, I expect it will take at least a week(?) to achieve parity with the current data.

From this perspective, should we work on migrating the data from the old crawls table to the new connection_health table? Does anybody know how database schema changes were handled in prior versions of VHS?

Collaborator

It should not take more than a couple of minutes after redeployment. We already have the crawls table, so whenever WebSocket connections are established we should have data populated into this table.

I don't think we need to migrate data here. We just need to remove those columns from the crawls table on redeployment by calling a knex drop-column query. The connected column reflects the current WebSocket connection, so no data is being lost here.

Author

The removed columns get filled in instantly in the connection_health table when the service starts, for the rows that we care about. The purpose of connection_health is different from that of the crawls table, so I don't see a reason to replicate the two and bring unnecessary rows from crawls into connection_health.

Collaborator

Just FYI, the code in the setup file will run once whenever the service restarts.

if (!hasConnectionHealth) {
await db().schema.createTable('connection_health', (table) => {
table.string('ws_url').primary()
table.string('public_key').references('crawls.public_key')
Contributor

Many nodes (as identified by their public_keys) use the same ws_url endpoint for communication. I have observed this empirically in the databases.

select count(distinct ws_url) from public.crawls c ; // returns 538
select count(distinct public_key) from public.crawls c where c.ws_url is not NULL; // returns 1087

If we choose ws_url as the primary key of this table, then the public_key column will need to accommodate multiple values.

Author

Patel-Raj11 commented Apr 29, 2025

Existing design:

While creating a connection, we pick the records from the crawls table having start > now() - 10 minutes. So currently, if a node changes its public key (due to upgrading its rippled server version, purposefully changing it, or switching from one network to another) while still listening on the same ws_url, a new row is inserted into the crawls table with a recent start date, since we crawl every 2 minutes.

Now, since we pick only the rows having a recent start date (i.e. less than 10 minutes old), the other old rows for the same ws_url are no longer relevant and are stale.

Design after introducing connection_health:

If a node switches its public key and keeps serving from the same ws_url, we would update the existing row in the connection_health table to reflect the latest state (see the upsert sketch below), and since public_key is mapped as a foreign key to public_key in the crawls table, we can always extract details of that node.

I don't see a reason to have multiple public keys for a ws_url in the connection_health table, as its purpose is to show the current state of connections.

For ref:

select ws_url from crawls where ws_url is not null group by ws_url having count(crawls.public_key)> 1 ;
select * from crawls where ws_url = 'ws://135.181.212.21:51233/';
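
For illustration, the "update the existing row" behavior could be a knex upsert keyed on ws_url. A minimal sketch (column names other than ws_url and public_key are assumptions):

await db('connection_health')
  .insert({
    ws_url: wsUrl, // primary key: one row per live endpoint
    public_key: publicKey, // latest key observed at this endpoint
    connected: true,
    status_update_time: new Date(),
  })
  .onConflict('ws_url') // merge over any existing row for this ws_url
  .merge()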

Collaborator

Should we make both ws_url and pubkey primary? If 2 nodes share the same ws_url, we still open 2 WebSocket connections, which would create 2 rows in this table.

Author

@pdp2121 ws_url comprises IP and port information, so two nodes will be serving from two different ports even if they run on a single machine. So ws_url would still be unique.

Moreover, a composite primary key cannot include a nullable column, and public_key can be null for the entries fetched from the networks table.

Collaborator

Makes sense to me

Contributor

@Patel-Raj I'm still confused

I don't see a reason to have multiple public keys for a ws_url in connection_health

Can you explain the results of this query?

SELECT ws_url, COUNT(public_key) AS shared_public_key_count FROM crawls WHERE ws_url IS NOT null GROUP BY ws_url HAVING COUNT(public_key) > 1;
[screenshot of query results: 176 ws_urls shared by multiple public_keys]

This result indicates that each of these 176 ws_url entries is shared by multiple unique public_keys. Your proposed design will overwrite the existing public_key information. Aren't we losing useful data in this process?

Now, since we pick only the rows having a recent start date (i.e. less than 10 minutes old), the other old rows for the same ws_url are no longer relevant and are stale.

Am I understanding correctly that rows older than 10 minutes are not useful in the crawls table? I'm working on a different task for optimizing the space usage of the database. I think we should purge old data and keep the crawls table lean.

Author

The 176 ws_url entries have port information along with their protocol and IP, as seen in the screenshot. So one ws_url at a time can have only one public_key (and thus one node) actively listening for a WebSocket connection.

Also, the multiple public keys that we see are from different start times. You can pick one of the ws_urls and query the crawls table to see the start times; they are not in use at the same time.

I don't think we lose any information.

Yes, rows older than 10 minutes are stale and of no use. We can write a cleanup job to remove those.

@@ -227,6 +227,21 @@ interface AmendmentInfo {
deprecated: boolean
}

interface ConnectionHealth {
ws_url: string
public_key?: string
Contributor

Why is this parameter optional? If we do not have the public_key of a node, we will not be able to verify the correctness of its digital signatures. Am I missing any cases where we do not have a node's public key?

Author

const networksDb = await getNetworks()
networksDb.forEach((network) => {
  nodes.push({
    ip: network.entry,
    ws_url: '',
    networks: network.id,
  })
})

The rows that are picked by this query don't have a public key, and this was the primary reason to introduce the connection_health table: so that we can keep track of connections that don't have a public key.

At one point I thought about removing this query, and all the connections we create from it, altogether, but then we would be left with only one node for Xahau Mainnet, and that is not ideal (for now it's Xahau Mainnet, but it could occur with Testnet or other sidechains in the future). Also, the crawls table is not a good place to store connection health, so having an optional public_key helps us track the entire state of connections.

)

if (connections.get(ws.url)?.url === ws.url) {
connections.delete(ws.url)
Contributor

I'm trying to understand the use of the connections: Map<string, WebSocket> map. I understand that it is a cached value representing a membership-summary of the Connection-Health table.

Can't we simply read from the table instead of maintaining this cache? I believe databases are optimized to the hilt today. If we can defer this responsibility onto the db, we can simplify our code base.

Author

Patel-Raj11 commented Apr 29, 2025

This is a good point. Let me try that out. Previously when I tried, it turned out that we were bombarding the database with too many requests on startup and getting deadlocks, since most of the exhaustive connection attempts triggered a close/error callback before even an open callback, and all of those occurred close together.

Collaborator

Yes, I agree we don't need this anymore with the new monitoring added. Good catch.

Author

Removing the in-memory map and directly updating the database when a connection closes brings back this error:

Failed to update connection status for ws://72.44.58.240:6006/: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

I am increasing acquireConnectionTimeout to 2 minutes to see if it helps resolve this (see the config sketch below).
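
For reference, a sketch of where that knob lives in a knex config (the client, connection details, and pool sizes are illustrative assumptions, not this repo's actual setup):

import knex from 'knex'

const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    database: process.env.DB_DATABASE,
    password: process.env.DB_PASSWORD,
  },
  // Raised from knex's 60s default to ride out the burst of close/error
  // callbacks that compete for pool connections on startup.
  acquireConnectionTimeout: 120000,
  pool: { min: 2, max: 10 },
})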

Contributor

Thanks @Patel-Raj. Sorry, I didn't mean to dump this task onto you. I'm happy to take a look at this issue in a later PR. This was a tangential suggestion, not directly related to your goal.

Author

@ckeshava The suggested change and the acquireConnectionTimeout increase to 2 minutes are already in the PR, just in case you missed them in review.

*/
export async function getNodes(sinceStartDate: Date): Promise<WsNode[]> {
return query('crawls as c')
.leftJoin('connection_health as ch', 'c.public_key', 'ch.public_key')
Contributor

Since we are doing a join operation on the ch table, it becomes imperative to preserve all values of ch.public_key. I'm stating this because ch.ws_url is the PK of the ch table.

Author

I think, due to the reason mentioned in #337 (comment), we can have only ws_url as the primary key: at any one point in time there will be one connection per ws_url, since it contains IP and port information.

@@ -77,4 +83,26 @@ describe('Amendments', () => {
)
expect(amendmentStatus[0].eta).toBe(null)
})

Contributor

Can you add unit tests for the other new methods pertaining to handling the amendments?

Author

I have added the test "save Amendments ledger_entry", which tests fetchAmendmentsFromLedgerEntry to check that the data returned from ledger_entry is of the correct shape and gets saved properly.

For processEnableAmendmentTransaction, the data returned by the ledger command is the same as in the existing tests (e.g. ledgerResponseNoFlag, ledgerResponseGotMajority, etc.), so it already gets tested, as we are in turn calling the same method, handleWsMessageLedgerEnableAmendments.

There is no logic here in these methods. What else do you think can be tested here?

Contributor

I was referring to the checkAndHandleEnableAmendmentLedger method, which is newly added in this PR. Is this method indirectly tested via any other unit tests? (I don't believe that is the case)

Author

Patel-Raj11 commented May 2, 2025

checkAndHandleEnableAmendmentLedger is just a refactor of the previous if conditions to make them cleaner, and it's not exported, so I cannot test it in isolation. Also, the entire setHandlers is not tested and I don't have a clear idea of how to test it. We can take up the task of writing tests for setHandlers as a separate effort.

network: string,
hostName: string,
): Promise<void> {
const allUrls: string[] = []
Collaborator

Can we add logs at the beginning and end of this process?

Collaborator

Same for other processes as well. It would be something like starting ... and finished ..., as in the sketch below.
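
A hedged sketch of that suggestion (logger name and messages are illustrative, not code from this PR):

log.info(`Starting WebSocket connection setup for network ${network}`)
// ... establish connections for each URL in allUrls ...
log.info(`Finished WebSocket connection setup for network ${network}`)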

@Patel-Raj11 Patel-Raj11 requested review from ckeshava and pdp2121 April 30, 2025 13:21
@@ -17,6 +17,7 @@ DB_HOST=
DB_USER=
DB_DATABASE=
DB_PASSWORD=
ACQUIRE_CONNECTION_TIMEOUT=
Collaborator

Why do we need this to be an env variable instead of a constant?

Author

To remain consistent with the other database configs. Having it as an env variable lets us change it without changing the code, and lets us use different values for staging, dev, etc. (see the sketch below).
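
For instance (a minimal sketch; the 60-second fallback is an assumption matching knex's documented default):

const acquireConnectionTimeout = Number(
  process.env.ACQUIRE_CONNECTION_TIMEOUT ?? 60000,
)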

@@ -82,6 +82,17 @@ const info = {
example:
'https://data.xrpl.org/v1/network/amendments/vote/{network}/{identifier}',
},
{
Collaborator

Since this is for internal use, I don't think we need to expose it in info.

Author

@ckeshava If you don't feel strongly about having this and /health, I can remove it from here and ARCHITECTURE.md.

@pdp2121 Do we remove the /health endpoint from ARCHITECTURE.md as well? It's been there from the beginning.

Contributor

Even if it's for internal use, I feel good documentation is helpful for future development.

@pdp2121 Are there any concerns about privacy/security? Are there any disadvantages to exposing it?

example: 'https://data.xrpl.org/v1/health',
},
{
action:
Collaborator

ditto

`Websocket closed for ${ws.url} on ${
networks ?? 'unknown network'
} with code ${code} and reason ${reason.toString('utf-8')}.`,
`Websocket closed for ${
Collaborator

I'm not sure if we need this log now with your change. It seems to flood the logs atm.

Author

@pdp2121 This is only flooding the logs initially, when the service starts up. At any other point in time it will give valuable information about when a connection was terminated and for what reason. We can keep these two logs for now and remove them later if we don't see any value. What do you think?

`Websocket connection error for ${ws.url} on ${
networks ?? 'unknown network'
} - ${err.message}`,
`Websocket connection error for ${ws.url} on ${network} - ${err.message}`,
Collaborator

Same here. Since we try multiple ports for a URL, this will raise lots of errors that have no important implications.
