Expose WebSocket connection count as prometheus metrics and some refactors #337
base: main
Conversation
Should these metrics only be exposed with a token of some sort, so they're not public?
These metrics are similar to what we had before, but now we are making them network-specific and more accurate. I think they can remain public, as they are health check metrics. Do you see a particular reason to make this endpoint authenticated?
I would prefer these metrics to be tracked via Slack rather than exposed publicly via the API, since they are used for internal monitoring purposes.
Would be something similar to what we do with VCR:
These metrics are going to be consumed by Grafana, since they are exposed in Prometheus exposition format for health checks, and in turn we will receive Slack alerts from Grafana (as we do in the #ripplex-alerting-dge-non-prod Slack channel). In my opinion this is the better approach, since alerting and application logic stay decoupled and we can move to a different alerting application like PagerDuty without modifying application logic at all. Just my thoughts, what do you think? However, if exposing the number of connected nodes as a public API is a concern, are we going to deprecate the /health endpoint as well? That endpoint has been exposing the same metric, but in JSON format and for all nodes. I can create a separate PR if we opt for a push-based approach to send messages to Slack, since it would be cleaner to review separately.
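For illustration, the Prometheus exposition output that Grafana would scrape from /metrics/:network could look roughly like the sketch below; the metric name connected_nodes and its label are assumptions, not necessarily what this PR emits.

```typescript
// Sketch: render a connected-node count as a Prometheus gauge (names assumed).
function renderConnectedNodesMetric(network: string, count: number): string {
  return [
    '# HELP connected_nodes Number of nodes with an open WebSocket connection.',
    '# TYPE connected_nodes gauge',
    `connected_nodes{network="${network}"} ${count}`,
  ].join('\n')
}

// renderConnectedNodesMetric('main', 42) would yield a line such as:
// connected_nodes{network="main"} 42
```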
If that's the case then I'm okay with exposing this data publicly, since we have already listed the number of connected nodes in
if (await db().schema.hasColumn('crawls', 'ws_url')) {
  await db().schema.alterTable('crawls', (table) => {
    table.dropColumn('ws_url')
  })
}

if (await db().schema.hasColumn('crawls', 'connected')) {
  await db().schema.alterTable('crawls', (table) => {
    table.dropColumn('connected')
  })
}
What are the implications of deleting this data? The crawls table has been populated over several iterations. I'd imagine that it takes at least a few days to re-populate the Connection-Health table with the equivalent set of records.
We are removing the ws_url and connected columns from the crawls table. Those were populated when we created a connection to the node using the vhs-connections service. There are no implications of deleting this data. These columns in connection_health will be populated once the service starts.
select count(distinct public_key) from public.crawls c where c.ws_url is not null; // returns 1089 in current database
I'm trying to understand how many days are required for the connection_health table to collect all the rows which are already present in the current crawls table. Since the crawling process happens in an iterative manner, I expect that it will take at least a week to achieve parity with the current data.
From this perspective, should we work on migrating the data from the old crawls table to the new connection_health table? Does anybody know how database schema changes were handled in prior versions of VHS?
It should not take more than a couple of minutes after redeployment. We already have the crawls table, so whenever WebSocket connections are established we should have data populated into this table.
I don't think we need to migrate data here. We just need to remove that column in the crawls table on redeployment by calling a knex drop-column query. The connected column reflects the current WebSocket connection, so no data is being lost here.
The columns that are removed get filled up instantly in the connection_health table when the service starts, for the rows that we care about. The purpose of connection_health is different from the crawls table, so I don't see a reason to replicate the two and bring unnecessary rows from crawls into connection_health.
Just FYI, the code in the setup file will run once whenever the service restarts.
if (!hasConnectionHealth) {
  await db().schema.createTable('connection_health', (table) => {
    table.string('ws_url').primary()
    table.string('public_key').references('crawls.public_key')
Many nodes (as identified by their public_keys) use the same ws_url endpoint for communication. I have observed this empirically in the databases.
select count(distinct ws_url) from public.crawls c ; // returns 538
select count(distinct public_key) from public.crawls c where c.ws_url is not NULL; // returns 1087
If we choose ws_url as the primary key of this table, then the public_key column will need to accommodate multiple values.
Existing design:
While creating a connection, we pick those records from the crawls table having their start > now() - 10 minutes. So currently, if a node changes its public key (either due to upgrading its rippled server version, purposefully changing it, or switching from one network to another) and keeps listening on the same ws_url, a new row is inserted into the crawls table with a recent start date, since we crawl every 2 minutes.
Now, since we pick only the rows having a recent start date (i.e. less than 10 minutes old), the other old rows for the same ws_url are no longer relevant and become stale.
Design after introducing connection_health:
If a node switches its public key and keeps serving from the same ws_url, we would update the existing row in the connection_health table to depict the latest state, and since we have public_key mapped as a foreign key to public_key in the crawls table, we can always extract details of that node.
I don't see a reason to have multiple public keys for a ws_url in the connection_health table, as its purpose is to show the current state of connections.
For ref:
select ws_url from crawls where ws_url is not null group by ws_url having count(crawls.public_key)> 1 ;
select * from crawls where ws_url = 'ws://135.181.212.21:51233/';
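As a rough sketch of the "recent start" selection described above (the connection settings below are placeholders; VHS would use its own knex helper, and the actual query may differ):

```typescript
import knex from 'knex'

// Placeholder connection; the service's shared knex instance would be used instead.
const db = knex({ client: 'pg', connection: { host: 'localhost', database: 'vhs' } })

// Sketch: pick only crawls rows whose start is within the last 10 minutes.
async function getRecentlyCrawledNodes(): Promise<
  Array<{ public_key: string; ws_url: string }>
> {
  const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000)
  return db('crawls')
    .select('public_key', 'ws_url')
    .whereNotNull('ws_url')
    .where('start', '>', tenMinutesAgo)
}
```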
Should we make both ws_url and public_key primary? If two nodes share the same ws_url, we still open two WebSocket connections, which would create two rows in this table.
@pdp2121 ws_url comprises IP and port information, so two nodes will be serving from two different ports even if they run on a single machine, and ws_url would still be unique.
Moreover, a composite primary key should not contain a nullable column, and public_key is a candidate null column for the entries fetched from the networks table.
Makes sense to me
@Patel-Raj I'm still confused
I don't see a reason to have multiple public keys for a ws_url in connection_health
Can you explain the results of this query?
SELECT ws_url, COUNT(public_key) AS shared_public_key_count FROM crawls WHERE ws_url IS NOT null GROUP BY ws_url HAVING COUNT(public_key) > 1;

This result indicates that each of these 176 ws_url entries is shared by multiple unique public_keys. Your proposed design will overwrite the existing public_key information. Aren't we losing useful data in this process?
Now, since we pick only the rows having recent start date (i.e less then 10 mins old) the other old rows for same ws_url are no more relevant and stale.
Am I understanding correctly that rows older than 10 minutes are not useful in the crawls table? I'm working on a different task for optimizing the space usage of the database. I think we should purge old data and keep the crawls table lean.
The 176 ws_url entries have port information along with their protocol and IP, as seen in the screenshot. So at any one time a ws_url can have only one public_key (thus one node) actively listening for WebSocket connections.
Also, the multiple public keys that we see are from different start times. You can pick one of the ws_urls and query the crawls table to see the start times. So they are not being used at the same time.
I don't think we lose any information.
Yes, rows older than 10 minutes are stale and of no use. We can write a cleanup job to remove those.
@@ -227,6 +227,21 @@ interface AmendmentInfo {
  deprecated: boolean
}

interface ConnectionHealth {
  ws_url: string
  public_key?: string
Why is this parameter optional? If we do not have the public_key of a node, we will not be able to verify the correctness of its digital signatures. Am I missing any cases where we do not have a node's public key?
const networksDb = await getNetworks()
networksDb.forEach((network) => {
  nodes.push({
    ip: network.entry,
    ws_url: '',
    networks: network.id,
  })
})
The rows that are picked by this query don't have a public key, and this was the primary reason to introduce the connection_health table: so that we can keep track of connections that don't have a public key.
At one point I thought about removing this query and all the connections that we create from it altogether, but then we are left with only one node for Xahau Mainnet and that is not ideal (for now it's Xahau Mainnet, but it could occur with Testnet or other side chains in the future). Also, the crawls table is not a good place to store connection health, so having an optional public_key helps us track the entire state of connections.
)

if (connections.get(ws.url)?.url === ws.url) {
  connections.delete(ws.url)
I'm trying to understand the use of the connections: Map<string, WebSocket> map. I understand that it is a cached value representing a membership summary of the connection_health table.
Can't we simply read from the table instead of maintaining this cache? I believe databases are optimized to the hilt today. If we can defer this responsibility onto the db, we can simplify our code base.
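To make that suggestion concrete, here is a sketch of updating the table directly in the close handler with no in-memory map; the connection settings and the column names other than ws_url are assumptions, not this PR's schema.

```typescript
import knex from 'knex'

// Placeholder connection; the service's shared knex instance would be used instead.
const db = knex({ client: 'pg', connection: { host: 'localhost', database: 'vhs' } })

// Sketch: when a WebSocket closes, write the new state straight to connection_health.
async function markConnectionClosed(wsUrl: string): Promise<void> {
  await db('connection_health')
    .where({ ws_url: wsUrl })
    .update({ connected: false, status_update_time: new Date() })
}
```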
This is a good point. Let me try that out. Previously when I tried it, it turned out that we were bombarding the database with too many requests on startup and getting deadlocks, since most of the connection attempts ended in a close or error callback before even an open callback, and all of those occurred close to each other.
Yes, I agree we don't need this anymore with the new monitoring added. Good catch.
Removing the in-memory map and directly trying to update the database when the connection closes brings back this error:
Failed to update connection status for ws://72.44.58.240:6006/: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
I am trying to increase acquireConnectionTimeout to 2 minutes to see if it helps resolve this.
thanks @Patel-Raj. Sorry, I didn't mean to dump this task onto you. I'm happy to take a look at this issue in a later PR. This was a tangential suggestion, not directly related to your goal.
@ckeshava The suggested change and the acquireConnectionTimeout increase to 2 minutes are already in the PR, just in case you missed them while reviewing.
 */
export async function getNodes(sinceStartDate: Date): Promise<WsNode[]> {
  return query('crawls as c')
    .leftJoin('connection_health as ch', 'c.public_key', 'ch.public_key')
Since we are doing a join operation on the ch table, it becomes imperative to preserve all values of ch.public_key. I'm stating this because ch.ws_url is the PK of the ch table.
I think, due to the reason mentioned in #337 (comment), we can have only ws_url as the primary key: at any one point in time there will be only one connection for a ws_url, since it contains IP and port information.
@@ -77,4 +83,26 @@ describe('Amendments', () => {
    )
    expect(amendmentStatus[0].eta).toBe(null)
  })
Can you add unit tests for the other new methods pertaining to handling the amendments?
I have added the test save Amendments ledger_entry, which tests fetchAmendmentsFromLedgerEntry to check that the data returned from ledger_entry is of the correct shape and gets saved properly.
For processEnableAmendmentTransaction, the data returned by the ledger command is the same as in the existing tests that we have (e.g. ledgerResponseNoFlag, ledgerResponseGotMajority, etc.). So it already gets tested, as we are in turn calling the same method handleWsMessageLedgerEnableAmendments.
There is no logic in these methods. What else do you think can be tested here?
I was referring to the checkAndHandleEnableAmendmentLedger method, which is newly added in this PR. Is this method indirectly tested via any other unit tests? (I don't believe that is the case.)
checkAndHandleEnableAmendmentLedger is just a refactor of the previous if conditions to make them cleaner, and it's not exported, so I cannot test it in isolation. Also, the entire setHandlers is not tested and I don't have a clear idea of how to test it. We can take up writing tests for setHandlers as a separate effort.
  network: string,
  hostName: string,
): Promise<void> {
  const allUrls: string[] = []
Can we add logs at the beginning and end of this process?
Same for the other processes as well. It would be something like starting ... and finished ...
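Something along these lines, for instance (the function name, logger, and message wording are placeholders, not the code in this PR):

```typescript
// Stand-in for the service's logger.
const log = { info: (msg: string): void => console.log(msg) }

// Sketch: bracket a long-running process with "starting" and "finished" logs.
async function saveNodeUrls(network: string, hostName: string): Promise<void> {
  log.info(`Starting to save node URLs for ${network} (host: ${hostName})`)
  const allUrls: string[] = []
  // ... existing work that populates allUrls ...
  log.info(`Finished saving node URLs for ${network}; collected ${allUrls.length} URLs`)
}
```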
@@ -17,6 +17,7 @@ DB_HOST=
DB_USER=
DB_DATABASE=
DB_PASSWORD=
ACQUIRE_CONNECTION_TIMEOUT=
Why do we need this to be an env variable instead of a constant?
To remain consistent with the other database configs. Having it as an env variable lets us change it without changing the code and use a different value for staging, dev, etc.
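For context, acquireConnectionTimeout is a standard knex option (in milliseconds, defaulting to 60000). A sketch of wiring the env variable into the knex config could look like this; the file layout and fallback value are assumptions:

```typescript
import knex from 'knex'

// Sketch: read ACQUIRE_CONNECTION_TIMEOUT from the environment,
// falling back to knex's default of 60000 ms when it is not set.
const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    database: process.env.DB_DATABASE,
    password: process.env.DB_PASSWORD,
  },
  acquireConnectionTimeout: Number(process.env.ACQUIRE_CONNECTION_TIMEOUT ?? 60000),
})
```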
@@ -82,6 +82,17 @@ const info = {
    example:
      'https://data.xrpl.org/v1/network/amendments/vote/{network}/{identifier}',
  },
  {
Since this is for internal use, I don't think we need to expose it in info.
Even if it's for internal use, I feel good documentation is helpful for future development.
@pdp2121 Are there any concerns about privacy/security? Are there any disadvantages to exposing it?
  example: 'https://data.xrpl.org/v1/health',
},
{
  action:
ditto
`Websocket closed for ${ws.url} on ${
  networks ?? 'unknown network'
} with code ${code} and reason ${reason.toString('utf-8')}.`,
`Websocket closed for ${
I'm not sure if we need this log now with your change. It seems to flood the logs atm.
@pdp2121 This only floods the logs initially when the service starts up. At any other point in time it will give valuable information about when the connection was terminated and for what reason. We can keep these two logs for now and remove them later if we don't see any value. What do you think?
`Websocket connection error for ${ws.url} on ${
  networks ?? 'unknown network'
} - ${err.message}`,
`Websocket connection error for ${ws.url} on ${network} - ${err.message}`,
Same here. Since we try multiple ports for a URL, this will raise lots of errors that have no important implications.
High Level Overview of Change
This PR accomplishes the following two tasks:
- Remove the ws_url and connected columns from the crawls table and add a connection_health table to capture their true values.
- Expose a /metrics/:network endpoint that reports the number of connected nodes per network in Prometheus exposition format.
Context of Change
- The ws_url and connected columns in the crawls table were not capturing all the active WebSocket connections, since we were also creating additional WebSocket connections using the entry column of the networks table. This change adds a connection_health table that captures all the WebSocket connections and their status update time. It also fixes the bug of the connected column of crawls remaining stale even after losing a WebSocket connection.
- Adds a /metrics/:network endpoint that can be queried by Grafana to capture the number of connected nodes for each network and configure alerts.
- Refactors parts of setHandlers in order to make setHandlers cleaner.
Type of Change
Before / After
No change in the working of validator-history-service.
Test Plan
Added tests in the /tests directory to check the /metrics/:network endpoint, database updates, selects, and the util methods that were introduced by this change.
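For illustration, a test for the new endpoint might look roughly like the sketch below, assuming supertest and an exported Express app; the import path, URL prefix, and assertions are assumptions rather than the tests actually added in this PR.

```typescript
import request from 'supertest'

import app from '../src/api' // hypothetical path to the Express app

describe('/metrics/:network', () => {
  it('returns metrics in Prometheus exposition format', async () => {
    const response = await request(app).get('/v1/metrics/main')
    expect(response.status).toBe(200)
    // Expect at least one TYPE line from the exposition format.
    expect(response.text).toContain('# TYPE')
  })
})
```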
I will deploy this branch on staging to check that it works before releasing to production.