
Conversation

@ckeshava
Contributor

@ckeshava ckeshava commented Nov 7, 2025

High Level Overview of Change

Use the network_id value from the Validation subscription stream. This PR obviates the need for the complex chain-assignment logic in the chains.updateChains method.

Context of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

Before / After

There are no changes to the externally visible APIs.

Test Plan

Existing unit tests pass with this refactor. Additional unit tests have been added to test the chains logic.

@pdp2121
Collaborator

pdp2121 commented Nov 12, 2025

Just a few initial things:

  • We should save the networkID into the networks table.
  • We can remove Chain related fields, storing validations in a map (key: networkID) and process on an hourly basis.
  • Please make sure to add unit tests

@ckeshava
Contributor Author

ckeshava commented Nov 19, 2025

Correctness Report for this PR

The new changes have been deployed in the VHS staging environment since Nov 15, 2025. This post records the number of missed ledgers since the VHS logic was updated to make use of the network_id field.

The missed ledgers are recorded by the error log: Ledgers are not consecutive. Void between indices: .... Please look at the attached log files for more examples.
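For reference, a "missed ledger" here is simply a break in otherwise consecutive ledger indices. Below is a minimal sketch of such a check with hypothetical names; the actual VHS implementation and its log wording live in the chains module:

```typescript
// Illustrative only -- counts missing ledgers from the indices seen in a window.
function countMissedLedgers(indices: number[]): number {
  const sorted = [...indices].sort((a, b) => a - b)
  let missed = 0
  for (let i = 1; i < sorted.length; i += 1) {
    const gap = sorted[i] - sorted[i - 1] - 1
    if (gap > 0) {
      missed += gap
      console.error(
        `Ledgers are not consecutive. Void between indices: ${sorted[i - 1]} and ${sorted[i]}`,
      )
    }
  }
  return missed
}

// Example: countMissedLedgers([100, 101, 103]) logs one void and returns 1.
```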

| Date | Number of missed ledgers |
| --- | --- |
| 15 Nov 2025 | 4 |
| 16 Nov 2025 | 1 |
| 17 Nov 2025 | 2 |
| 18 Nov 2025 | 6 |
| 19 Nov 2025 | 6 |
| 25 Nov 2025 | 13 |
| 26 Nov 2025 | 18 |
| 27 Nov 2025 | 11 |
| 28 Nov 2025 | 3 |
| 29 Nov 2025 | 21 |
| 30 Nov 2025 | 9 |
| 1 Dec 2025 | 6 |
| 2 Dec 2025 | 14 |

Note: The data for 19 Nov 2025 is measured until 10.20 AM PST.

Possible reasons for the loss of ledgers in VHS

  1. If the validationReceived messages do not have the network_id field, such messages are discarded (a minimal sketch of this guard follows the list). This reduces the contributing validators to those that have upgraded to rippled v2.6.0 or newer. Since there are fewer such validators, the missed-ledger count is more sensitive to WebSocket connection flakiness.
  2. WebSockets are refreshed only once every hour, with an exponential backoff strategy. If a connection drops between refreshes, we lose validation data from that node/validator. As documented in another PR, there is known work in progress around the lifecycle of WebSockets in the VHS.
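To make reason 1 concrete, here is a minimal sketch of the kind of guard described above; the handler name and message shape are assumptions for illustration, not the exact VHS code:

```typescript
interface ValidationMessage {
  network_id?: number
  ledger_hash: string
  ledger_index: string
}

// Hypothetical handler: validations without a network_id (i.e. from
// rippled versions older than v2.6.0) are dropped and never reach the
// chains/agreement logic.
function handleValidation(validation: ValidationMessage): void {
  if (validation.network_id === undefined) {
    return
  }
  // ... continue processing, keyed by validation.network_id ...
}
```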

I will continue to monitor and investigate the precise reasons contributing to the loss of validations and ledgers.

Merits of this PR

Here is a comparison between the performance of the staging environment (with the current PR changes) and the production VHS (without any of these PR changes):

Total number of ledgers processed in VHS prod:
image

Total number of ledgers processed in VHS staging:
image

Note: I have noticed two days where prod-VHS recorded ~40 more ledgers than stg-VHS. This can be attributed to reason 1 explained above. However, I'd be remiss not to mention the advantages of this PR: on Nov 22, 2025 the staging environment captured ~3000 more ledgers than prod-VHS, and Nov 20, 2025 shows a similar improvement (~500 ledgers).

Appendix

The below files contain the relevant logs for each of the listed days:
Missed_Ledgers_Nov15_VHS_Stg.txt
Missed_Ledgers_Nov16_VHS_Stg.txt
Missed_Ledgers_Nov17_VHS_Stg.txt
Missed_Ledgers_Nov18_VHS_Stg.txt
Missed_Ledgers_Nov19_VHS_Stg.txt

2025-11-26-Missing-Ledgers.txt
2025-11-27-Missing-Ledgers.txt
2025-11-28-Missing-Ledgers.txt
2025-11-29-Missing-Ledgers.txt
2025-11-30-Missing-Ledgers.txt
2025-12-01-Missing-Ledgers.txt
2025-12-02-Missing-Ledgers.txt

cc @pdp2121 @kuan121

@ckeshava ckeshava requested review from kuan121 and pdp2121 November 21, 2025 20:52
@ckeshava
Contributor Author

Just a few initial things:

  • We should save the networkID into the networks table.

I have added the network_id field to both the networks and crawls tables. 699c0a6

  • We can remove Chain related fields, storing validations in a map (key: networkID) and process on an hourly basis.

^^ Which map are you referring to in this comment? I'm aware of the ledgersByHash map in the Chains module. However, I'm not quite sure how network_id fits into that picture.

  • Please make sure to add unit tests

Unit tests have been added to validate the new changes to the code base.


setInterval(async () => {
log.info(`${await getTotalConnectedNodes()} connections established`)
log.warn(`${await getTotalConnectedNodes()} connections established`)

Is this a warning log?

Contributor Author

reverted this change in c2e7c26

this.updateChains(ledger)
}

// Note: The below loop is purely used to debug the XRPL Mainnet. There are no functional side-effects from this loop.

Do you plan to have these lines of code in production forever? Or are they just for debugging purposes and will be removed eventually?

Contributor Author

I'd like to keep them in production for ease of debugging. They output log lines to Grafana/the k8s pod/kubectl, and they have very little impact on the latency of the VHS's responsibilities.

@kuan121 kuan121 Nov 25, 2025

latency of responsibilities of the VHS

Latency isn’t the issue. My concerns are:

  1. We’re introducing special-case logic for specific networks and validators.
  2. The method calculateChainsFromLedgers is intended to perform calculations, but it’s now filled with debugging-related logging code. Have we considered better encapsulation—for example, moving this logic into a separate function, optionally in its own file, and calling it from here?

Contributor Author

We’re introducing special-case logic for specific networks and validators.

Yes, we are introducing special logic for the XRPL Mainnet, owing to its importance to the ecosystem. If logs were emitted for all the other XRPL networks, it would be tedious to monitor and debug them.

There is no special handling of specific validators in this case.

Contributor Author

The method calculateChainsFromLedgers is intended to perform calculations, but it’s now filled with debugging-related logging code. Have we considered better encapsulation—for example, moving this logic into a separate function, optionally in its own file, and calling it from here?

Ok, I have incorporated this suggestion here: ac7b815
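For illustration, the encapsulation could look roughly like the sketch below; the helper name and signature are hypothetical and may differ from what ac7b815 actually does:

```typescript
// debug-mainnet.ts (hypothetical helper module)
const MAINNET_NETWORK_ID = 0

interface Logger {
  info: (msg: string) => void
}

// Emit Mainnet-only debug output so calculateChainsFromLedgers stays
// focused on chain calculations.
export function logMainnetLedgerRange(
  networkId: number,
  minIndex: number,
  maxIndex: number,
  log: Logger,
): void {
  if (networkId !== MAINNET_NETWORK_ID) {
    return
  }
  log.info(
    `Over the previous hour, VHS processed ledgers between indices: ${minIndex} to ${maxIndex} (inclusive) on XRPL Mainnet.`,
  )
}
```

calculateChainsFromLedgers can then call this helper, keeping the debugging concern out of the calculation path.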

Comment on lines 98 to 101
if (
validation.network_id === 0 &&
Number(validation.ledger_index) < 100000000
) {
@kuan121 kuan121 Nov 24, 2025

This will never happen since we have already passed 100 million ledgers, right?

Contributor Author

I had observed this situation on a development branch of the VHS on my local machine. However, I'm not able to show you any examples in the stg environment yet.

Unfortunately, I closed the terminal which contained this output on my local computer.

Some validator was replaying XRPL Mainnet ledgers from 2013. Since those validations contained the correct network_id: 0 and valid cryptographic signatures, they were polluting the ledger-space of all the Mainnet validators.


Thanks for providing the context.

Some validator was replaying ledgers from 2013 from the XRPL Mainnet.

Using Number(validation.ledger_index) < 100000000 can't solve this replay issue, since 100,000,000 will itself soon be considered old.

that was polluting the ledger-space of all the Mainnet validators.

I’m not fully clear on this part. Do you mind elaborating it with an example?
If this kind of replay has always been possible, what changed that makes it a concern now?

Contributor Author

Using Number(validation.ledger_index) < 100000000 can't solve this replay issue since 100,000,000 will soon be considered as old.

You are right; I'm trying to think of a more elegant solution to this problem. The chains.current data member would be a good candidate to use here. However, that would need a significant refactor of the code.

I’m not fully clear on this part. Do you mind elaborating it with an example?

Suppose you pick up the XRPL Mainnet ledger whose index is 80_000. You already know the contents of the ledger -- the sequence number, the order and content of the transactions, etc. You now create a new validator (a public master_key and private-key pair) and sign this ledger. You then broadcast this signed, validated ledger through the XRPL overlay network.

Note: This does not violate any rules of the XRPL consensus protocol. However, for the purposes of the VHS, it injects a very old ledger (index 80_000) into the current set of XRPL Mainnet ledgers. Hence, the VHS must consciously discard such old ledgers.

If this kind of replay has always been possible, what changed that makes it a concern now?

This is a bug in VHS that has existed since its inception. I discovered it a few days ago while generating the correctness report for this PR. I believe such old ledgers have always contaminated the agreement score calculation for the VHS.

Collaborator

This seems a bit ad hoc to me. Maybe we can consider removing ledgers that fall outside a certain range of the last processed ledger? I'm just trying to think of a more elegant way here.

Contributor Author

I have come up with a more general solution here: 0625e3a

There are no hard-coded numbers anymore.
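As a rough sketch of the range-based idea, expressed relative to what the chain is already tracking rather than a hard-coded constant; the names here are hypothetical, and the actual logic lives in 0625e3a:

```typescript
// Discard validations that reference ledgers older than anything the
// chain currently tracks for that network (e.g. replayed 2013 ledgers).
function isStaleValidation(
  ledgerIndex: number,
  trackedIndices: number[],
): boolean {
  if (trackedIndices.length === 0) {
    return false // nothing to compare against yet; accept the validation
  }
  const oldestTracked = Math.min(...trackedIndices)
  return ledgerIndex < oldestTracked
}
```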

@ckeshava ckeshava requested a review from kuan121 November 24, 2025 22:28
@pdp2121
Collaborator

pdp2121 commented Nov 25, 2025

There are about 30% of validators not on v2.6.0+, how will this affect our calculation?

@ckeshava
Contributor Author

There are about 30% of validators not on v2.6.0+, how will this affect our calculation?

The changes in this PR discard validations if they do not contain the network_id field. Validators that have not upgraded will not contribute to the VHS's validation stream.

However, this does not necessarily mean that 30% of all unique validations are discarded by the VHS; this is explained in the correctness report above.

Even if duplicate validations are discarded, that does not affect the consensus process. However, unique validations (unique by their signing keys) could be discarded due to the lack of a network_id field.

@pdp2121
Collaborator

pdp2121 commented Dec 2, 2025

There's also validator network categorization in these parts:
https://github.com/ripple/validator-history-service/blob/main/src/connection-manager/wsHandling.ts#L93-L107
https://github.com/ripple/validator-history-service/blob/main/src/connection-manager/agreement.ts#L249-L258
Validators are using the networks column instead of network_id.
Should this be changed?

@ckeshava
Contributor Author

ckeshava commented Dec 2, 2025

There's also validators network categorization in these parts: https://github.com/ripple/validator-history-service/blob/main/src/connection-manager/wsHandling.ts#L93-L107 https://github.com/ripple/validator-history-service/blob/main/src/connection-manager/agreement.ts#L249-L258 Validators are using networks column instead of network_id Should this be changed?

Yes, I considered this. I felt it could wait until a later date, when users express a need for a change in this table.

As far as I know, this value is not consumed within the VHS or other well-known downstream services. The change would amount to adding a network_id column and deprecating the networks column. However, that introduces a BREAKING CHANGE in the validators API.

Hence, I feel we can wait a few more months to ensure that the agreement scores and ledgers are computed correctly.

…sed on historical records of the validators' public key
// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition -- older rippled binaries do not return a network_id field
if (validation.network_id === undefined) {
const networkNameToChainID = new Map<string, number>()
networkNameToChainID.set('mainnet', 0)
Collaborator

You can use the networks table for this instead of hardcoding. It will be cleaner and will also ensure that we have all the networks.

Contributor Author

ok. I have updated it in this commit: d4f8cea
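A hedged sketch of a table-driven lookup along these lines; the row shape and the getNetworks helper are assumptions for illustration and may not match d4f8cea exactly:

```typescript
interface NetworkRow {
  id: string // network name, e.g. 'mainnet'
  network_id: number | null
}

// Resolve a network_id for validations from older rippled versions by
// consulting the networks table instead of a hard-coded map.
async function resolveNetworkId(
  networkName: string,
  getNetworks: () => Promise<NetworkRow[]>,
): Promise<number | undefined> {
  const networks = await getNetworks()
  const match = networks.find((network) => network.id === networkName)
  return match?.network_id ?? undefined
}
```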

entry: string
unls: string[]
// Context about the network_id field in rippled: https://xrpl.org/docs/references/protocol/transactions/common-fields#networkid-field
network_id: number
Collaborator

@pdp2121 pdp2121 Dec 4, 2025

Maybe we should try to get the network_id for existing networks (in the networks table) that do not belong to this file, in setup.ts, using server_state or crawl?

Contributor Author

Yes, I'm doing that in this file: src/api/routes/v1/get-network.ts. Please refer to line 254. Here is a snippet:

    // check if node public key is already recorded
    const { public_key, network_id } = crawl.this_node
    const publicKeyNetwork = await getNetworkFromPublicKey(public_key)
    if (publicKeyNetwork != null) {
      return res.status(200).send({
        result: 'success',
        network: publicKeyNetwork,
      })
    }
    // add node to networks list
    const newNetwork = await addNode(entryUrl, node_unl, port, network_id)

Contributor Author

On a more general note, I have updated the crawler to include the network_id information into the crawls table. Please refer to this commit: 699c0a6

The reference file is src/crawler/network.ts.

Collaborator

@pdp2121 pdp2121 Dec 17, 2025

The get_network call only adds new networks when a user types a valid URL here (which triggers API calls to the VHS):
https://custom.xrpl.org/
If no one does that for an existing network in the table, then the network_id entry for that row will be empty.

Contributor Author

I am updating the network_id values during the crawl process. Here is a snippet of the relevant update:
image

On a more general note, I have updated the crawler to include the network_id information into the crawls table. Please refer to this commit: 699c0a6

^^ This commit offers more context about this change.

This change ensures that the crawls will always be up-to-date.

With regard to the networks table:
This is the only location in the codebase where we insert into the networks table (all the other insertions are accompanied by the network_id field). As highlighted in this picture, we immediately invoke the crawl, which populates the network_id value from the server_state response.

This ensures that all entries in the networks table will have an associated network_id value.

image
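As a rough sketch of the kind of update described here, assuming a knex-style query builder and a networks table keyed by its entry URL (the exact code is in 699c0a6 and src/crawler/network.ts):

```typescript
import type { Knex } from 'knex'

// Persist the network_id reported by a crawled node's server_state
// response onto the matching row in the networks table.
async function updateNetworkId(
  db: Knex,
  entryUrl: string,
  networkId: number,
): Promise<void> {
  await db('networks')
    .where({ entry: entryUrl })
    .update({ network_id: networkId })
}
```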

Collaborator

@pdp2121 pdp2121 Dec 17, 2025

crawlNode saves to the crawls table, not the networks table. Also, as I mentioned, existing rows in the networks table outside of what we manually added in the networks.ts file will not be updated. (addNode only happens on /network/get_network/:entryUrl for a network that does not already exist in the networks table.)

Contributor Author

There are only two places in the codebase where we insert into the networks table: the addNode method and the setup.ts file. To maintain parity with the existing code base, I have ensured that the network_id field is set correctly in both places.
image

You are right: even if we encounter a new network, it is not recorded in the networks table. However, this maintains parity with the current production VHS code. As part of this PR, I would not recommend adding new places that insert into the networks table.

…cle. Although the scores are stored in the hourly_agreement table, they are not refreshed into the daily_agreement table. This commit fixes this issue and provides a unit test that demonstrates the problematic case
…e mapping between network-names and network_id

- remove the test that validates the total count of ledgers over 24 hours. This logic is already covered by the tests test/chains/compute-total-ledgers-per-hour.test.ts and test/chains/turn-of-day-agreement.test.ts. Further, this test pollutes the test logs with several hundred lines, which decreases readability.
@ckeshava
Contributor Author

ckeshava commented Dec 16, 2025

Comparison between VHS and Clio: The below table compares the total number of closed ledgers as counted by the VHS and Clio respectively.

| Date / Day | Ledgers recorded in Clio (Ground Truth) | Ledgers recorded in VHS (current PR) | Missed Ledgers in current PR | Ledgers recorded by prod-VHS | Missed Ledgers in prod-VHS |
| --- | --- | --- | --- | --- | --- |
| Sunday, December 14, 2025 | 22126 | 21931 | 195 | 21206 | 920 |
| Saturday, December 13, 2025 | 22098 | 21976 | 122 | 21175 | 923 |
| Friday, December 12, 2025 | 22080 | 21962 | 118 | 19673 | 2407 |
| Thursday, December 11, 2025 | 22198 | 22076 | 122 | 19320 | 2878 |

We do not know the cause for the difference of ~100 ledgers every day. Here are two possible reasons:

  1. WebSockets are refreshed only once every hour, with an exponential backoff strategy. If a connection drops between refreshes, we lose validation data from that node/validator. As Identify stale WebSockets #394 demonstrates, there are known issues in the lifecycle management of WebSockets in the VHS.
  2. The number of missing ledgers has drastically increased over the last week. As shown in the earlier table, Nov 15, 2025 - Dec 7, 2025 had <20 missed ledgers per day. This could be due to fluctuations in the XRPL Mainnet network infrastructure or a hitherto unknown cause.

However, we need to compare the benefits offered by this PR against the current VHS production environment. This PR offers a ~10x improvement over prod-VHS, because prod-VHS misses nearly 1000 ledgers per day.

The details of this issue can be found in this solution commit: 636a7bb. Please refer to the last column of the above table to understand the benefits of this PR.

cc @kuan121 @pdp2121

@pdp2121
Collaborator

pdp2121 commented Dec 17, 2025


Did we record the starting and ending ledgers for each day for each service/version? I think that would give us a better view of why there's some discrepancies between Clio and ours

@pdp2121
Collaborator

pdp2121 commented Dec 17, 2025

Also, I don't think this pull request is meant to improve accuracy, since we just changed the way we identify networks, which cleans up the code and is also useful for later improvements (like the queue). I think the discrepancies between staging and production might be due to something else (e.g. fewer restarts in staging than in prod, etc.).

start: yesterdayStart,
end: yesterdayEnd,
})
// Note: saveDailyAgreement method is idempotent. There is no harm in invoking it for today and previous day.
Collaborator

Any specific reasons for doing it for yesterday?

Contributor Author

I used yesterday for variable-naming purposes only; it does not refer to any fixed point in time. I am accounting for the ledgers obtained during 2300-2359 hours of every day.

To ensure that these ledgers are not missed, we need to run this computation once more for the previous day's date.
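A minimal sketch of that intent, assuming an idempotent saveDailyAgreement(range) as in the code comment above; the function name and UTC date arithmetic here are illustrative:

```typescript
interface AgreementRange {
  start: Date
  end: Date
}

// Roll hourly agreement into daily agreement for both today and the
// previous day, so validations received between 23:00 and 23:59 are
// still rolled up after the day turns over. saveDailyAgreement is
// assumed to be idempotent, so invoking it twice for a day is harmless.
async function rollUpDailyAgreement(
  saveDailyAgreement: (range: AgreementRange) => Promise<void>,
): Promise<void> {
  const now = new Date()
  const todayStart = new Date(now)
  todayStart.setUTCHours(0, 0, 0, 0)

  const yesterdayStart = new Date(todayStart)
  yesterdayStart.setUTCDate(yesterdayStart.getUTCDate() - 1)
  const yesterdayEnd = new Date(todayStart.getTime() - 1)

  await saveDailyAgreement({ start: yesterdayStart, end: yesterdayEnd })
  await saveDailyAgreement({ start: todayStart, end: now })
}
```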

Collaborator

I see

Contributor Author

Until now, the VHS has not been counting the ledgers processed between 2300-2359 hours. I have also demonstrated this bug via this unit test: https://github.com/ripple/validator-history-service/pull/396/changes#diff-88d9c84c5ab68911e6db737a6734d26cdea22cb5f203b4b3d3bb38007e53815c

This test fails with the current production VHS code.

Collaborator

I see, great fix!

@ckeshava
Contributor Author

Did we record the starting and ending ledgers for each day for each service/version? I think that would give us a better view of why there's some discrepancies between Clio and ours

I have attached the relevant log files, which record all the validated ledgers processed every hour in the said time period. The per-hour logs are very useful for correlating potential issues in the VHS. However, recording per-day start and end ledgers for each service would vastly expand the scope of this PR. I agree it needs to be done in future work.

Here is an example log file:
{"name":"chains","hostname":"vhs-connections-5d94d985f6-mmpv2","pid":1,"level":30,"msg":"Over the previous hour, VHS processed ledgers between indices: (100866284 to 100867205 (inclusive)] on XRPL Mainnet.","time":"2025-12-14T09:04:11.131Z","v":0}

Also I don't think this pull request is meant to improve the accuracy since we just changed the way we identify network, which cleans up the code and is also useful for later improvement (like the queue).

I identified a critical bug in a computation within the VHS logic. This commit has more details: 636a7bb.

I think the discrepancies between what's in staging and production might be due to something else (e.g. less number of restarts between staging and prod, etc.)

As shown in the below image, there have been zero pod restarts on Dec 14, 2025 and Dec 15, 2025 in the VHS-prod environment. However, the ledger counts fall short of the actual number due to the above bug.

image

Dec-11-2025-Logs.txt
Dec-12-2025-Logs.txt
Dec-13-2025-Logs.txt
Dec-14-2025-Logs.txt

@ckeshava ckeshava merged commit 3c4c556 into ripple:main Dec 22, 2025
3 checks passed
@ckeshava
Contributor Author

Thank you for the reviews. @kuan121 @pdp2121
