feat: Use network_id in the Validation subscription stream
#396
Conversation
…e, update the chains.ts module to make use of network_id value, instead of custom chain.id values
…ule has >91% stmt test coverage
Just a few initial things:
… ledger instead of duplicate ledgers
Correctness Report for this PR
The new changes have been deployed on the VHS staging area since Nov 15, 2025. This post records the number of missed ledgers since the VHS logic was updated to make use of network_id. The missed validators are recorded by the error log:
Note: The data for 19 Nov 2025 is measured until 10.20 AM PST.
Possible reasons for the loss of ledgers in VHS
I will continue to monitor and investigate the precise reasons contributing to the loss of validations and ledgers.
Merits of this PR
Here is a comparison between the performance of the staging environment (with the current PR changes) and the production VHS (without any of these PR changes):
Total number of ledgers processed in VHS prod:
Total number of ledgers processed in VHS staging:
Note: I have noticed two days where the prod-VHS has more ledgers than the stg-VHS by ~40 ledgers. This can be attributed to reason #1 explained above. However, I'd be remiss not to mention the advantages of this PR. The row for Nov 22, 2025 shows that this PR captured ~3000 more ledgers than the ones recorded by the VHS-prod environment. The same can be observed on Nov 20, 2025 as well (an improvement of ~500 ledgers).
Appendix
The below files contain the relevant logs for each of the listed days:
2025-11-26-Missing-Ledgers.txt
I have added the network_id field into both the networks and crawls tables. 699c0a6
^^ Which map are you referring to in this comment? I'm aware of
Unit tests have been added to validate the new changes to the code base.
  setInterval(async () => {
-   log.info(`${await getTotalConnectedNodes()} connections established`)
+   log.warn(`${await getTotalConnectedNodes()} connections established`)
Is this a warning log?
reverted this change in c2e7c26
src/connection-manager/chains.ts
Outdated
  this.updateChains(ledger)
}

// Note: The below loop is purely used to debug the XRPL Mainnet. There are no functional side-effects from this loop.
Do you plan to have these lines of code in production forever? Or are they just for debugging purposes, to be removed eventually?
I'd like to keep them in production for ease of debugging. They output log lines into Grafana/k8s-pod/kubectl, and they have very little impact on the latency of the VHS's responsibilities.
latency of the VHS's responsibilities
Latency isn’t the issue. My concerns are:
- We’re introducing special-case logic for specific networks and validators.
- The method calculateChainsFromLedgers is intended to perform calculations, but it’s now filled with debugging-related logging code. Have we considered better encapsulation—for example, moving this logic into a separate function, optionally in its own file, and calling it from here?
We’re introducing special-case logic for specific networks and validators.
Yes, we are introducing special logic for the XRPL Mainnet, owing to its importance to the ecosystem. If logs are emitted for all the other XRPL networks, it becomes tedious to monitor and debug them.
There is no special handling of specific validators in this case.
The method calculateChainsFromLedgers is intended to perform calculations, but it’s now filled with debugging-related logging code. Have we considered better encapsulation—for example, moving this logic into a separate function, optionally in its own file, and calling it from here?
Ok, I have incorporated this suggestion here: ac7b815
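For readers following along, here is a minimal sketch of the kind of encapsulation applied in ac7b815. The Ledger shape, helper name, and log format are illustrative assumptions, not the PR's actual code:

// Hypothetical helper isolating the mainnet-only debug logging that used
// to live inline in calculateChainsFromLedgers.
interface Ledger {
  network_id: number
  ledger_index: number
  validation_public_keys: string[]
}

const MAINNET_NETWORK_ID = 0

// Purely observational: emits log lines (picked up by Grafana/kubectl)
// and has no functional side effects on the chain calculation.
function logMainnetLedgerDebugInfo(ledger: Ledger): void {
  if (ledger.network_id !== MAINNET_NETWORK_ID) {
    return
  }
  console.info(
    `[debug] mainnet ledger ${ledger.ledger_index}: ` +
      `${ledger.validation_public_keys.length} validations observed`,
  )
}

// calculateChainsFromLedgers then reduces to a single call:
// logMainnetLedgerDebugInfo(ledger)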
if (
  validation.network_id === 0 &&
  Number(validation.ledger_index) < 100000000
) {
This will never happen since we have already passed 100 million ledgers, right?
I had observed this situation on a development branch of the VHS on my local computer. However, I'm not able to show you any examples on the stg environment yet.
Unfortunately, I closed the terminal which contained this output on my local computer.
Some validator was replaying ledgers from 2013 on the XRPL Mainnet. Since they contained the correct network_id: 0 and valid cryptographic signatures, they were polluting the ledger-space of all the Mainnet validators.
Thanks for providing the context.
Some validator was replaying ledgers from 2013 on the XRPL Mainnet.
Using Number(validation.ledger_index) < 100000000 can't solve this replay issue, since 100,000,000 will soon be considered old.
they were polluting the ledger-space of all the Mainnet validators.
I’m not fully clear on this part. Do you mind elaborating it with an example?
If this kind of replay has always been possible, what changed that makes it a concern now?
Using Number(validation.ledger_index) < 100000000 can't solve this replay issue, since 100,000,000 will soon be considered old.
You are right; I'm trying to think of a more elegant solution to this problem. The chains.current data member is a good candidate to use here. However, that would need a significant refactor of the code.
I’m not fully clear on this part. Do you mind elaborating it with an example?
Suppose you pick up the XRPL Mainnet ledger whose index is 80_000. You already know the contents of the ledger -- the sequence number, the order and content of transactions, etc. You now create a new validator (a public master_key and private key pair) and sign this ledger. You then broadcast this signed, validated ledger through the XRPL overlay network.
Note: This does not violate any rules of the XRPL consensus protocol. However, for the purposes of the VHS, it includes a really old ledger (whose index is 80_000) in the current set of XRPL Mainnet ledgers. Hence, the VHS must consciously discard such old ledgers.
If this kind of replay has always been possible, what changed that makes it a concern now?
This is a bug in VHS that has existed since its inception. I discovered it a few days ago while generating the correctness report for this PR. I believe such old ledgers have always contaminated the agreement score calculation for the VHS.
This seems a bit ad hoc to me. Maybe we can consider removing ledgers that fall outside a certain range of the last processed ledger? I'm just trying to think of a more elegant way here.
I have come up with a more general solution here: 0625e3a
There are no hard-coded numbers anymore.
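As a rough illustration of the range-based idea (not the exact code in 0625e3a), here is a sketch in TypeScript; latestIndexByNetwork and MAX_LEDGER_LAG are hypothetical names and the lag value is illustrative:

// Latest ledger index observed per network, in the spirit of chains.current.
const latestIndexByNetwork = new Map<number, number>()

// Tolerated distance behind the newest ledger before a validation is
// treated as a replay of an old ledger.
const MAX_LEDGER_LAG = 20

function isStaleValidation(networkId: number, ledgerIndex: number): boolean {
  const latest = latestIndexByNetwork.get(networkId)
  if (latest === undefined || ledgerIndex > latest) {
    // First ledger seen for this network, or a newer one: record and accept.
    latestIndexByNetwork.set(networkId, ledgerIndex)
    return false
  }
  // Reject validations trailing far behind the newest ledger, without any
  // hard-coded absolute index like 100_000_000.
  return latest - ledgerIndex > MAX_LEDGER_LAG
}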
About 30% of validators are not on v2.6.0+; how will this affect our calculation?
The changes in this PR discard validations if they do not contain the network_id field. However, this does not necessarily mean that 30% of all unique validations are discarded by the VHS; this is explained in the correctness report above. Even if duplicate validations are discarded, that does not affect the consensus process. However, unique validations (unique based on their signing keys) could be discarded, due to the lack of a network_id field.
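A minimal sketch of the gate described above, assuming a validation message shape with an optional network_id (the field is absent on rippled builds older than v2.6.0); the names here are illustrative:

interface ValidationMessage {
  network_id?: number
  ledger_hash: string
  validation_public_key: string
}

// Validations from pre-v2.6.0 rippled binaries omit network_id and are
// discarded before any chain assignment happens.
function shouldProcessValidation(validation: ValidationMessage): boolean {
  return validation.network_id !== undefined
}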
…s; No changes to the PrimaryKey structure of the tables
There's also validator network categorization in these parts:
Yes, I considered this. I felt it could wait until a later date, when users express a need for a change in this table. As far as my knowledge goes, this value is not consumed within the VHS or other well-known downstream services. It would be an addition of the network_id field. Hence, I feel we can wait for a few more months to ensure that the agreement scores and ledgers are computed correctly.
…sed on historical records of the validators' public key
src/connection-manager/chains.ts
Outdated
// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition -- older rippled binaries do not return a network_id field
if (validation.network_id === undefined) {
  const networkNameToChainID = new Map<string, number>()
  networkNameToChainID.set('mainnet', 0)
You can use the networks table for this instead of hardcoding. It will be cleaner and will also ensure that we have all the networks.
ok. I have updated it in this commit: d4f8cea
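For context, a sketch of what such a table-driven lookup could look like (the real change is in d4f8cea); it assumes a knex instance and that the networks table exposes id (the network name) and network_id columns:

import knex from 'knex'

const db = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL ?? '',
})

// Builds the network-name -> network_id map from the database instead of
// hardcoding entries like ('mainnet', 0).
async function getNetworkNameToChainId(): Promise<Map<string, number>> {
  const rows: Array<{ id: string; network_id: number }> = await db('networks')
    .select('id', 'network_id')
    .whereNotNull('network_id')
  return new Map(rows.map((row) => [row.id, row.network_id]))
}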
entry: string
unls: string[]
// Context about the network_id field in rippled: https://xrpl.org/docs/references/protocol/transactions/common-fields#networkid-field
network_id: number
Maybe we should try to get the network_id for existing networks (in the networks table) that don't belong to this file, in setup.ts, using server_state or crawl?
Yes, I'm doing that in this file: src/api/routes/v1/get-network.ts. Please refer to line 254. Here is a snippet:
// check if node public key is already recorded
const { public_key, network_id } = crawl.this_node
const publicKeyNetwork = await getNetworkFromPublicKey(public_key)
if (publicKeyNetwork != null) {
return res.status(200).send({
result: 'success',
network: publicKeyNetwork,
})
}
// add node to networks list
const newNetwork = await addNode(entryUrl, node_unl, port, network_id)
On a more general note, I have updated the crawler to include the network_id information into the crawls table. Please refer to this commit: 699c0a6
The reference file is src/crawler/network.ts.
The get_network call only adds new networks when a user types a valid URL here (which would trigger API calls to the VHS):
https://custom.xrpl.org/
If no one does that for an existing network in the table, then the network_id entry for that row will be empty.
I am updating the network_id values during the crawl process. Here is a snippet of the relevant update:

On a more general note, I have updated the crawler to include the network_id information into the crawls table. Please refer to this commit: 699c0a6
^^ This commit offers more context about this change.
This change ensures that the crawls will always be up-to-date.
With regards to the networks table:
This is the only location in the codebase where we insert into the networks table (all the other insertions are accompanied with the network_id field). As highlighted in this picture, we are immediately invoking the crawl which will populate the network_id value from the server_state response.
This ensures that all entries in the networks table will have an associated network_id value.
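To illustrate the mechanism (the actual implementation lives in src/crawler/network.ts, commit 699c0a6), here is a sketch of pulling network_id from a node's server_state response over JSON-RPC; the response shape shown is an assumption based on rippled's documented API:

interface ServerStateResponse {
  result: {
    state: {
      network_id?: number
    }
  }
}

// Queries a rippled node's JSON-RPC endpoint for server_state and
// extracts the network_id, if the node reports one.
async function fetchNetworkId(rpcUrl: string): Promise<number | undefined> {
  const response = await fetch(rpcUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ method: 'server_state', params: [{}] }),
  })
  const body = (await response.json()) as ServerStateResponse
  return body.result.state.network_id
}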
The crawlNode method saves to the crawls table, not the networks table. Also, as I mentioned, existing rows in the networks table outside of what we manually added in the networks.ts file will not be updated. (addNode only happens on /network/get_network/:entryUrl for a non-existing network on the networks table.)
There are only two places in the codebase where we insert into the networks table. One of them is the addNode method and the other one is inside the setup.ts file. To maintain parity with the existing code base, I have ensured that in both these places, the network_id field is set correctly.
You are right; even if I encounter a new network, it is not recorded inside the networks table. However, this maintains parity with the current production VHS code. As part of this PR, I would not recommend adding new instances of inserting into the networks table.
…cle. Although the scores are stored in the hourly_agreement table, they are not refreshed into the daily_agreement table. This commit fixes this issue and provides a unit test that demonstrates the problematic case
…e mapping between network-names and network_id - remove the test that validates total count of ledgers over 24 hours. This logic is already covered by the tests test/chains/compute-total-ledgers-per-hour.test.ts and test/chains/turn-of-day-agreement.test.ts. Further, this test pollutes the test logs with several hundreds of logs, which decrease readability.
… the hardcoded value of 100 million
Comparison between VHS and Clio: The below table compares the total number of closed ledgers as counted by the VHS and Clio respectively.
We do not know the cause of the difference of ~100 ledgers every day. Here are two possible reasons:
However, we need to compare the benefits offered by this PR against the current VHS production environment. This PR offers a ~10x improvement over prod-VHS, because prod-VHS misses nearly 1000 ledgers per day. The details of this issue can be found in this solution commit: 636a7bb. Please refer to the last column of the above table to understand the benefits of this PR.
Did we record the starting and ending ledgers for each day for each service/version? I think that would give us a better view of why there are some discrepancies between Clio and ours.
Also, I don't think this pull request is meant to improve the accuracy, since we just changed the way we identify networks, which cleans up the code and is also useful for later improvements (like the queue). I think the discrepancies between what's in staging and production might be due to something else (e.g. fewer restarts in staging than in prod, etc.).
  start: yesterdayStart,
  end: yesterdayEnd,
})
// Note: saveDailyAgreement method is idempotent. There is no harm in invoking it for today and previous day.
Any specific reason for doing it for yesterday?
I used yesterday for variable-naming purposes only; it does not refer to any fixed point in time. I am accounting for the ledgers obtained during 2300-2359 hours of every day.
To ensure that these ledgers are not missed, we need to run this computation once again for the previous day's date.
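A sketch of this turn-of-day flow, assuming saveDailyAgreement is idempotent (re-running a day's window simply recomputes the same rows); the date arithmetic and all names other than yesterdayStart/yesterdayEnd are illustrative:

// Recomputes daily agreement for both the current day and the previous
// day, so ledgers validated between 23:00 and 23:59 (after the last
// hourly pass of the previous day) are still counted.
async function refreshDailyAgreement(
  saveDailyAgreement: (range: { start: Date; end: Date }) => Promise<void>,
): Promise<void> {
  const now = new Date()
  const todayStart = new Date(now)
  todayStart.setUTCHours(0, 0, 0, 0)

  const yesterdayStart = new Date(todayStart)
  yesterdayStart.setUTCDate(yesterdayStart.getUTCDate() - 1)
  const yesterdayEnd = new Date(todayStart.getTime() - 1)

  // Idempotent: invoking for today and the previous day is safe.
  await saveDailyAgreement({ start: todayStart, end: now })
  await saveDailyAgreement({ start: yesterdayStart, end: yesterdayEnd })
}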
I see
Until this day, the VHS has not been counting the ledgers processed from 2300-2359 hours. I have also demonstrated this bug via the unit-test: https://github.com/ripple/validator-history-service/pull/396/changes#diff-88d9c84c5ab68911e6db737a6734d26cdea22cb5f203b4b3d3bb38007e53815c
This test fails with the current production VHS code.
I see, great fix!
I have attached the relevant log files, which indicate all the validated ledgers processed every hour in the said time period. The per-hour logs are very useful to correlate potential issues in the VHS. However, doing so would vastly expand the scope of this PR. I agree, it needs to be done in future work. Here is an example log file:
I identified a critical computation issue in the logic of the VHS. This commit has more details: 636a7bb.
As shown in the below image, there have been zero pod restarts on Dec 14, 2025 and Dec 15, 2025 in the VHS-prod environment. However, the ledger counts fall short of the actual number due to the above bug.
Dec-11-2025-Logs.txt



High Level Overview of Change
Use the network_id value from the Validation subscription stream. This PR obviates the need for complex chains assignment logic in the chains.updateChains method.
Context of Change
Before / After
There are no changes to the externally visible APIs.
Test Plan
Existing unit tests pass with this refactor. Additional unit tests have also been added to test the logic of chains.