Conversation

@bhartnett bhartnett commented May 6, 2025

When running Fluffy on a local test network under high load, I found that the routing tables of the nodes in my local testnet would break and be unable to recover, because nodes were being removed from the routing table even when the replacement cache was empty. The call to replaceNode in this scenario is triggered by timeouts caused by the high local load, but even so, in my opinion our routing table should be more resilient to failures like these.

To fix the issue, this PR no longer removes the node from the routing table when the replacement cache is empty; instead the node is simply marked as not seen. I retested in my local environment and confirmed that this fixes the issue: my test networks no longer break under high load.

In the banNode call we still remove the node even if the replacement cache is empty because this is a special case where removing the node is actually preferred.
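
To make the new behaviour concrete, here is a minimal, self-contained Nim sketch. The Bucket and Node types below are simplified stand-ins rather than the actual RoutingTable code, and the replaced flag only mirrors the var replaced visible in the hunk quoted further down (whether the real proc returns that flag is not shown here); the actual change is in the replaceNode hunk quoted later in the review.

```nim
# Sketch only: simplified stand-ins for the real routing table types.
type
  Node = ref object
    id: int
    seen: bool

  Bucket = object
    nodes: seq[Node]
    replacementCache: seq[Node]

proc replaceNode(b: var Bucket, n: Node): bool =
  var replaced = false
  let idx = b.nodes.find(n)
  if idx >= 0:
    if b.replacementCache.len > 0:
      # A replacement is available: swap it in for the failing node.
      b.nodes[idx] = b.replacementCache.pop()
      replaced = true
    elif n.seen:
      # Replacement cache is empty: keep the node in the table,
      # but mark it as not seen instead of removing it.
      b.nodes[idx].seen = false
  replaced

var b = Bucket(nodes: @[Node(id: 1, seen: true)])
echo b.replaceNode(b.nodes[0])           # false: cache empty, node kept
echo b.nodes.len, " ", b.nodes[0].seen   # 1 false
```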

@bhartnett bhartnett changed the title p2p RoutingTable: Fix replaceNode so that the node will not be removed if the replacement cache is empty. p2p RoutingTable: node no longer removed in replaceNode when replacement cache empty May 6, 2025

@kdeme kdeme left a comment

There is a longstanding issue related to this: #262

That issue also refers to some testing done on early testnets: #261 (comment)

Basically, the gist of it is/was that the re-pinging of stale nodes was slowing down the enabling of new (unseen / unverified) nodes. This was surely due in part to the state of those early networks.

It is possibly the case now that the quick removal of nodes on failure is more counterproductive, because the same or other nodes eventually have to be re-added / pinged again. This does depend somewhat on how well filled the replacement caches typically are.

Now with this PR, stale nodes that are in buckets close to the local node are unlikely to ever be removed (yet will still be re-pinged). This is going to be a very small subset of the routing table's nodes however, so it probably does not outweigh the benefit? Difficult to really know without some metrics.

# replacements. However, that would require a bit more complexity in the
# revalidation as you don't want to try pinging that node all the time.

var replaced = false
Contributor

nit: I find it slightly cleaner to just return the resulting bool directly in each if/else clause
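
For illustration, applied to a proc shaped like the sketch in the PR description above (simplified stand-in code, not the actual routing table implementation), that suggestion would look roughly like this:

```nim
proc replaceNode(b: var Bucket, n: Node): bool =
  let idx = b.nodes.find(n)
  if idx < 0:
    false
  elif b.replacementCache.len > 0:
    # A replacement is available: swap it in for the failing node.
    b.nodes[idx] = b.replacementCache.pop()
    true
  else:
    # Cache empty: keep the node, just mark it as not seen.
    if n.seen:
      b.nodes[idx].seen = false
    false
```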

Contributor Author

Sure, will update.

-# This node should still be removed
-check (table.getNode(bucketNodes[bucketNodes.high].id)).isNone()
+# This node should not be removed
+check (table.getNode(bucketNodes[bucketNodes.high].id)).isSome()
Contributor

could add a check on n.seen == false here.
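
Something along these lines, for example (a sketch only; it assumes the seen field is reachable on the Node returned by getNode, as suggested by the replaceNode hunk below):

```nim
# In addition to isSome(), assert that the node was marked as not seen:
check (table.getNode(bucketNodes[bucketNodes.high].id)).get().seen == false
```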

Contributor Author

Good point, will do.

 res.isSome()
 res.get() == doubleNode
-table.len == 1
+table.len == 16
Contributor

There was a part of this test that checks whether the order of revalidations is preserved, by checking at the end that the latest added node (doubleNode) is still there (the one not revalidated and thus not replaced).

As all nodes now remain in the table, this getNode check will always pass.

I think the same test can be achieved by checking the seen value of each node.

 for n in bucketNodes:
   table.replaceNode(table.nodeToRevalidate())
-  check (table.getNode(n.id)).isNone()
+  check (table.getNode(n.id)).isSome()
Contributor

Similar to the item above: the getNode check is no longer a good test. I think we need to check the seen value here as well.
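
For example, something like the following for the loop above (sketch only; the exact field access may differ):

```nim
for n in bucketNodes:
  table.replaceNode(table.nodeToRevalidate())
  let res = table.getNode(n.id)
  check:
    res.isSome()
    res.get().seen == false
```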

@bhartnett

@kdeme This is the test that is failing, and I'm not sure how best to fix it:
/home/runner/work/nim-eth/nim-eth/tests/p2p/test_discoveryv5.nim
/home/runner/work/nim-eth/nim-eth/build/p2p/all_tests 'Discovery v5.1 Tests::Resolve target'

See failure in CI here: https://github.com/status-im/nim-eth/pull/791/checks#step:11:555

if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    b.nodes[idx].seen = false
Contributor Author

@kdeme When marking the node as not seen do you think we should also move the node to the end of the bucket (the least recently seen position)?
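
For clarity, that variant could look roughly like this in the hunk above (a sketch only, assuming as in the question that the tail of b.nodes is the least recently seen position):

```nim
if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    var node = b.nodes[idx]
    node.seen = false
    b.nodes.delete(idx)  # remove from its current position...
    b.nodes.add(node)    # ...and re-append at the least recently seen end
```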

@bhartnett

> There is a longstanding issue related to this: #262
>
> That issue also refers to some testing done on early testnets: #261 (comment)

Thanks. Good to have the history linked here.

> Basically, the gist of it is/was that the re-pinging of stale nodes was slowing down the enabling of new (unseen / unverified) nodes. This was surely due in part to the state of those early networks.
>
> It is possibly the case now that the quick removal of nodes on failure is more counterproductive, because the same or other nodes eventually have to be re-added / pinged again. This does depend somewhat on how well filled the replacement caches typically are.

I would think that at the Discv5 level the network is very large, so the replacement caches would likely be very full, so perhaps this change wouldn't have much impact on the Discv5 network. In Portal the replacement caches are likely smaller.

> Now with this PR, stale nodes that are in buckets close to the local node are unlikely to ever be removed (yet will still be re-pinged). This is going to be a very small subset of the routing table's nodes however, so it probably does not outweigh the benefit? Difficult to really know without some metrics.

Well, it is a small number of nodes and the ping process picks a bucket at random, so it is likely a small overhead. I would say this is probably acceptable overhead in the short term, but in the longer term we should use peer scoring in combination with node banning to remove misbehaving nodes from the routing table, which would mitigate the issue you describe.

@bhartnett

> @kdeme This is the test that is failing, and I'm not sure how best to fix it: /home/runner/work/nim-eth/nim-eth/tests/p2p/test_discoveryv5.nim /home/runner/work/nim-eth/nim-eth/build/p2p/all_tests 'Discovery v5.1 Tests::Resolve target'
>
> See failure in CI here: https://github.com/status-im/nim-eth/pull/791/checks#step:11:555

I have now fixed this test in my last commit.

@bhartnett bhartnett requested a review from kdeme May 7, 2025 04:00
@bhartnett bhartnett changed the title p2p RoutingTable: node no longer removed in replaceNode when replacement cache empty RoutingTable node no longer removed in replaceNode when replacement cache empty May 11, 2025