Conversation

@bhartnett bhartnett commented May 6, 2025

When running Fluffy on a local test network under high load, I found that the routing tables of the nodes in my local testnet would break and be unable to recover, because nodes were being removed from the routing table even when the replacement cache was empty. The call to replaceNode in this scenario is triggered by timeouts caused by the high local load, but even so, in my opinion our routing table should be more resilient to failures like these.

To fix the issue, this PR no longer removes the node from the routing table when the replacement cache is empty; instead the node is simply marked as not seen. I retested in my local environment and confirmed that this fixes the issue: my test networks no longer break under high load.

In the banNode call we still remove the node even if the replacement cache is empty because this is a special case where removing the node is actually preferred.
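
To make the new behaviour concrete, here is a minimal, self-contained Nim sketch. The Bucket and Node types below are simplified stand-ins rather than the actual RoutingTable code, and the replaced flag only mirrors the var replaced visible in the hunk quoted further down (whether the real proc returns that flag is not shown here); the actual change is in the replaceNode hunk quoted later in the review.

```nim
# Sketch only: simplified stand-ins for the real routing table types.
type
  Node = ref object
    id: int
    seen: bool

  Bucket = object
    nodes: seq[Node]
    replacementCache: seq[Node]

proc replaceNode(b: var Bucket, n: Node): bool =
  var replaced = false
  let idx = b.nodes.find(n)
  if idx >= 0:
    if b.replacementCache.len > 0:
      # A replacement is available: swap it in for the failing node.
      b.nodes[idx] = b.replacementCache.pop()
      replaced = true
    elif n.seen:
      # Replacement cache is empty: keep the node in the table,
      # but mark it as not seen instead of removing it.
      b.nodes[idx].seen = false
  replaced

var b = Bucket(nodes: @[Node(id: 1, seen: true)])
echo b.replaceNode(b.nodes[0])           # false: cache empty, node kept
echo b.nodes.len, " ", b.nodes[0].seen   # 1 false
```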

@bhartnett bhartnett changed the title p2p RoutingTable: Fix replaceNode so that the node will not be removed if the replacement cache is empty. p2p RoutingTable: node no longer removed in replaceNode when replacement cache empty May 6, 2025

@kdeme kdeme left a comment

There is a longstanding issue related to this: #262

That issue also refers to some testing done on early testnets: #261 (comment)

Basically, the gist of it is/was that the re-pinging of stale nodes was slowing down the enabling of new (unseen / unverified) nodes. This was surely due in part to the state of those early networks.

It is possibly the case now that the quick removal of nodes on failure is more counterproductive, because the same or other nodes eventually have to be re-added / pinged again. This does depend somewhat on how well filled the replacement caches typically are.

Now with this PR, stale nodes that are in buckets close to the local node are unlikely to ever be removed (yet will still be re-pinged). This is going to be a very small subset of the routing table's nodes however, so it probably does not outweigh the benefit? Difficult to really know without some metrics.

# replacements. However, that would require a bit more complexity in the
# revalidation as you don't want to try pinging that node all the time.

var replaced = false
Contributor

nit: I find it slightly cleaner to just return the resulting bool directly in each if/else clause
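
For illustration, applied to a proc shaped like the sketch in the PR description above (simplified stand-in code, not the actual routing table implementation), that suggestion would look roughly like this:

```nim
proc replaceNode(b: var Bucket, n: Node): bool =
  let idx = b.nodes.find(n)
  if idx < 0:
    false
  elif b.replacementCache.len > 0:
    # A replacement is available: swap it in for the failing node.
    b.nodes[idx] = b.replacementCache.pop()
    true
  else:
    # Cache empty: keep the node, just mark it as not seen.
    if n.seen:
      b.nodes[idx].seen = false
    false
```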

Contributor Author

Sure, will update.

-# This node should still be removed
-check (table.getNode(bucketNodes[bucketNodes.high].id)).isNone()
+# This node should not be removed
+check (table.getNode(bucketNodes[bucketNodes.high].id)).isSome()
Contributor

could add a check on n.seen == false here.
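
Something along these lines, for example (a sketch only; it assumes the seen field is reachable on the Node returned by getNode, as suggested by the replaceNode hunk below):

```nim
# In addition to isSome(), assert that the node was marked as not seen:
check (table.getNode(bucketNodes[bucketNodes.high].id)).get().seen == false
```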

Contributor Author

Good point, will do.

 res.isSome()
 res.get() == doubleNode
-table.len == 1
+table.len == 16
Contributor

There was a part of this test that checks whether the order of revalidations is preserved, by checking at the end that the latest added node (doubleNode) is still there (the one not revalidated and thus not replaced).

As all nodes now remain in the table, this getNode check will always pass.

I think the same test can be achieved by checking the seen value of each node.

 for n in bucketNodes:
   table.replaceNode(table.nodeToRevalidate())
-  check (table.getNode(n.id)).isNone()
+  check (table.getNode(n.id)).isSome()
Contributor

Similar to the item above: the getNode check is no longer a good test. I think we need to check the seen value here as well.
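
For example, something like the following for the loop above (sketch only; the exact field access may differ):

```nim
for n in bucketNodes:
  table.replaceNode(table.nodeToRevalidate())
  let res = table.getNode(n.id)
  check:
    res.isSome()
    res.get().seen == false
```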

@bhartnett

@kdeme This is the test that is failing, and I'm not sure how best to fix it:
/home/runner/work/nim-eth/nim-eth/tests/p2p/test_discoveryv5.nim
/home/runner/work/nim-eth/nim-eth/build/p2p/all_tests 'Discovery v5.1 Tests::Resolve target'

See failure in CI here: https://github.com/status-im/nim-eth/pull/791/checks#step:11:555

if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    b.nodes[idx].seen = false
Contributor Author

@kdeme When marking the node as not seen do you think we should also move the node to the end of the bucket (the least recently seen position)?
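
For clarity, that variant could look roughly like this in the hunk above (a sketch only, assuming as in the question that the tail of b.nodes is the least recently seen position):

```nim
if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    var node = b.nodes[idx]
    node.seen = false
    b.nodes.delete(idx)  # remove from its current position...
    b.nodes.add(node)    # ...and re-append at the least recently seen end
```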

@bhartnett

> There is a longstanding issue related to this: #262
>
> That issue also refers to some testing done on early testnets: #261 (comment)

Thanks. Good to have the history linked here.

> Basically, the gist of it is/was that the re-pinging of stale nodes was slowing down the enabling of new (unseen / unverified) nodes. This was surely due in part to the state of those early networks.
>
> It is possibly the case now that the quick removal of nodes on failure is more counterproductive, because the same or other nodes eventually have to be re-added / pinged again. This does depend somewhat on how well filled the replacement caches typically are.

I would think that at the Discv5 level the network is very large, so the replacement caches would likely be very full, so perhaps this change wouldn't have much impact on the Discv5 network. In Portal the replacement caches are likely smaller.

> Now with this PR, stale nodes that are in buckets close to the local node are unlikely to ever be removed (yet will still be re-pinged). This is going to be a very small subset of the routing table's nodes however, so it probably does not outweigh the benefit? Difficult to really know without some metrics.

Well, it is a small number of nodes and the ping process picks a bucket at random, so it is likely a small overhead. I would say this is probably acceptable overhead in the short term, but in the longer term we should use peer scoring in combination with node banning to remove misbehaving nodes from the routing table, which would mitigate the issue you describe.

@bhartnett

> @kdeme This is the test that is failing, and I'm not sure how best to fix it: /home/runner/work/nim-eth/nim-eth/tests/p2p/test_discoveryv5.nim /home/runner/work/nim-eth/nim-eth/build/p2p/all_tests 'Discovery v5.1 Tests::Resolve target'
>
> See failure in CI here: https://github.com/status-im/nim-eth/pull/791/checks#step:11:555

I have now fixed this test in my last commit.

@bhartnett bhartnett requested a review from kdeme May 7, 2025 04:00
@bhartnett bhartnett changed the title p2p RoutingTable: node no longer removed in replaceNode when replacement cache empty RoutingTable node no longer removed in replaceNode when replacement cache empty May 11, 2025