RoutingTable node no longer removed in replaceNode when replacement cache empty #791
Conversation
kdeme left a comment
There is a longstanding issue related to this: #262
That issue also refers to some testing done on early testnets: #261 (comment)
Basically the gist of it is/was that the re-pinging of stale nodes was slowing down the enabling of new (unseen / unverified) nodes.
This surely had to do in some part with the state of those early networks.
It is possibly the case now that the quick removal of nodes on failure works out as counterproductive, due to eventually having to re-add / ping the same or other nodes. This depends somewhat on how well filled the replacement caches typically are.
Now with this PR, stale nodes in the buckets close to the local node are unlikely to ever be removed (yet they will be re-pinged). This is going to be a very small subset of the nodes in the routing table however, so it probably does not outweigh the benefit? Difficult to really know without some metrics.
```nim
# replacements. However, that would require a bit more complexity in the
# revalidation as you don't want to try pinging that node all the time.

var replaced = false
```
nit: I find it slightly cleaner to just return the resulting bool directly in each if/else clause
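For illustration, a minimal self-contained sketch of that style in Nim; the type and proc here are invented for the example and are not the nim-eth code:

```nim
type Bucket = object
  nodes: seq[int]
  replacementCache: seq[int]

# Expression style: each branch of the if/else yields the resulting
# bool directly, so no mutable `replaced` variable is needed.
proc tryReplace(b: var Bucket, n: int): bool =
  if b.replacementCache.len > 0:
    let idx = b.nodes.find(n)
    if idx >= 0:
      b.nodes[idx] = b.replacementCache.pop()
      true
    else:
      false
  else:
    false
```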
Sure, will update.
tests/p2p/test_routing_table.nim (outdated)

```diff
-# This node should still be removed
-check (table.getNode(bucketNodes[bucketNodes.high].id)).isNone()
+# This node should not be removed
+check (table.getNode(bucketNodes[bucketNodes.high].id)).isSome()
```
Could add a check that `n.seen == false` here.
Good point, will do.
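A sketch of what that added check could look like, using the names visible in the surrounding diffs (it assumes getNode returns an Option and that Node exposes a seen field, as the change in this PR suggests):

```nim
# Within the existing test context: besides asserting the node is
# still present, assert it was flagged as not seen by replaceNode.
let res = table.getNode(bucketNodes[bucketNodes.high].id)
check:
  res.isSome()
  res.get().seen == false
```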
tests/p2p/test_routing_table.nim (outdated)

```diff
   res.isSome()
   res.get() == doubleNode
-  table.len == 1
+  table.len == 16
```
There was a part of this test that checked whether the order of revalidations is preserved, by verifying at the end that the latest added node (doubleNode) is still there (the one not revalidated and thus replaced).
As all nodes now remain, this getNode will always pass.
I think the same test can be achieved by checking the seen value for each node.
tests/p2p/test_routing_table.nim (outdated)

```diff
 for n in bucketNodes:
   table.replaceNode(table.nodeToRevalidate())
-  check (table.getNode(n.id)).isNone()
+  check (table.getNode(n.id)).isSome()
```
Similar item as above: the getNode is no longer a good test. I think we need to check the seen value also.
@kdeme This is the test that is failing and which I'm not sure how best to fix. See the failure in CI here: https://github.com/status-im/nim-eth/pull/791/checks#step:11:555
```nim
if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    b.nodes[idx].seen = false
```
@kdeme When marking the node as not seen, do you think we should also move the node to the end of the bucket (the least recently seen position)?
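For concreteness, a sketch of the variant being asked about, extending the snippet above (it assumes the tail of b.nodes is the least recently seen position, which may not match nim-eth's actual ordering convention):

```nim
if b.replacementCache.len == 0:
  let idx = b.nodes.find(n)
  if idx >= 0 and n.seen:
    b.nodes[idx].seen = false
    # Hypothetical extra step: re-append the node so it sits at the
    # assumed least recently seen end of the bucket.
    let stale = b.nodes[idx]
    b.nodes.delete(idx)
    b.nodes.add(stale)
```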
Thanks. Good to have the history linked here.
I would think at the Discv5 level the network is very large, so the replacement caches would likely be very full; perhaps this change wouldn't have much impact on the Discv5 network. In Portal the replacement caches are likely smaller.
Well, it's a small number of nodes, and the ping process picks a bucket at random, so it's likely a small overhead. I would say in the short term this is probably acceptable overhead, but in the longer term we should use peer scoring in combination with node banning to remove misbehaving nodes from the routing table, which would mitigate the issue you describe.
I have now fixed this test in my last commit.
When running Fluffy on a local test network and sending high load, I found that the routing tables of the nodes in my local testnet would break and be unable to recover, because nodes were being removed from the routing table even when the replacement cache was empty. The call to replaceNode in this scenario is triggered by timeouts caused by the high local load, but even so, our routing table should in my opinion be more resilient to failures like these.
To fix the issue, in this PR we no longer remove the node from the routing table when the replacement cache is empty; instead we simply mark the node as not seen. I retested in my local environment and confirmed that this fixes the issue: my test networks no longer break under high load.
In the banNode call we still remove the node even if the replacement cache is empty, because this is a special case where removing the node is actually preferred.
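To make the two paths concrete, here is a small self-contained toy model of the behavior described above; the types and procs are invented for illustration and are not the actual nim-eth code:

```nim
type
  ToyNode = ref object
    id: int
    seen: bool

  ToyBucket = object
    nodes: seq[ToyNode]
    replacementCache: seq[ToyNode]

# replaceNode path: with a non-empty cache the node is evicted and a
# replacement promoted; with an empty cache it is kept but marked unseen.
proc replaceNode(b: var ToyBucket, n: ToyNode) =
  let idx = b.nodes.find(n)
  if idx < 0:
    return
  if b.replacementCache.len > 0:
    b.nodes[idx] = b.replacementCache.pop()
  elif n.seen:
    b.nodes[idx].seen = false

# banNode path: removal is preferred even when the cache is empty.
proc banNode(b: var ToyBucket, n: ToyNode) =
  let idx = b.nodes.find(n)
  if idx >= 0:
    b.nodes.delete(idx)
```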