Description
Problem
sync_height is updated optimistically: it is advanced as soon as some peer claims to be at a height higher than our tip_height, at which point we issue a ValueRequest for the range of heights from the local sync_height to the local sync_height+batch_size. The goal is to avoid having multiple in-flight requests for overlapping ranges of heights. The implicit assumption is that the request will eventually complete successfully. This cannot always be guaranteed, however, and a request that never completes can cause problems. For instance:
- Let v_1 be an honest validator and v_2 a Byzantine one.
- v_1 is behind: maybe it crashed and came back online after a while. Note that consensus may keep deciding blocks as long as at least f+1 correct validators are alive, if the Byzantine validators help.
- That v_1 is behind implies that it may get stuck in consensus, i.e., it is not able to make progress via consensus because a majority of correct processes are no longer participating in those consensus instances. v_1 can then only catch up via ValueSync.
- Assume that v_1 is at height h-2 and the rest are at height h.
- Just as v_1 gets back online, it receives a status message from v_2 that claims to have decided values up to h+1.
- If v_1's batch_size=3, v_1 will send a request with range [h+1, h+2] and update sync_height to h+3.
- v_2 never sends any values back and the request times out because no other peer is at h+2, e.g., if consensus has not yet reached a decision for h+2.
- v_1 will never be able to sync: once it updates its sync_height to h+3, it can only issue requests for ranges >= h+3.
The scenario actually does not need v_2 to be Byzantine but it is more obvious this way.
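The failure mode can be reproduced in a small, self-contained model. This is only a sketch under stated assumptions, not the actual sync actor: the SyncState struct, the on_timeout hook, and the exact range arithmetic (a request covers at most batch_size heights and is capped at the height the peer claims) are hypothetical and chosen to keep the example short.

```rust
use std::ops::RangeInclusive;

/// Hypothetical, simplified model of the sync state (not the real actor).
struct SyncState {
    /// Lowest height that has not been requested yet.
    sync_height: u64,
    /// Maximum number of heights covered by one request.
    batch_size: u64,
}

impl SyncState {
    /// Called when a peer reports the highest height it claims to have decided.
    /// Returns the range to request, if any, and advances `sync_height`
    /// optimistically, i.e. before any response has arrived.
    fn on_status(&mut self, peer_height: u64) -> Option<RangeInclusive<u64>> {
        if peer_height < self.sync_height {
            return None; // nothing new to request from this peer
        }
        // Assumed arithmetic: at most `batch_size` heights, capped at the claim.
        let last = self.sync_height + self.batch_size.saturating_sub(1);
        let end = peer_height.min(last);
        let range = self.sync_height..=end;
        self.sync_height = end + 1; // optimistic update
        Some(range)
    }

    /// Called when a request times out. Nothing rolls `sync_height` back in
    /// this model, so the timed-out range is never requested again.
    fn on_timeout(&mut self, _range: RangeInclusive<u64>) {}
}
```

In this model, once a request for some range times out, every later call to on_status can only return ranges that start at or above the advanced sync_height, so the heights in the failed range are lost.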
Solution
We can decouple receiving responses from sending requests by introducing a bounded queue requests_to_sent in the sync actor, which operates as follows. Each time we get an invalid response, a timeout, etc., instead of re-requesting directly we store the request in requests_to_sent. Similarly, on_status simply saves its request in the same queue. A method periodical_send, called periodically, then pops requests from requests_to_sent and sends them over the network.
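A minimal sketch of such a queue might look as follows. Only the names requests_to_sent and periodical_send come from the proposal; PendingRequest, RequestQueue, enqueue, the duplicate check, and the max_per_tick parameter are assumptions made for illustration, and the network is stood in for by a caller-supplied closure.

```rust
use std::collections::VecDeque;
use std::ops::RangeInclusive;

/// Hypothetical shape of a queued request: a range of heights to fetch.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PendingRequest {
    range: RangeInclusive<u64>,
}

/// Bounded FIFO queue of requests waiting to be sent.
struct RequestQueue {
    requests_to_sent: VecDeque<PendingRequest>,
    capacity: usize,
}

impl RequestQueue {
    fn new(capacity: usize) -> Self {
        Self {
            requests_to_sent: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Used both by `on_status` (new ranges) and by the failure paths
    /// (timeouts, invalid responses). Returns false if the queue is full
    /// or the same request is already queued (assumed policy).
    fn enqueue(&mut self, req: PendingRequest) -> bool {
        if self.requests_to_sent.len() >= self.capacity
            || self.requests_to_sent.contains(&req)
        {
            return false;
        }
        self.requests_to_sent.push_back(req);
        true
    }

    /// Called periodically: pops up to `max_per_tick` requests in FIFO order
    /// and hands each one to `send`. Returns how many were sent.
    fn periodical_send<F: FnMut(&PendingRequest)>(
        &mut self,
        max_per_tick: usize,
        mut send: F,
    ) -> usize {
        let mut sent = 0;
        while sent < max_per_tick {
            match self.requests_to_sent.pop_front() {
                Some(req) => {
                    send(&req);
                    sent += 1;
                }
                None => break,
            }
        }
        sent
    }
}
```

With this shape, a timed-out range is simply passed to enqueue again, so it is retried on a later tick rather than being dropped. What to do when the queue is full (reject, as here, or evict the oldest entry) is a design choice the proposal leaves open.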