fix: processes can get stuck during ValueSync #14

@angbrav

Description

Problem
sync_height is updated optimistically: as soon as some peer claims to be at a height higher than our tip_height, we issue a ValueRequest for a range of heights from the local sync_height to the local sync_height+batch_size and update sync_height, without waiting for the response. The goal is to avoid having multiple in-flight requests for overlapping ranges of heights. The implicit assumption is that the request will eventually complete successfully. However, this cannot always be guaranteed, and when the request fails the process can get stuck. For instance:

  • Let v_1 be an honest validator and v_2 a Byzantine one.
  • v_1 is behind: maybe it crashed and came back online after a while. Note that consensus may keep deciding blocks as long as at least f+1 correct processes are alive, provided the Byzantine ones help.
  • Being behind, v_1 may get stuck in consensus, i.e., it cannot make progress via consensus because a majority of correct processes no longer participate in those consensus instances. Then, v_1 can only catch up via ValueSync.
  • Assume that v_1 is at height h-2 and the rest are at height h.
  • Just when v_1 gets back online, it receives a status message from v_2 that claims to have decided values up to h+1.
  • If v_1's batch_size=3, v_1 will send a request with range [h+1, h+2] and update sync_height to h+3.
  • v_2 never sends any values back and the request times out because no other peer is at h+2, e.g., if consensus has not yet reached a decision for h+2.
    - v_1 will never be able to sync: once it updates its sync_height to h+3, it can only issue requests for ranges >= h+3.

The scenario does not actually require v_2 to be Byzantine, but it is easier to see this way.
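
The failure mode can be reproduced with a small model. This is only an illustrative sketch in Python, not the actual sync actor: the `SyncState` class, capping the range at the height the peer claims, and the concrete heights are all assumptions made for the example (the heights are illustrative and do not map one-to-one onto the numbers in the scenario above). Only `tip_height`, `sync_height`, `batch_size` and `on_status` are names taken from this issue.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SyncState:
    """Hypothetical, simplified model of the optimistic sync_height update."""
    tip_height: int   # last height decided locally
    sync_height: int  # next height we would ask for
    batch_size: int

    def on_status(self, peer_height: int) -> Optional[Tuple[int, int]]:
        """Return the inclusive range to request, or None if nothing is requested."""
        if peer_height < self.sync_height:
            # The peer claims nothing beyond what we (think we) already requested.
            return None
        start = self.sync_height
        # Assumption: the range is capped at the height the peer claims.
        end = min(start + self.batch_size - 1, peer_height)
        # Optimistic update: sync_height moves before any response arrives.
        self.sync_height = end + 1
        return (start, end)


# v_1 decided up to height 8; a peer claims height 11 but will never answer.
state = SyncState(tip_height=8, sync_height=9, batch_size=3)
lost_request = state.on_status(11)   # (9, 11) is requested and then times out
# An honest peer at height 10 now sends its status.
next_request = state.on_status(10)   # None: sync_height is already 12
print(lost_request, next_request, state.tip_height, state.sync_height)
```

In this model the heights 9 to 11 are never requested again after the timeout, because nothing ever moves `sync_height` back or re-issues the lost range.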

Solution
We can decouple receiving responses from sending requests by introducing a bounded queue requests_to_sent in the sync actor. Each time we get an invalid response, a timeout, etc., instead of re-requesting directly, we store the request in requests_to_sent. Similarly, on_status simply saves the new request in the same queue. A method periodical_send, called periodically, then pops requests from requests_to_sent and sends them over the network.
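
A minimal sketch of how this could look, again in Python and under stated assumptions. `requests_to_sent`, `on_status` and `periodical_send` are the names proposed above; everything else (`SyncActor`, `on_failure`, `on_success`, the `in_flight` set, the `send` callback, and the policy of bounding queued plus in-flight requests by `capacity`) is hypothetical and chosen only to keep the example self-contained. Peer selection is left out entirely.

```python
from collections import deque


class SyncActor:
    """Hypothetical sketch: requests are queued first, sent later."""

    def __init__(self, sync_height, batch_size, capacity, send):
        self.sync_height = sync_height
        self.batch_size = batch_size
        self.capacity = capacity        # assumed bound on queued + in-flight requests
        self.send = send                # assumed callback: send(start, end)
        self.requests_to_sent = deque() # ranges waiting to be sent
        self.in_flight = set()          # ranges sent and not yet answered

    def on_status(self, peer_height):
        if peer_height < self.sync_height:
            return
        if len(self.requests_to_sent) + len(self.in_flight) >= self.capacity:
            return  # at capacity: a later status message will trigger the request
        start = self.sync_height
        end = min(start + self.batch_size - 1, peer_height)
        self.requests_to_sent.append((start, end))
        self.sync_height = end + 1

    def on_failure(self, request):
        """Timeout or invalid response: put the range back instead of dropping it."""
        if request in self.in_flight:
            self.in_flight.remove(request)
            # Retries go to the front so lower heights are fetched first.
            self.requests_to_sent.appendleft(request)

    def on_success(self, request):
        self.in_flight.discard(request)

    def periodical_send(self):
        """Called periodically: drain the queue and send every pending range."""
        while self.requests_to_sent:
            request = self.requests_to_sent.popleft()
            self.in_flight.add(request)
            self.send(*request)


sent = []
actor = SyncActor(sync_height=9, batch_size=3, capacity=4,
                  send=lambda start, end: sent.append((start, end)))
actor.on_status(11)          # queues (9, 11); sync_height becomes 12
actor.periodical_send()      # first attempt is sent
actor.on_failure((9, 11))    # timeout: the range returns to the queue
actor.periodical_send()      # the same range is sent again
print(sent)                  # [(9, 11), (9, 11)]
```

With this structure a failed range is never lost: it moves from `in_flight` back into `requests_to_sent`, so the total number of tracked requests does not grow on retry and the bound still holds. A real implementation would also need to decide which peer to send each popped request to.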
