fix: processes can get stuck during ValueSync

**Problem**
`sync_height` is updated optimistically: as soon as some peer claims to be at a height higher than our `tip_height`, we issue a `ValueRequest` for the range of heights from the local `sync_height` to the local `sync_height+batch_size` and advance `sync_height` past that range. The goal is to avoid having multiple in-flight requests for overlapping ranges of heights. The implicit assumption is that the request will eventually complete successfully. However, this cannot always be guaranteed, and a request that never completes can leave a process stuck. For instance:
- Let `v_1` be an honest validator and `v_2` a Byzantine one.
- `v_1` is behind: maybe it crashed and came back online after a while. Note that consensus may keep deciding blocks as long as at least `f+1` correct processes are alive, provided the Byzantine processes help.
- Because `v_1` is behind, it may get stuck in consensus, i.e., it cannot make progress via consensus because a majority of correct processes no longer participate in those consensus instances. `v_1` can then only catch up via ValueSync.
- Assume that `v_1` is at height `h-2` and the rest are at height `h`.
- Just when `v_1` gets back online, it receives a status message from `v_2` that claims to have decided values up to `h+1`.
- If `v_1` has `batch_size=3`, `v_1` will send a request with range `[h+1, h+2]` and update `sync_height` to `h+3`.
- `v_2` never sends any values back and the request times out because no other peer is at `h+2`, e.g., if consensus has not yet reached a decision for `h+2`.
- `v_1` will never be able to sync: once it updates its `sync_height` to `h+3`, it can only issue requests for ranges `>= h+3`.

The scenario does not actually require `v_2` to be Byzantine, but it is more obvious this way.
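
To make the failure mode concrete, here is a minimal sketch of the optimistic update in isolation. The `SyncState` type, its fields, and the `on_status`/`on_timeout` signatures are hypothetical simplifications for illustration, not the actual sync code; in particular, treating the requested range as `batch_size` consecutive heights starting at `sync_height` is an assumption.

```rust
use std::ops::RangeInclusive;

/// Hypothetical, simplified model of the sync state (not the real type).
struct SyncState {
    tip_height: u64,  // last height decided locally
    sync_height: u64, // next height we intend to request
    batch_size: u64,  // heights per request, assumed >= 1
}

impl SyncState {
    /// A peer claims to have decided up to `peer_tip`.
    /// Returns the range we would request, if any.
    fn on_status(&mut self, peer_tip: u64) -> Option<RangeInclusive<u64>> {
        if peer_tip <= self.tip_height {
            return None; // the peer is not ahead of us
        }
        let start = self.sync_height;
        let end = start + self.batch_size - 1;
        // Optimistic update: advance before any response has arrived.
        self.sync_height = end + 1;
        Some(start..=end)
    }

    /// The request timed out. Nothing rolls `sync_height` back,
    /// which is exactly the problem described above.
    fn on_timeout(&mut self, _range: RangeInclusive<u64>) {}
}
```

With illustrative numbers `tip_height = 8`, `sync_height = 9`, and `batch_size = 3`, a status claiming height 11 triggers a request for `9..=11` and moves `sync_height` to 12. If that request times out, a later status from a peer at height 10 can only trigger a request starting at 12, so heights 9 to 11 are never requested again.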

**Solution**
We can decouple receiving responses from sending requests by introducing a bounded queue `requests_to_sent` in the sync actor, which operates as follows. Each time we get an invalid response, a timeout, etc., instead of re-requesting directly, we store the request in `requests_to_sent`. Similarly, `on_status` simply saves its request in the same queue. A method `periodical_send`, called periodically, then pops requests from `requests_to_sent` and sends them over the network.
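
A rough sketch of this queue-based design follows. Only the names `requests_to_sent`, `on_status`, and `periodical_send` come from the description above; the `SyncActor` shape, the `capacity` and `max_per_tick` parameters, the `enqueue` and `on_timeout` helpers, and the `send` callback are assumptions made for illustration. Dropping a request when the queue is full is one possible policy, not necessarily the intended one.

```rust
use std::collections::VecDeque;
use std::ops::RangeInclusive;

type Height = u64;
type Request = RangeInclusive<Height>;

/// Hypothetical sync actor holding only the bounded request queue.
struct SyncActor {
    requests_to_sent: VecDeque<Request>,
    capacity: usize, // assumed bound on queued requests
}

impl SyncActor {
    fn new(capacity: usize) -> Self {
        Self { requests_to_sent: VecDeque::with_capacity(capacity), capacity }
    }

    /// Stores a request; returns false (and drops it) when the queue is full.
    fn enqueue(&mut self, request: Request) -> bool {
        if self.requests_to_sent.len() >= self.capacity {
            return false;
        }
        self.requests_to_sent.push_back(request);
        true
    }

    /// A status message only records the request; nothing is sent here.
    fn on_status(&mut self, request: Request) -> bool {
        self.enqueue(request)
    }

    /// A timed-out (or invalid) request is queued again instead of
    /// being re-sent directly.
    fn on_timeout(&mut self, request: Request) -> bool {
        self.enqueue(request)
    }

    /// Called periodically: pops up to `max_per_tick` requests and hands
    /// each one to `send`. Returns how many were sent.
    fn periodical_send(
        &mut self,
        max_per_tick: usize,
        mut send: impl FnMut(Request),
    ) -> usize {
        let mut sent = 0;
        while sent < max_per_tick {
            match self.requests_to_sent.pop_front() {
                Some(request) => {
                    send(request);
                    sent += 1;
                }
                None => break,
            }
        }
        sent
    }
}
```

Because a timed-out range goes back into the queue, it is retried on a later tick even if `sync_height` has already moved past it, which addresses the stuck scenario described in the problem statement.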

fix: processes can get stuck during ValueSync #14
