Downscore and retry custody failures #7510
base: peerdas-devnet-7
Conversation
@@ -545,6 +545,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
Every interval (15 sec) we call continue_custody_by_range / continue_custody_by_root, which errors a request only once it has been alive for too long. This way requests don't fail immediately when they don't yet have enough custody peers.
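To make this concrete, here is a minimal, self-contained sketch of the tick-driven continuation described above; the 15-second interval comes from the comment, while all names and the max-idle value are assumptions for illustration only:

use std::time::{Duration, Instant};

struct CustodyByRangeRequest {
    created: Instant,
}

enum ContinueResult {
    StillPending,
    Errored,
}

// Revisited on every ~15 second tick: the request only errors once it has been
// alive longer than max_idle, giving new custody peers time to connect.
fn continue_request(
    req: &CustodyByRangeRequest,
    has_custody_peers: bool,
    max_idle: Duration,
) -> ContinueResult {
    if has_custody_peers {
        // A real implementation would (re)issue the RPC to a custody peer here.
        ContinueResult::StillPending
    } else if req.created.elapsed() > max_idle {
        ContinueResult::Errored
    } else {
        ContinueResult::StillPending
    }
}

fn main() {
    let req = CustodyByRangeRequest { created: Instant::now() };
    // A fresh request with no custody peers stays pending instead of erroring.
    assert!(matches!(
        continue_request(&req, false, Duration::from_secs(60)),
        ContinueResult::StillPending
    ));
}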
@@ -442,6 +439,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
Every time a peer joins, attempt to progress custody_by_root and custody_by_range requests
@@ -3,7 +3,6 @@
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
Logic moved to beacon_node/network/src/sync/network_context/block_components_by_range.rs
};

pub mod custody;
Renamed the existing custody module to custody_by_root and added a new one, custody_by_range.
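For orientation, the resulting module layout would look roughly like this (paths and comments are assumptions based on the note above, not copied from the PR):

// beacon_node/network/src/sync/network_context/mod.rs (sketch)
pub mod custody_by_root;   // previously `pub mod custody;`
pub mod custody_by_range;  // new: custody requests driven by range sync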
}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
Changed this signature so tests have access to all active RPC requests.
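A minimal, self-contained sketch of the pattern (types simplified; not the actual Lighthouse types):

use std::collections::HashMap;

type SyncRequestId = u64;
type PeerId = String;

struct Network {
    blocks_by_range: HashMap<SyncRequestId, PeerId>,
    data_columns_by_range: HashMap<SyncRequestId, PeerId>,
}

impl Network {
    // Returns the ids of all active requests together with the peer serving each,
    // so tests can assert on every in-flight RPC request, not just one kind.
    fn active_requests(&self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
        self.blocks_by_range
            .iter()
            .chain(self.data_columns_by_range.iter())
            .map(|(id, peer)| (*id, peer))
    }
}

fn main() {
    let network = Network {
        blocks_by_range: HashMap::from([(1, "peer_a".to_string())]),
        data_columns_by_range: HashMap::from([(2, "peer_b".to_string())]),
    };
    assert_eq!(network.active_requests().count(), 2);
}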
    Ok((requests, column_to_peer_map))
})
.transpose()?;
network_context no longer spawns _by_range requests; this logic now lives inside BlockComponentsByRangeRequest.
blocks_by_range_request:
    ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
},
Maintains the same behaviour for mainnet:
- deneb: issue blocks + blobs requests at the same time
- fulu: issue blocks request first, then columns
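A self-contained sketch of that fork-dependent ordering (the Fork and Request types are simplified stand-ins, not the PR's actual types):

#[derive(Debug, PartialEq)]
enum Fork {
    Deneb,
    Fulu,
}

#[derive(Debug, PartialEq)]
enum Request {
    BlocksByRange,
    BlobsByRange,
}

// Which by-range requests are issued up front for a given fork.
fn initial_requests(fork: Fork) -> Vec<Request> {
    match fork {
        // Pre-PeerDAS (deneb): blocks and blobs requests go out at the same time.
        Fork::Deneb => vec![Request::BlocksByRange, Request::BlobsByRange],
        // PeerDAS (fulu): only the blocks request is issued first; data column
        // requests follow once the blocks are known.
        Fork::Fulu => vec![Request::BlocksByRange],
    }
}

fn main() {
    assert_eq!(initial_requests(Fork::Deneb).len(), 2);
    assert_eq!(initial_requests(Fork::Fulu), vec![Request::BlocksByRange]);
}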
}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
    /// No peer available matching the required criteria
    NoPeer(NoPeerError),
Requests no longer error on send if they don't have peers. Instead, custody_by_root and custody_by_range requests are left idle for some time, waiting for suitable peers to appear.
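A tiny sketch of the difference on the send path (all names here are illustrative, not the actual API):

enum SendOutcome {
    Sent { to_peer: String },
    // No peer matched the custody criteria: the request is parked rather than
    // failed, and the periodic continue_* calls retry it until a deadline passes.
    ParkedWaitingForPeers,
}

fn send_custody_request(available_custody_peers: &[String]) -> SendOutcome {
    match available_custody_peers.first() {
        Some(peer) => SendOutcome::Sent {
            to_peer: peer.clone(),
        },
        None => SendOutcome::ParkedWaitingForPeers,
    }
}

fn main() {
    // Previously this situation would have surfaced as an immediate NoPeer error.
    assert!(matches!(
        send_custody_request(&[]),
        SendOutcome::ParkedWaitingForPeers
    ));
}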
@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.
trusted_peers: Vec<PeerId>,
config: Arc<NetworkConfig>,
spec: Arc<ChainSpec>,
is_supernode: bool,
I don't think it matters currently, but thought it might be worth mentioning: we use config.subscribe_all_data_column_subnets to determine the gossip topics, and it doesn't seem like we use it for determining is_supernode elsewhere. Something to keep in mind - if we want consistency maybe we can derive is_supernode from the network config, but that may not be relevant once we move to validator custody.
Good point, we should revisit after implementing validator custody. Since this function is for unit tests without gossip, it doesn't matter here.
@@ -1186,10 +1189,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
        block: RpcEvent<Arc<SignedBeaconBlock<T::EthSpec>>>,
    ) {
        if let Some(resp) = self.network.on_blocks_by_range_response(id, peer_id, block) {
            self.on_range_components_response(
            self.on_block_components_by_range_response(
I often get confused here as we have layers of on_xxx_by_range_response functions. What about renaming resp to result and self.on_block_components_by_range_response to self.on_range_components_rpc_result, and adding a comment describing it:
self.on_block_components_by_range_response(
// Process it further if the response results in a completed request or a failure that needs to be handled.
self.on_range_components_rpc_result(
It also might help to add a comment to self.network.on_blocks_by_range_response. I wrote these notes while reviewing:
/// Processes the `BlocksByRange` response (`RpcEvent`), depending on the event, it may:
/// - update the active request state; or do nothing.
/// - complete the active request and return an `Ok` response containing a list of response objects.
/// - propagate the error if necessary.
pub(crate) fn on_blocks_by_range_response(
&mut self,
id: BlocksByRangeRequestId,
peer_id: PeerId,
rpc_event: RpcEvent<Arc<SignedBeaconBlock<T::EthSpec>>>,
) -> Option<RpcResponseResult<Vec<Arc<SignedBeaconBlock<T::EthSpec>>>>> {
    self.add_connected_peer_testing_only(false)
}

// Don't make pub, use `add_connected_peer_testing_only`
What's the reason for this comment?
    self.add_connected_peer_testing_only(true)
}

pub fn add_connected_peer_testing_only(&mut self, supernode: bool) -> PeerId {
testing_only seems redundant for a TestRig function.
I agree the add_peer functions are a bit of a mess, but changing them will trigger a big diff on the lookup tests, which use them extensively
    .custody_subnets_iter()
    .map(|subnet| **subnet)
    .collect::<Vec<_>>();
peer_custody_subnets.sort_unstable();
Feels like sorting may be useful for debugging in prod? Should we just sort them after computing here:
lighthouse/beacon_node/lighthouse_network/src/peer_manager/mod.rs, lines 746 to 762 in a497ec6
let custody_subnets = custody_groups
    .into_iter()
    .flat_map(|custody_index| {
        self.subnets_by_custody_group
            .get(&custody_index)
            .cloned()
            .unwrap_or_else(|| {
                warn!(
                    %custody_index,
                    %peer_id,
                    "Custody group not found in subnet mapping"
                );
                vec![]
            })
    })
    .collect();
peer_info.set_custody_subnets(custody_subnets);
In the code snippet you linked we collect into a HashSet for set_custody_subnets.
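If sorted output is only wanted for logging, a small helper would do; a sketch with subnet ids simplified to u64:

use std::collections::HashSet;

// Produce a stable, sorted view of the custody subnets for log output while the
// peer-info field itself stays a HashSet.
fn sorted_for_logging(custody_subnets: &HashSet<u64>) -> Vec<u64> {
    let mut subnets: Vec<u64> = custody_subnets.iter().copied().collect();
    subnets.sort_unstable();
    subnets
}

fn main() {
    let subnets = HashSet::from([7, 2, 5]);
    assert_eq!(sorted_for_logging(&subnets), vec![2, 5, 7]);
}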
    .peers
    .write()
    .__add_connected_peer_testing_only(true, &self.harness.spec, key)
pub fn add_sync_peer(&mut self, supernode: bool, remote_info: SyncInfo) -> PeerId {
I'm a little confused by all the add peer functions here, is it possible to consolidate some of them?
- add_connected_peer: adds a peer with connected status (but no status yet)
- add_connected_peer_with_status: adds a connected peer and also sends the SyncInfo to sync
If not, I think some comments will help with readability here.
Added comments.
I agree the add_peer functions are a bit of a mess, but changing them will trigger a big diff on the lookup tests, which use them extensively
Self::PreDeneb => panic!("no requests PreDeneb"),
Self::PrePeerDAS(_, peer, _) => *peer,
Self::PostPeerDAS(reqs) => {
    if reqs.len() != 1 {
Why exactly 1 here?
Added a comment to the function:

/// If there's a single active request, returns its peer, else panics
fn peer(&self) -> PeerId {

Where this function is used, I expect a single request. If there are more and we return the first peer, it feels a bit random and too loose.
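For reference, a self-contained sketch of that accessor's contract (struct and types simplified, not the PR's actual code):

type PeerId = u64;

struct ActiveRequests {
    peers: Vec<PeerId>,
}

impl ActiveRequests {
    /// If there's a single active request, returns its peer, else panics.
    /// Returning "the first" of several peers would hide a broken test assumption,
    /// hence the strict length check instead of a loose fallback.
    fn peer(&self) -> PeerId {
        assert!(
            self.peers.len() == 1,
            "expected exactly one active request, found {}",
            self.peers.len()
        );
        self.peers[0]
    }
}

fn main() {
    let reqs = ActiveRequests { peers: vec![42] };
    assert_eq!(reqs.peer(), 42);
}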
@@ -221,6 +256,12 @@ impl<T: BeaconChainTypes> SyncingChain<T> {
        request_id: Id,
        blocks: Vec<RpcBlock<T::EthSpec>>,
    ) -> ProcessingResult {
        // Account for one more requests to this peer
        // TODO(das): this code assumes that we do a single request per peer per RpcBlock
This would be true in the case below:
peerA: block
peerB: col_1, col_2
peerC: col_2, col_3
where we get 1 request for each peer.
BUT for the scenario below:
peerA: block, col_5
peerB: col_1, col_2
peerC: col_2, col_3
we still get 1 for each peer, which isn't fully correct.
Or we could just use BatchPeers and count the block peer separately and correctly. Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
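To make the two scenarios concrete, here is a small sketch of the per-peer counting assumption being discussed (the peer and column labels are the ones above; everything else is illustrative):

use std::collections::HashMap;

// Count "one request per peer per RpcBlock", the assumption in the TODO above.
fn requests_per_peer(assignments: &[(&str, &[&str])]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for (peer, _components) in assignments {
        // Each peer is charged once per RpcBlock, even if it served both the
        // block and one or more columns (the second scenario above).
        *counts.entry((*peer).to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Second scenario: peerA serves the block and col_5 but is still counted once.
    let counts = requests_per_peer(&[
        ("peerA", ["block", "col_5"].as_slice()),
        ("peerB", ["col_1", "col_2"].as_slice()),
        ("peerC", ["col_2", "col_3"].as_slice()),
    ]);
    assert_eq!(counts["peerA"], 1);
}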
batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.
I've added a few more comments. I've spent quite a bit of time reading but I'm really struggling to review this in its current state, with potential missing pieces and a number of outstanding items. I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once this is complete. What do you think?
Issue Addressed
Partially addresses
Proposed Changes
TBD
1000 lines are for new unit tests :)
Questions / TODO
Tests to add