Downscore and retry custody failures #7510
base: peerdas-devnet-7
Conversation
@@ -545,6 +545,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
Every interval (15 sec), we call the continue_custody_by_range / by_root requests, which error a request only if it has been alive for too long. This allows requests to avoid erroring immediately when they do not yet have enough custody peers.
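For reviewers, a minimal sketch of this behaviour, using hypothetical types and an illustrative timeout value (not the actual Lighthouse types or constants; the real logic lives in the custody request modules of network_context):

```rust
use std::time::{Duration, Instant};

/// Stand-in for an idle custody-by-range/by-root request.
struct IdleCustodyRequest {
    created: Instant,
}

enum PollOutcome {
    /// No custody peer yet; keep the request alive instead of erroring.
    KeepWaiting,
    /// A custody peer is available; (re)issue the RPC.
    Issue,
    /// The request has been idle past the deadline; report an error.
    Expired,
}

// Illustrative value only; the PR polls every ~15s and expires requests that have been alive too long.
const MAX_IDLE: Duration = Duration::from_secs(60);

fn poll_idle_request(req: &IdleCustodyRequest, has_custody_peer: bool) -> PollOutcome {
    if has_custody_peer {
        PollOutcome::Issue
    } else if req.created.elapsed() > MAX_IDLE {
        PollOutcome::Expired
    } else {
        PollOutcome::KeepWaiting
    }
}

fn main() {
    let req = IdleCustodyRequest { created: Instant::now() };
    // Called on every sync interval tick.
    match poll_idle_request(&req, false) {
        PollOutcome::KeepWaiting => println!("no custody peer yet, keep waiting"),
        PollOutcome::Issue => println!("custody peer found, send the request"),
        PollOutcome::Expired => println!("idle too long, fail the request"),
    }
}
```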
@@ -442,6 +439,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
Every time a peer joins, attempt to progress custody_by_root and custody_by_range requests
@@ -3,7 +3,6 @@
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
Logic moved to beacon_node/network/src/sync/network_context/block_components_by_range.rs
};

pub mod custody;
Renamed the existing custody module to custody_by_root and added a new one, custody_by_range.
}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
Changed this signature for tests, to have access to all active RPC requests
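As an illustration of how a test might consume the new iterator, here is a hedged sketch with stand-in PeerId / SyncRequestId types (not the actual Lighthouse types or test harness); it also hints at the requests_per_peer check mentioned in the future TODOs:

```rust
use std::collections::HashMap;

// Stand-in types for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);
#[derive(Debug, Clone, Copy)]
struct SyncRequestId(u64);

/// Group all active request ids by the peer they target.
fn requests_per_peer<'a>(
    active: impl Iterator<Item = (SyncRequestId, &'a PeerId)>,
) -> HashMap<PeerId, Vec<SyncRequestId>> {
    let mut map: HashMap<PeerId, Vec<SyncRequestId>> = HashMap::new();
    for (id, peer) in active {
        map.entry(*peer).or_default().push(id);
    }
    map
}

fn main() {
    // A test could feed the `active_requests()` output into a helper like this
    // and assert on how load is spread across peers.
    let peer = PeerId(1);
    let active = vec![(SyncRequestId(10), &peer), (SyncRequestId(11), &peer)];
    let map = requests_per_peer(active.into_iter());
    assert_eq!(map[&peer].len(), 2);
}
```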
Ok((requests, column_to_peer_map))
})
.transpose()?;
network_context no longer spawns _by_range requests; this logic is now inside BlockComponentsByRangeRequest.
        blocks_by_range_request:
            ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
        blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
    },
Maintains the same behaviour for mainnet:
- deneb: issue blocks + blobs requests at the same time
- fulu: issue blocks request first, then columns
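A rough sketch of that issuance order, using hypothetical enums rather than the real BlockComponentsByRangeRequest internals:

```rust
enum ForkName {
    Deneb,
    Fulu,
}

#[derive(Debug, PartialEq)]
enum NextRequests {
    /// Deneb: blocks and blobs are requested at the same time.
    BlocksAndBlobs,
    /// Fulu: request blocks first; columns follow once blocks arrive.
    BlocksOnly,
    Columns,
}

fn initial_requests(fork: ForkName) -> NextRequests {
    match fork {
        ForkName::Deneb => NextRequests::BlocksAndBlobs,
        ForkName::Fulu => NextRequests::BlocksOnly,
    }
}

fn after_blocks_received(fork: ForkName) -> Option<NextRequests> {
    match fork {
        // Blobs were already requested alongside the blocks.
        ForkName::Deneb => None,
        // Now that the blocks are known, issue the data column requests.
        ForkName::Fulu => Some(NextRequests::Columns),
    }
}

fn main() {
    assert_eq!(initial_requests(ForkName::Deneb), NextRequests::BlocksAndBlobs);
    assert_eq!(initial_requests(ForkName::Fulu), NextRequests::BlocksOnly);
    assert_eq!(after_blocks_received(ForkName::Fulu), Some(NextRequests::Columns));
}
```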
}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
    /// No peer available matching the required criteria
    NoPeer(NoPeerError),
Requests no longer error on send if they don't have peers. Instead, custody_by_root and custody_by_range requests are left idle for some time, waiting for suitable peers to appear.
@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.
@@ -221,6 +256,12 @@ impl<T: BeaconChainTypes> SyncingChain<T> {
        request_id: Id,
        blocks: Vec<RpcBlock<T::EthSpec>>,
    ) -> ProcessingResult {
        // Account for one more request to this peer
        // TODO(das): this code assumes that we do a single request per peer per RpcBlock
This would be true in the case below:
peerA: block
peerB: col_1, col_2
peerC: col_2, col_3
we get 1 req for each peer.
BUT for this scenario:
peerA: block, col_5
peerB: col_1, col_2
peerC: col_2, col_3
we still get 1 for each peer, which isn't fully correct (peerA actually serves both the block and col_5).
Or we could just use BatchPeers and count the block peer separately and correctly.
Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.
After discussing with @pawanjay176 we decided to not track download failures. We rely on randomness to select a different peer, as we expect the syncing chains to have a decent number of peers. For the custody by range requests there's an internal failed_peers set that will de-prioritize peers with network errors. The failed_peers set at the batch level is still used to track peers that sent blocks or columns that failed processing, ensuring that we retry from a different peer.
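For clarity, a small sketch of that selection idea under these assumptions (illustrative names and types, not the actual implementation): prefer custody peers that are not in the failed set, and fall back to failed peers only when nothing else is available.

```rust
use std::collections::HashSet;

/// Pick a peer for the next attempt, de-prioritizing peers that already failed.
/// `seed` stands in for the randomness used to spread load across peers.
fn select_peer(custody_peers: &[u64], failed_peers: &HashSet<u64>, seed: usize) -> Option<u64> {
    let healthy: Vec<u64> = custody_peers
        .iter()
        .copied()
        .filter(|p| !failed_peers.contains(p))
        .collect();
    // Fall back to failed peers only when no other custody peer exists.
    let pool = if healthy.is_empty() {
        custody_peers.to_vec()
    } else {
        healthy
    };
    if pool.is_empty() {
        return None;
    }
    pool.get(seed % pool.len()).copied()
}

fn main() {
    let peers = vec![1, 2, 3];
    let failed: HashSet<u64> = [2].into_iter().collect();
    // Peer 2 had a network error, so the selection comes from {1, 3} first.
    assert_ne!(select_peer(&peers, &failed, 7), Some(2));
}
```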
I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling to review this in its current state, with potential missing pieces and a bunch of outstanding TODOs. I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once it's complete. What do you think?
I've just merged the latest base branch. I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
Looks like it's struggling with backfill - the link below is a test run on sepolia where both nodes transitioned to the synced state. On PeerDAS, the supernode struggles to sync to head, but the fullnode did complete backfilling 1000 slots in 506 seconds. Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885
@dapplion Would you mind taking a look at the failing CI tests in the PR?
Not sure if this is ready for review? It's crashing on mainnet with:
Issue Addressed

Partially addresses

Proposed Changes

TBD

1000 lines are for new unit tests :)

Questions / TODO

Tests to add

Future TODOs

- subscribe_all_data_column_subnets to determine is_supernode
- on_xxx_response, suggestion -> Downscore and retry custody failures #7510 (comment)
- add_peer_* functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
- requests_per_peer to ensure that eventually we fetch data from all peers in the chain