Downscore and retry custody failures #7510


Open

wants to merge 15 commits into base: peerdas-devnet-7

Conversation

dapplion
Collaborator

@dapplion dapplion commented May 22, 2025

Issue Addressed

Partially addresses

Proposed Changes

TBD

1000 lines are for new unit tests :)

Questions / TODO

  • Should custody_by_range requests try all possible peers before giving up? I.e., should they ignore the failures counter for custody failures? Add a test for this behaviour.

Tests to add

  • Peer does not send columns at all for a specific index
  • We find no one serving columns for a specific index; how do we recover?
  • Random tests where most peers are faulty and we have to keep cycling through them. Use randomness and run the tests with different fault percentages => build a simulator (see the sketch after this list)
  • Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
  • Send non-matching data columns
  • Test downscoring of custody failures
  • Test the fallback mechanism of peer selection
  • Test a syncing chain reaching 0 peers, both for forward sync and backward sync
  • Test forward sync on Fulu with blocks without data
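
A rough sketch of what the randomized fault-simulator test could look like (the helper name, peer count, fault rates, and rand usage below are assumptions for illustration, not code from this PR):

    use rand::{rngs::StdRng, Rng, SeedableRng};

    /// Sketch only: run the same sync scenario at a given fault rate with a fixed
    /// seed, so a failing run can be reproduced.
    fn run_fault_scenario(fault_rate: f64, seed: u64) {
        let mut rng = StdRng::seed_from_u64(seed);
        // true = this peer will misbehave (wrong columns, empty responses, ...)
        let faulty: Vec<bool> = (0..32).map(|_| rng.gen_bool(fault_rate)).collect();
        // A real test would drive the sync TestRig here: honest peers serve valid
        // columns, faulty peers do not, and the assertion is that sync completes
        // by cycling through peers.
        let honest = faulty.iter().filter(|f| !**f).count();
        assert!(honest > 0, "scenario needs at least one honest peer to resolve");
    }

Running it at several fault rates (e.g. 0.25, 0.5, 0.75) would cover the "different levels of fault %" point.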

@dapplion dapplion requested a review from jxs as a code owner May 22, 2025 05:17
@dapplion dapplion added the work-in-progress (PR is a work-in-progress) and das (Data Availability Sampling) labels May 22, 2025
@@ -545,6 +545,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
Collaborator Author

@dapplion dapplion May 22, 2025

Every interval (15 seconds), we call continue_custody_by_range / continue_custody_by_root requests, which will cause a request to error if it has been alive for too long. This allows requests to avoid erroring immediately when they do not yet have enough custody peers.
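
A minimal sketch of the expiry behaviour described here, assuming the request records when it started and compares against a timeout (the names and the 60-second value are illustrative, not the PR's actual fields):

    use std::time::{Duration, Instant};

    /// Illustrative timeout; the real value in the PR may differ.
    const CUSTODY_REQUEST_TIMEOUT: Duration = Duration::from_secs(60);

    struct CustodyByRangeRequest {
        started_at: Instant,
        // ... columns still waiting for a serving custody peer, etc.
    }

    impl CustodyByRangeRequest {
        /// Called from `continue_custody_by_range_requests` on every sync interval.
        fn continue_request(&mut self) -> Result<(), &'static str> {
            if self.started_at.elapsed() > CUSTODY_REQUEST_TIMEOUT {
                // Only now does the request error; until the timeout it simply
                // stays idle, waiting for suitable custody peers to connect.
                return Err("no custody peers found before timeout");
            }
            Ok(())
        }
    }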

@@ -442,6 +439,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
Collaborator Author

Every time a peer joins, we attempt to progress the custody_by_root and custody_by_range requests.

@@ -3,7 +3,6 @@
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
Collaborator Author

Logic moved to beacon_node/network/src/sync/network_context/block_components_by_range.rs

};

pub mod custody;
Collaborator Author

Renamed the existing custody module to custody_by_root and added a new one, custody_by_range.
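
For orientation, the resulting module layout is roughly the following (a sketch; exact file paths may differ):

    // beacon_node/network/src/sync/network_context (sketch of the new layout)
    pub mod custody_by_root;  // the previous `custody` module, renamed
    pub mod custody_by_range; // new module introduced in this PR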

}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
Collaborator Author

Changed this signature for tests, to have access to all active RPC requests
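
As a hedged illustration of how a test can use the new signature, a small helper that groups in-flight requests by the peer serving them (generic stand-ins for SyncRequestId and PeerId; not code from the PR):

    use std::collections::HashMap;
    use std::hash::Hash;

    /// Sketch: group every active request id by its peer, so a test can assert
    /// which RPCs are in flight against which peer.
    fn group_by_peer<Id, Peer>(active: impl Iterator<Item = (Id, Peer)>) -> HashMap<Peer, Vec<Id>>
    where
        Peer: Eq + Hash,
    {
        let mut map: HashMap<Peer, Vec<Id>> = HashMap::new();
        for (id, peer) in active {
            map.entry(peer).or_default().push(id);
        }
        map
    }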


Ok((requests, column_to_peer_map))
})
.transpose()?;
Collaborator Author

network_context no longer spawns _by_range requests; this logic now lives inside BlockComponentsByRangeRequest.

blocks_by_range_request:
ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
},
Collaborator Author

@dapplion dapplion May 22, 2025

Maintains the same behaviour for mainnet:

  • deneb: issue blocks + blobs requests at the same time
  • fulu: issue blocks request first, then columns
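
A hedged sketch of that fork-dependent dispatch (the enum and request names below are illustrative; the real code presumably branches on the actual fork type):

    /// Illustrative fork marker, standing in for the real fork check.
    #[derive(Clone, Copy)]
    enum DataEra {
        Deneb, // blob sidecars
        Fulu,  // data columns (PeerDAS)
    }

    /// Which by-range requests to issue when a batch starts.
    fn initial_requests(era: DataEra) -> &'static [&'static str] {
        match era {
            // Deneb: blocks and blobs requests go out at the same time.
            DataEra::Deneb => &["blocks_by_range", "blobs_by_range"],
            // Fulu: only the blocks request goes out first; column requests are
            // issued afterwards, once the blocks are known.
            DataEra::Fulu => &["blocks_by_range"],
        }
    }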

}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
/// No peer available matching the required criteria
NoPeer(NoPeerError),
Collaborator Author

Requests no longer error on send if they don't have peers. Instead, custody_by_root and custody_by_range requests are left idle for some time, waiting for suitable peers.
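
In other words, the send path now reports something like "awaiting peers" instead of failing with NoPeer; a minimal sketch under that assumption (names illustrative):

    /// Illustrative outcome of trying to start a custody request.
    enum StartOutcome {
        /// Requests were issued to suitable custody peers.
        Sent,
        /// No suitable peer yet: keep the request idle and retry on the next sync
        /// interval or when a new peer connects, instead of returning an error.
        AwaitingPeers,
    }

    fn try_start(custody_peer_count: usize) -> StartOutcome {
        if custody_peer_count == 0 {
            StartOutcome::AwaitingPeers
        } else {
            StartOutcome::Sent
        }
    }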

Member

@jimmygchen jimmygchen left a comment

@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.

trusted_peers: Vec<PeerId>,
config: Arc<NetworkConfig>,
spec: Arc<ChainSpec>,
is_supernode: bool,
Member

I don't think it matters currently, but thought it might be worth mentioning: we use config.subscribe_all_data_column_subnets to determine the gossip topics, and it doesn't seem like we use it for determining is_supernode elsewhere. Something to keep in mind: if we want consistency, maybe we can derive is_supernode from the network config, though that may not be relevant once we move to validator custody.

Collaborator Author

Good point, we should revisit after implementing validator custody. Since this function is only for unit tests without gossip, it doesn't matter here.

@@ -1186,10 +1189,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
block: RpcEvent<Arc<SignedBeaconBlock<T::EthSpec>>>,
) {
if let Some(resp) = self.network.on_blocks_by_range_response(id, peer_id, block) {
self.on_range_components_response(
self.on_block_components_by_range_response(
Member

I often get confused here, as we have layers of on_xxx_by_range_response functions. What about renaming:

  • resp to result
  • self.on_block_components_by_range_response to self.on_range_components_rpc_result, and adding a comment describing it:
Suggested change
self.on_block_components_by_range_response(
// Process it further if the response results in a completed request or a failure that needs to be handled.
self.on_range_components_rpc_result(

Member

It might also help to add a comment to self.network.on_blocks_by_range_response. I wrote these notes while reviewing:

    /// Processes the `BlocksByRange` response (`RpcEvent`). Depending on the event, it may:
    /// - update the active request state; or do nothing.
    /// - complete the active request and return an `Ok` response containing a list of response objects.
    /// - propagate the error if necessary.
    pub(crate) fn on_blocks_by_range_response(
        &mut self,
        id: BlocksByRangeRequestId,
        peer_id: PeerId,
        rpc_event: RpcEvent<Arc<SignedBeaconBlock<T::EthSpec>>>,
    ) -> Option<RpcResponseResult<Vec<Arc<SignedBeaconBlock<T::EthSpec>>>>> {

@jimmygchen jimmygchen added the under-review (A reviewer has only partially completed a review.) label May 26, 2025
self.add_connected_peer_testing_only(false)
}

// Don't make pub, use `add_connected_peer_testing_only`
Member

What's the reason for this comment?

self.add_connected_peer_testing_only(true)
}

pub fn add_connected_peer_testing_only(&mut self, supernode: bool) -> PeerId {
Member

testing_only seems redundant for a TestRig function

Collaborator Author

I agree the add_peer functions are a bit of a mess, but changing them will trigger a big diff on the lookup tests, which use them extensively

.custody_subnets_iter()
.map(|subnet| **subnet)
.collect::<Vec<_>>();
peer_custody_subnets.sort_unstable();
Member

Feels like sorting may be useful for debugging in prod? Should we just sort them after computing them here?

let custody_subnets = custody_groups
.into_iter()
.flat_map(|custody_index| {
self.subnets_by_custody_group
.get(&custody_index)
.cloned()
.unwrap_or_else(|| {
warn!(
%custody_index,
%peer_id,
"Custody group not found in subnet mapping"
);
vec![]
})
})
.collect();
peer_info.set_custody_subnets(custody_subnets);

Collaborator Author

@dapplion dapplion May 27, 2025

In the code snippet you linked, we collect into a HashSet for set_custody_subnets.

.peers
.write()
.__add_connected_peer_testing_only(true, &self.harness.spec, key)
pub fn add_sync_peer(&mut self, supernode: bool, remote_info: SyncInfo) -> PeerId {
Member

I'm a little confused by all the add peer functions here; is it possible to consolidate some of them?

  • add_connected_peer: adds a peer with connected status (but no status yet)
  • add_connected_peer_with_status: adds a connected peer and also send the SyncInfo to sync

If not, I think some comments will help with readability here.

Collaborator Author

Added comments.

I agree the add_peer functions are a bit of a mess, but changing them will trigger a big diff on the lookup tests, which use them extensively

Self::PreDeneb => panic!("no requests PreDeneb"),
Self::PrePeerDAS(_, peer, _) => *peer,
Self::PostPeerDAS(reqs) => {
if reqs.len() != 1 {
Member

Why exactly 1 here?

Collaborator Author

Added a comment to the function

    /// If there's a single active request, returns its peer, else panics
    fn peer(&self) -> PeerId {

Where this function is used, I expect a single request. If there are more and we return the first peer, it feels a bit random and too loose.

@@ -221,6 +256,12 @@ impl<T: BeaconChainTypes> SyncingChain<T> {
request_id: Id,
blocks: Vec<RpcBlock<T::EthSpec>>,
) -> ProcessingResult {
// Account for one more request to this peer
// TODO(das): this code assumes that we do a single request per peer per RpcBlock
Member

This would be true in the case below:

peerA: block
peerB: col_1, col_2
peerC: col_2, col_3

we get 1 request for each peer.

BUT for this scenario:

peerA: block, col_5
peerB: col_1, col_2
peerC: col_2, col_3

we still get 1 per peer, which isn't fully correct (peerA actually served two requests).

Or we could just use BatchPeers and count the block peer separately and correctly.

Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
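
A hedged sketch of the "count the block peer separately" idea (BatchPeers is the name mentioned above; the fields and counting below are assumptions, generic over the peer id type):

    use std::collections::HashMap;

    /// Illustrative per-batch bookkeeping: the block peer and the column peers are
    /// tracked separately, so a peer serving both is counted once per request.
    struct BatchPeers<P: Eq + std::hash::Hash + Copy> {
        block_peer: P,
        column_peers: Vec<P>,
    }

    impl<P: Eq + std::hash::Hash + Copy> BatchPeers<P> {
        /// Requests attributed to each peer for this batch.
        fn requests_per_peer(&self) -> HashMap<P, usize> {
            let mut counts = HashMap::new();
            *counts.entry(self.block_peer).or_insert(0) += 1;
            for peer in &self.column_peers {
                *counts.entry(*peer).or_insert(0) += 1;
            }
            counts
        }
    }

With this, peerA in the second scenario above would be charged 2 requests (block + col_5) instead of 1.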

batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
Member

We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.

@jimmygchen
Member

jimmygchen commented May 27, 2025

I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling to review this in its current state, with potential missing pieces and a bunch of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, intentions, and what the working solution would eventually look like.

I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once this is complete. What do you think?

Labels
das (Data Availability Sampling), under-review (A reviewer has only partially completed a review.), work-in-progress (PR is a work-in-progress)
2 participants