
Downscore and retry custody failures #7510


Open · wants to merge 29 commits into base: peerdas-devnet-7

Conversation

@dapplion (Collaborator) commented May 22, 2025

Issue Addressed

Partially addresses

Proposed Changes

TBD

1000 lines are for new unit tests :)

Questions / TODO

  • Should custody_by_range requests try all possible peers before giving up? i.e. should they ignore the failures counter for custody failures? Add a test for this behaviour.

Tests to add

  • Peer does not send columns at all on a specific index
  • We find no one serving columns on a specific index; how to recover
  • Random tests where most peers are faulty and we have to keep cycling them. Use randomness and run the tests with different levels of fault % => build a simulator (see the sketch after this list)
  • Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
  • Send non-matching data columns
  • Test downscoring of custody failures
  • Test the fallback mechanism of peer selection
  • Test a syncing chain reaching 0 peers, both for forward sync and backwards sync
  • Test forwards sync on Fulu with blocks without data
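
A minimal sketch of the randomized-fault simulator idea mentioned in the list above; the peer model, PRNG, and retry loop are assumptions for illustration, not the PR's test harness:

```rust
// Minimal randomized-fault "simulator" sketch: a fraction of peers never
// serve a requested column, and the test keeps cycling peers until the
// column resolves or every peer has been tried. Deterministic seeding keeps
// each run reproducible across fault percentages.
fn column_resolves(num_peers: usize, fault_percent: u32, seed: u64) -> bool {
    // Tiny xorshift-style PRNG so the sketch has no external dependencies.
    let mut state = seed.max(1);
    let mut next = || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };

    // Each peer is faulty with probability `fault_percent`%.
    let peers: Vec<bool> = (0..num_peers)
        .map(|_| next() % 100 < fault_percent as u64)
        .collect();

    // Cycle through peers in a pseudo-random order until one serves the column.
    (0..num_peers).any(|i| {
        let idx = ((next() % num_peers as u64) as usize + i) % num_peers;
        !peers[idx] // a non-faulty peer serves the column
    })
}

fn main() {
    for fault_percent in [0u32, 25, 50, 90] {
        let successes = (0..1000u64)
            .filter(|&seed| column_resolves(8, fault_percent, seed + 1))
            .count();
        println!("fault {fault_percent}% -> resolved in {successes}/1000 runs");
    }
}
```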

Future TODOs

  • Ensure uniform use of subscribe_all_data_column_subnets to determine is_supernode
  • Ensure consistent naming in sync network context handlers on_xxx_response, suggestion -> Downscore and retry custody failures #7510 (comment)
  • Cleanup add_peer_* functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
  • Track requests_per_peer to ensure that eventually we fetch data from all peers in the chain

@dapplion dapplion requested a review from jxs as a code owner May 22, 2025 05:17
@dapplion dapplion added work-in-progress PR is a work-in-progress das Data Availability Sampling labels May 22, 2025
@@ -545,6 +545,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
@dapplion (Collaborator, Author) commented May 22, 2025:
Every interval (15 sec) we call continue_custody_by_range / continue_custody_by_root requests, which will cause a request to error if it has been alive for too long. This lets requests avoid erroring immediately when they do not yet have enough custody peers.
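
For illustration, a minimal sketch of this "idle until expiry" pattern, with assumed names (`CustodyByRangeRequest`, `MAX_IDLE`, `ContinueResult`) rather than the actual Lighthouse types:

```rust
// Sketch of the interval-driven continuation: a request without custody
// peers stays idle, and only errors once it has been alive too long.
use std::time::{Duration, Instant};

const MAX_IDLE: Duration = Duration::from_secs(60); // assumed idle budget

struct CustodyByRangeRequest {
    created: Instant,
    custody_peers: Vec<u64>, // placeholder for PeerId
}

enum ContinueResult {
    /// Request progressed or is still waiting for peers.
    Pending,
    /// Request has been idle too long without enough custody peers.
    Expired,
}

impl CustodyByRangeRequest {
    /// Called on every sync interval tick (and when a new peer connects).
    fn continue_request(&mut self) -> ContinueResult {
        if !self.custody_peers.is_empty() {
            // ... issue / re-issue the underlying RPCs here ...
            return ContinueResult::Pending;
        }
        if self.created.elapsed() > MAX_IDLE {
            // Only now do we surface an error, instead of failing on send.
            ContinueResult::Expired
        } else {
            // Stay idle and wait for custody peers to join.
            ContinueResult::Pending
        }
    }
}
```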

@@ -442,6 +439,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
@dapplion (Collaborator, Author):

Every time a peer joins, attempt to progress custody_by_root and custody_by_range requests
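
A self-contained sketch of how this hook plugs into the sync manager, mirroring the diff above; `SyncManager`, `SyncNetworkContext`, and `CustodyResult` are simplified stand-ins for the real types:

```rust
// On peer connection, re-drive both kinds of custody requests so that
// requests parked for lack of peers can progress immediately rather than
// waiting for the next interval tick.
struct SyncNetworkContext;

#[derive(Debug)]
enum CustodyResult {
    Progressed,
    Expired,
}

impl SyncNetworkContext {
    fn continue_custody_by_root_requests(&mut self) -> Vec<(u64, CustodyResult)> {
        Vec::new() // placeholder: re-attempt sends for each pending request
    }
    fn continue_custody_by_range_requests(&mut self) -> Vec<(u64, CustodyResult)> {
        Vec::new() // placeholder
    }
}

struct SyncManager {
    network: SyncNetworkContext,
}

impl SyncManager {
    fn on_peer_connected(&mut self) {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
    }
    fn on_custody_by_root_result(&mut self, _id: u64, _result: CustodyResult) {}
    fn on_custody_by_range_result(&mut self, _id: u64, _result: CustodyResult) {}
}
```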

@@ -3,7 +3,6 @@
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
@dapplion (Collaborator, Author):

Logic moved to beacon_node/network/src/sync/network_context/block_components_by_range.rs

};

pub mod custody;
@dapplion (Collaborator, Author):

Renamed existing custody module to custody_by_root and added a new one custody_by_range

}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
@dapplion (Collaborator, Author):

Changed this signature for tests, to have access to all active RPC requests
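
A sketch of why the `(SyncRequestId, &PeerId)` shape is convenient in tests; all types here are simplified stand-ins, not the real Lighthouse ones:

```rust
// The test harness can iterate every in-flight RPC request and assert on
// which peers were chosen.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SyncRequestId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);

struct SyncNetworkContext {
    active: Vec<(SyncRequestId, PeerId)>,
}

impl SyncNetworkContext {
    /// Returns the ids of all active requests together with the peer each
    /// was sent to (mirrors the signature change above).
    fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
        self.active.iter().map(|(id, peer)| (*id, peer))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn can_inspect_all_active_rpc_requests() {
        let mut ctx = SyncNetworkContext {
            active: vec![(SyncRequestId(1), PeerId(7)), (SyncRequestId(2), PeerId(8))],
        };
        let peers: Vec<PeerId> = ctx.active_requests().map(|(_, p)| *p).collect();
        assert_eq!(peers.len(), 2);
    }
}
```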


Ok((requests, column_to_peer_map))
})
.transpose()?;
@dapplion (Collaborator, Author):

network_context no longer spawns _by_range requests, this logic is now inside BlockComponentsByRangeRequest

blocks_by_range_request:
ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
},
@dapplion (Collaborator, Author) commented May 22, 2025:

Maintains the same behaviour for mainnet:

  • deneb: issue blocks + blobs requests at the same time
  • fulu: issue blocks request first, then columns
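
A minimal sketch of this fork-dependent ordering, under the assumption that on Fulu the column request is issued once the blocks arrive; `ForkName`, the state enum, and the `issue_*` stubs are illustrative, not the PR's actual code:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum ForkName {
    Deneb,
    Fulu,
}

enum ByRangeState {
    /// Deneb: blocks and blobs requests in flight at the same time.
    BlocksAndBlobs,
    /// Fulu: blocks requested first, columns not yet issued.
    BlocksFirst,
    /// Fulu: blocks received, columns now in flight.
    Columns,
}

fn start_by_range_request(fork: ForkName) -> ByRangeState {
    match fork {
        ForkName::Deneb => {
            issue_blocks_by_range();
            issue_blobs_by_range();
            ByRangeState::BlocksAndBlobs
        }
        ForkName::Fulu => {
            // In this sketch, columns are requested once blocks arrive
            // (one reading of "blocks first, then columns").
            issue_blocks_by_range();
            ByRangeState::BlocksFirst
        }
    }
}

fn on_blocks_received(state: ByRangeState) -> ByRangeState {
    match state {
        ByRangeState::BlocksFirst => {
            issue_data_columns_by_range();
            ByRangeState::Columns
        }
        other => other,
    }
}

// Stubs standing in for the real request-sending machinery.
fn issue_blocks_by_range() {}
fn issue_blobs_by_range() {}
fn issue_data_columns_by_range() {}
```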

}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
/// No peer available matching the required criteria
NoPeer(NoPeerError),
@dapplion (Collaborator, Author):

Requests no longer error on send if they don't have peers. Instead, custody_by_root and custody_by_range requests are left idle for some time, waiting for peers to appear.
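
A sketch of this "park instead of fail" send path; apart from `RpcRequestSendError::NoPeer` (shown in the diff above), the names are assumptions:

```rust
// When no custody peer is available the request is kept idle rather than
// returning `RpcRequestSendError::NoPeer` immediately.
#[derive(Debug, PartialEq, Eq)]
enum NoPeerError {
    NoCustodyPeers,
}

#[derive(Debug, PartialEq, Eq)]
enum RpcRequestSendError {
    /// No peer available matching the required criteria.
    NoPeer(NoPeerError),
}

enum SendOutcome {
    /// RPC issued to a peer.
    Sent,
    /// No suitable peer right now; keep the request idle and retry on the
    /// next `continue_*` tick or when a new peer connects.
    Parked,
}

fn try_send_custody_request(custody_peers: &[u64]) -> Result<SendOutcome, RpcRequestSendError> {
    match custody_peers.first() {
        Some(_peer) => {
            // ... actually send the RPC here ...
            Ok(SendOutcome::Sent)
        }
        // Previously this would have been
        // `Err(RpcRequestSendError::NoPeer(NoPeerError::NoCustodyPeers))`;
        // now the request simply waits.
        None => Ok(SendOutcome::Parked),
    }
}
```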

@jimmygchen (Member) left a comment:

@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.

@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label May 26, 2025
@@ -221,6 +256,12 @@ impl<T: BeaconChainTypes> SyncingChain<T> {
request_id: Id,
blocks: Vec<RpcBlock<T::EthSpec>>,
) -> ProcessingResult {
// Account for one more request to this peer
// TODO(das): this code assumes that we do a single request per peer per RpcBlock
@jimmygchen (Member) commented:

This would be true in the below case:

peerA: block
peerB: col_1, col_2
peerC: col_2, col_3,

we get 1 req for each peer.

BUT for this below scenario:

peerA: block, col_5
peerB: col_1, col_2
peerC: col_2, col_3,

we still get 1 for each peer, which isn't fully correct.

Or we could just use BatchPeers and count the block peer separately and correctly.

Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
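
A toy worked example of the two counting schemes, using the second scenario above; peer names and helpers are illustrative only:

```rust
use std::collections::HashMap;

/// Scheme 1: one request counted per contributing peer per RpcBlock,
/// regardless of whether the peer served the block, columns, or both.
fn one_per_contributing_peer<'a>(
    block_peer: &'a str,
    column_request_peers: &[&'a str],
) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::new();
    counts.insert(block_peer, 1);
    for p in column_request_peers {
        counts.entry(*p).or_insert(1);
    }
    counts
}

/// Scheme 2 ("count the block peer separately"): the block request and the
/// column request are counted independently, so a peer serving both gets 2.
fn block_and_columns_counted_separately<'a>(
    block_peer: &'a str,
    column_request_peers: &[&'a str],
) -> HashMap<&'a str, u32> {
    let mut counts: HashMap<&str, u32> = HashMap::new();
    *counts.entry(block_peer).or_insert(0) += 1;
    for p in column_request_peers {
        *counts.entry(*p).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // peerA serves the block and col_5; peerB and peerC each get one column request.
    let column_request_peers = ["peerA", "peerB", "peerC"];
    println!("{:?}", one_per_contributing_peer("peerA", &column_request_peers)); // peerA = 1
    println!("{:?}", block_and_columns_counted_separately("peerA", &column_request_peers)); // peerA = 2
}
```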

batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
@jimmygchen (Member) commented:

We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.

@dapplion (Collaborator, Author) replied:

After discussing with @pawanjay176 we decided not to track download failures. We rely on randomness to select a different peer, as we expect the syncing chains to have a decent number of peers. For the custody-by-range requests there's an internal failed_peers set that de-prioritizes peers with network errors. The failed_peers set at the batch level is still used to track peers that sent blocks or columns that failed processing, and to ensure that we retry from a different peer.
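
A sketch of this selection behaviour, with assumed types: pick randomly among peers not in `failed_peers`, and only fall back to failed peers when no others remain:

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);

/// Pick a peer for the next custody-by-range attempt. `rng_index` stands in
/// for the randomness source used to spread load across the chain's peers.
fn select_custody_peer(
    candidates: &[PeerId],
    failed_peers: &HashSet<PeerId>,
    rng_index: usize,
) -> Option<PeerId> {
    let healthy: Vec<PeerId> = candidates
        .iter()
        .copied()
        .filter(|p| !failed_peers.contains(p))
        .collect();
    if !healthy.is_empty() {
        // De-prioritize failed peers: pick randomly among the healthy ones.
        return Some(healthy[rng_index % healthy.len()]);
    }
    // Every candidate has failed at least once; retry one rather than stall.
    candidates.get(rng_index % candidates.len().max(1)).copied()
}
```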

@jimmygchen (Member) commented May 27, 2025:

I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling with reviewing this in its current state, with potential missing pieces and a bunch of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, intentions, and what the working solution would eventually look like.

I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once this is complete. What do you think?

@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jun 3, 2025
@jimmygchen jimmygchen self-assigned this Jun 3, 2025
@jimmygchen (Member) commented Jun 3, 2025:

I've just merged latest peerdas-devnet-7 into this branch and CI should pass now. Will do another round of review tomorrow.

I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Jun 3, 2025
@jimmygchen (Member) commented Jun 3, 2025:

Looks like it's struggling with backfill - the link below is a test run on sepolia - both nodes transitioned to Synced without completing backfill. The logs are also available on the CI workflow:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155087

On PeerDAS, the supernode struggles to sync to head, but the fullnode did complete backfilling 1000 slots in 506 seconds:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155070

Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

@jimmygchen jimmygchen removed their assignment Jun 3, 2025
@sigp sigp deleted a comment from mergify bot Jun 3, 2025
@jimmygchen (Member) commented:
@dapplion Would you mind taking a look at the failing CI tests in the PR?

@pawanjay176 (Member) commented:
Not sure if this is ready for review? It's crashing on mainnet with

Jun 16 10:40:58.395 CRIT  Task panic. This is a bug!                    location: "/Users/pawan/ethereum/lighthouse/beacon_node/network/src/sync/network_context/block_components_by_range.
rs:461:18", msg_id: "TODO: don't do matching here: MissingBlobs", backtrace:    0: std::backtrace::Backtrace::create

Labels: das (Data Availability Sampling), syncing, under-review (A reviewer has only partially completed a review), waiting-on-author (The reviewer has suggested changes and awaits their implementation)
3 participants