
Downscore and retry custody failures #7510


Open · wants to merge 29 commits into base: peerdas-devnet-7

Conversation

@dapplion (Collaborator) commented May 22, 2025

Issue Addressed

Partially addresses

Proposed Changes

TBD

1000 lines are for new unit tests :)

Questions / TODO

  • Should custody_by_range requests try all possible peers before giving up? i.e. should they ignore the failures counter for custody failures? Add a test for this behaviour.

Tests to add

  • Peer does not send columns at all on a specific index
  • We find no one serving columns on a specific index; how to recover
  • Random tests where most peers are faulty and we have to keep cycling them. Use randomness and run the tests with different levels of fault % => build a simulator (see the sketch after this list)
  • Permanently fail a single column many times, but then resolve the rest such that we can reconstruct.
  • Send non-matching data columns
  • Test downscoring of custody failures
  • Test the fallback mechanism of peer selection
  • Test a syncing chain reaching 0 peers, both for forward sync and backwards sync
  • Test forwards sync on Fulu with blocks without data
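
A minimal sketch of the randomized-fault simulator idea mentioned in the list above; the peer model, PRNG, and retry loop are assumptions for illustration, not the PR's test harness:

```rust
// Minimal randomized-fault "simulator" sketch: a fraction of peers never
// serve a requested column, and the test keeps cycling peers until the
// column resolves or every peer has been tried. Deterministic seeding keeps
// each run reproducible across fault percentages.
fn column_resolves(num_peers: usize, fault_percent: u32, seed: u64) -> bool {
    // Tiny xorshift-style PRNG so the sketch has no external dependencies.
    let mut state = seed.max(1);
    let mut next = || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };

    // Each peer is faulty with probability `fault_percent`%.
    let peers: Vec<bool> = (0..num_peers)
        .map(|_| next() % 100 < fault_percent as u64)
        .collect();

    // Cycle through peers in a pseudo-random order until one serves the column.
    (0..num_peers).any(|i| {
        let idx = ((next() % num_peers as u64) as usize + i) % num_peers;
        !peers[idx] // a non-faulty peer serves the column
    })
}

fn main() {
    for fault_percent in [0u32, 25, 50, 90] {
        let successes = (0..1000u64)
            .filter(|&seed| column_resolves(8, fault_percent, seed + 1))
            .count();
        println!("fault {fault_percent}% -> resolved in {successes}/1000 runs");
    }
}
```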

Future TODOs

  • Ensure uniform use of subscribe_all_data_column_subnets to determine is_supernode
  • Ensure consistent naming in sync network context handlers on_xxx_response, suggestion -> Downscore and retry custody failures #7510 (comment)
  • Cleanup add_peer_* functions in sync tests. Best to do in another PR as it touches a lot of lookup sync tests (unrelated)
  • Track requests_per_peer to ensure that eventually we fetch data from all peers in the chain

@dapplion dapplion requested a review from jxs as a code owner May 22, 2025 05:17
@dapplion dapplion added work-in-progress PR is a work-in-progress das Data Availability Sampling labels May 22, 2025
@@ -545,6 +545,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
@dapplion (Collaborator, Author) commented May 22, 2025:
Every interval (15 sec) we call continue_custody_by_range / continue_custody_by_root requests, which will cause a request to error if it has been alive for too long. This lets requests avoid erroring immediately when they do not yet have enough custody peers.
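
For illustration, a minimal sketch of this "idle until expiry" pattern, with assumed names (`CustodyByRangeRequest`, `MAX_IDLE`, `ContinueResult`) rather than the actual Lighthouse types:

```rust
// Sketch of the interval-driven continuation: a request without custody
// peers stays idle, and only errors once it has been alive too long.
use std::time::{Duration, Instant};

const MAX_IDLE: Duration = Duration::from_secs(60); // assumed idle budget

struct CustodyByRangeRequest {
    created: Instant,
    custody_peers: Vec<u64>, // placeholder for PeerId
}

enum ContinueResult {
    /// Request progressed or is still waiting for peers.
    Pending,
    /// Request has been idle too long without enough custody peers.
    Expired,
}

impl CustodyByRangeRequest {
    /// Called on every sync interval tick (and when a new peer connects).
    fn continue_request(&mut self) -> ContinueResult {
        if !self.custody_peers.is_empty() {
            // ... issue / re-issue the underlying RPCs here ...
            return ContinueResult::Pending;
        }
        if self.created.elapsed() > MAX_IDLE {
            // Only now do we surface an error, instead of failing on send.
            ContinueResult::Expired
        } else {
            // Stay idle and wait for custody peers to join.
            ContinueResult::Pending
        }
    }
}
```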

@@ -442,6 +439,9 @@ impl<T: BeaconChainTypes> SyncManager<T> {
for (id, result) in self.network.continue_custody_by_root_requests() {
self.on_custody_by_root_result(id, result);
}
for (id, result) in self.network.continue_custody_by_range_requests() {
self.on_custody_by_range_result(id, result);
}
@dapplion (Collaborator, Author):

Every time a peer joins, attempt to progress custody_by_root and custody_by_range requests
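
A self-contained sketch of how this hook plugs into the sync manager, mirroring the diff above; `SyncManager`, `SyncNetworkContext`, and `CustodyResult` are simplified stand-ins for the real types:

```rust
// On peer connection, re-drive both kinds of custody requests so that
// requests parked for lack of peers can progress immediately rather than
// waiting for the next interval tick.
struct SyncNetworkContext;

#[derive(Debug)]
enum CustodyResult {
    Progressed,
    Expired,
}

impl SyncNetworkContext {
    fn continue_custody_by_root_requests(&mut self) -> Vec<(u64, CustodyResult)> {
        Vec::new() // placeholder: re-attempt sends for each pending request
    }
    fn continue_custody_by_range_requests(&mut self) -> Vec<(u64, CustodyResult)> {
        Vec::new() // placeholder
    }
}

struct SyncManager {
    network: SyncNetworkContext,
}

impl SyncManager {
    fn on_peer_connected(&mut self) {
        for (id, result) in self.network.continue_custody_by_root_requests() {
            self.on_custody_by_root_result(id, result);
        }
        for (id, result) in self.network.continue_custody_by_range_requests() {
            self.on_custody_by_range_result(id, result);
        }
    }
    fn on_custody_by_root_result(&mut self, _id: u64, _result: CustodyResult) {}
    fn on_custody_by_range_result(&mut self, _id: u64, _result: CustodyResult) {}
}
```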

@@ -3,7 +3,6 @@
//! Stores the various syncing methods for the beacon chain.
mod backfill_sync;
mod block_lookups;
mod block_sidecar_coupling;
@dapplion (Collaborator, Author):

Logic moved to beacon_node/network/src/sync/network_context/block_components_by_range.rs

};

pub mod custody;
@dapplion (Collaborator, Author):

Renamed existing custody module to custody_by_root and added a new one custody_by_range

}

/// Returns the ids of all active requests
pub fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
@dapplion (Collaborator, Author):

Changed this signature for tests, to have access to all active RPC requests
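
A sketch of why the `(SyncRequestId, &PeerId)` shape is convenient in tests; all types here are simplified stand-ins, not the real Lighthouse ones:

```rust
// The test harness can iterate every in-flight RPC request and assert on
// which peers were chosen.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SyncRequestId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);

struct SyncNetworkContext {
    active: Vec<(SyncRequestId, PeerId)>,
}

impl SyncNetworkContext {
    /// Returns the ids of all active requests together with the peer each
    /// was sent to (mirrors the signature change above).
    fn active_requests(&mut self) -> impl Iterator<Item = (SyncRequestId, &PeerId)> {
        self.active.iter().map(|(id, peer)| (*id, peer))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn can_inspect_all_active_rpc_requests() {
        let mut ctx = SyncNetworkContext {
            active: vec![(SyncRequestId(1), PeerId(7)), (SyncRequestId(2), PeerId(8))],
        };
        let peers: Vec<PeerId> = ctx.active_requests().map(|(_, p)| *p).collect();
        assert_eq!(peers.len(), 2);
    }
}
```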


Ok((requests, column_to_peer_map))
})
.transpose()?;
@dapplion (Collaborator, Author):

network_context no longer spawns _by_range requests, this logic is now inside BlockComponentsByRangeRequest

blocks_by_range_request:
ByRangeRequest<BlocksByRangeRequestId, Vec<Arc<SignedBeaconBlock<E>>>>,
blobs_by_range_request: ByRangeRequest<BlobsByRangeRequestId, Vec<Arc<BlobSidecar<E>>>>,
},
@dapplion (Collaborator, Author) commented May 22, 2025:

Maintains the same behaviour for mainnet:

  • deneb: issue blocks + blobs requests at the same time
  • fulu: issue blocks request first, then columns
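
A minimal sketch of this fork-dependent ordering, under the assumption that on Fulu the column request is issued once the blocks arrive; `ForkName`, the state enum, and the `issue_*` stubs are illustrative, not the PR's actual code:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum ForkName {
    Deneb,
    Fulu,
}

enum ByRangeState {
    /// Deneb: blocks and blobs requests in flight at the same time.
    BlocksAndBlobs,
    /// Fulu: blocks requested first, columns not yet issued.
    BlocksFirst,
    /// Fulu: blocks received, columns now in flight.
    Columns,
}

fn start_by_range_request(fork: ForkName) -> ByRangeState {
    match fork {
        ForkName::Deneb => {
            issue_blocks_by_range();
            issue_blobs_by_range();
            ByRangeState::BlocksAndBlobs
        }
        ForkName::Fulu => {
            // In this sketch, columns are requested once blocks arrive
            // (one reading of "blocks first, then columns").
            issue_blocks_by_range();
            ByRangeState::BlocksFirst
        }
    }
}

fn on_blocks_received(state: ByRangeState) -> ByRangeState {
    match state {
        ByRangeState::BlocksFirst => {
            issue_data_columns_by_range();
            ByRangeState::Columns
        }
        other => other,
    }
}

// Stubs standing in for the real request-sending machinery.
fn issue_blocks_by_range() {}
fn issue_blobs_by_range() {}
fn issue_data_columns_by_range() {}
```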

}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcRequestSendError {
/// No peer available matching the required criteria
NoPeer(NoPeerError),
@dapplion (Collaborator, Author):

Requests no longer error on send if they don't have peers. Instead, custody_by_root and custody_by_range requests are left idle for some time, waiting for peers to appear.
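
A sketch of this "park instead of fail" send path; apart from `RpcRequestSendError::NoPeer` (shown in the diff above), the names are assumptions:

```rust
// When no custody peer is available the request is kept idle rather than
// returning `RpcRequestSendError::NoPeer` immediately.
#[derive(Debug, PartialEq, Eq)]
enum NoPeerError {
    NoCustodyPeers,
}

#[derive(Debug, PartialEq, Eq)]
enum RpcRequestSendError {
    /// No peer available matching the required criteria.
    NoPeer(NoPeerError),
}

enum SendOutcome {
    /// RPC issued to a peer.
    Sent,
    /// No suitable peer right now; keep the request idle and retry on the
    /// next `continue_*` tick or when a new peer connects.
    Parked,
}

fn try_send_custody_request(custody_peers: &[u64]) -> Result<SendOutcome, RpcRequestSendError> {
    match custody_peers.first() {
        Some(_peer) => {
            // ... actually send the RPC here ...
            Ok(SendOutcome::Sent)
        }
        // Previously this would have been
        // `Err(RpcRequestSendError::NoPeer(NoPeerError::NoCustodyPeers))`;
        // now the request simply waits.
        None => Ok(SendOutcome::Parked),
    }
}
```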

@jimmygchen (Member) left a comment:

@dapplion I've started reviewing this but haven't got to the meat of the changes - will continue tomorrow but I've submitted my comments so far.

@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label May 26, 2025
@@ -221,6 +256,12 @@ impl<T: BeaconChainTypes> SyncingChain<T> {
request_id: Id,
blocks: Vec<RpcBlock<T::EthSpec>>,
) -> ProcessingResult {
// Account for one more request to this peer
// TODO(das): this code assumes that we do a single request per peer per RpcBlock
@jimmygchen (Member) commented:

This would be true in the below case:

peerA: block
peerB: col_1, col_2
peerC: col_2, col_3,

we get 1 req for each peer.

BUT for this below scenario:

peerA: block, col_5
peerB: col_1, col_2
peerC: col_2, col_3,

we still get 1 for each peer, which isn't fully correct.

Or we could just use BatchPeers and count the block peer separately and correctly.

Either way I don't think it makes a big difference - I think we can get rid of this TODO and document the assumptions.
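
A toy worked example of the two counting schemes, using the second scenario above; peer names and helpers are illustrative only:

```rust
use std::collections::HashMap;

/// Scheme 1: one request counted per contributing peer per RpcBlock,
/// regardless of whether the peer served the block, columns, or both.
fn one_per_contributing_peer<'a>(
    block_peer: &'a str,
    column_request_peers: &[&'a str],
) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::new();
    counts.insert(block_peer, 1);
    for p in column_request_peers {
        counts.entry(*p).or_insert(1);
    }
    counts
}

/// Scheme 2 ("count the block peer separately"): the block request and the
/// column request are counted independently, so a peer serving both gets 2.
fn block_and_columns_counted_separately<'a>(
    block_peer: &'a str,
    column_request_peers: &[&'a str],
) -> HashMap<&'a str, u32> {
    let mut counts: HashMap<&str, u32> = HashMap::new();
    *counts.entry(block_peer).or_insert(0) += 1;
    for p in column_request_peers {
        *counts.entry(*p).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // peerA serves the block and col_5; peerB and peerC each get one column request.
    let column_request_peers = ["peerA", "peerB", "peerC"];
    println!("{:?}", one_per_contributing_peer("peerA", &column_request_peers)); // peerA = 1
    println!("{:?}", block_and_columns_counted_separately("peerA", &column_request_peers)); // peerA = 2
}
```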

batch.download_failed(Some(*peer_id))?
// TODO(das): Is it necessary for the batch to track failed peers? Can we make this
// mechanism compatible with PeerDAS and before PeerDAS?
batch.download_failed(None)?
@jimmygchen (Member) commented:

We still need the peer to track failed block peers, and this will break the current peer prioritisation for blocks and blobs. I think this needs a bit more thought.

@dapplion (Collaborator, Author) replied:

After discussing with @pawanjay176 we decided not to track download failures. We rely on randomness to select a different peer, as we expect the syncing chains to have a decent number of peers. For the custody-by-range requests there's an internal failed_peers set that de-prioritizes peers with network errors. The failed_peers set at the batch level is still used to track peers that sent blocks or columns that failed processing, and to ensure that we retry from a different peer.
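
A sketch of this selection behaviour, with assumed types: pick randomly among peers not in `failed_peers`, and only fall back to failed peers when no others remain:

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(u64);

/// Pick a peer for the next custody-by-range attempt. `rng_index` stands in
/// for the randomness source used to spread load across the chain's peers.
fn select_custody_peer(
    candidates: &[PeerId],
    failed_peers: &HashSet<PeerId>,
    rng_index: usize,
) -> Option<PeerId> {
    let healthy: Vec<PeerId> = candidates
        .iter()
        .copied()
        .filter(|p| !failed_peers.contains(p))
        .collect();
    if !healthy.is_empty() {
        // De-prioritize failed peers: pick randomly among the healthy ones.
        return Some(healthy[rng_index % healthy.len()]);
    }
    // Every candidate has failed at least once; retry one rather than stall.
    candidates.get(rng_index % candidates.len().max(1)).copied()
}
```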

@jimmygchen (Member) commented May 27, 2025:

I've added a few more comments. I've spent quite a bit of time reading, but I'm really struggling with reviewing this in its current state, with potential missing pieces and a bunch of outstanding TODOs. I find it quite difficult to understand the changes, assumptions, intentions, and what the working solution would eventually look like.

I think it might be useful to go through the plan and code together with @pawanjay176, or re-review this once this is complete. What do you think?

@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jun 3, 2025
@jimmygchen jimmygchen self-assigned this Jun 3, 2025
@jimmygchen (Member) commented Jun 3, 2025:

I've just merged latest peerdas-devnet-7 into this branch and CI should pass now. Will do another round of review tomorrow.

I've triggered a CI workflow to run the latest checkpoint sync test on this PR: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Jun 3, 2025
@jimmygchen (Member) commented Jun 3, 2025:

Looks like it's struggling with backfill - the link below is a test run on sepolia - both nodes transitioned to Synced without completing backfill. The logs are also available on the CI workflow:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155087

On PeerDAS, the supernode struggles to sync to head, but the fullnode did complete backfilling 1000 slots in 506 seconds:
https://github.com/jimmygchen/lighthouse/actions/runs/15412363885/job/43368155070

Logs are available on the workflow summary page: https://github.com/jimmygchen/lighthouse/actions/runs/15412363885

@jimmygchen jimmygchen removed their assignment Jun 3, 2025
@sigp sigp deleted a comment from mergify bot Jun 3, 2025
@jimmygchen (Member) commented:
@dapplion Would you mind taking a look at the failing CI tests in the PR?

@pawanjay176 (Member) commented:
Not sure if this is ready for review? It's crashing on mainnet with

Jun 16 10:40:58.395 CRIT  Task panic. This is a bug!                    location: "/Users/pawan/ethereum/lighthouse/beacon_node/network/src/sync/network_context/block_components_by_range.
rs:461:18", msg_id: "TODO: don't do matching here: MissingBlobs", backtrace:    0: std::backtrace::Backtrace::create

Labels: das (Data Availability Sampling), syncing, under-review (A reviewer has only partially completed a review), waiting-on-author (The reviewer has suggested changes and awaits their implementation)
3 participants