chore: account for state availability when fetching sync committees #7435
Closed
Ayushdubey86 wants to merge 94 commits into sigp:stable from Ayushdubey86:chore-Account-for-state-availability-when-fetching-sync-committees
Conversation
Currently we have very poor unit test coverage of range sync. With the event-driven test framework we could cover much more ground and be confident when modifying the code. Add two basic cases:
- Happy path: complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later find enough

⚠️ If you have ideas for more test cases, please let me know! I'll write them.
Address misc PeerDAS TODOs that are not big enough for a dedicated PR. I'll justify each TODO in an inline comment.
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is:
- `sync_peers_per_column_subnet`: tracks the health of all column subnets from your node's point of view
- `sync_peers_per_custody_column_subnet`: tracks the health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, since if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
PR sigp#6497 made some consistency checks inside the batch obsolete, but I forgot to remove the consumers of those errors. Remove the unused batch sync error condition, which was a nested `Result<_, Result<_, E>>`.
Addresses sigp#6854. PeerDAS requires unsubscribing from a gossip topic at a fork boundary. This is not possible with our current topic machinery. Instead of defining which topics have to be **added** at a given fork, we define the complete set of topics at a given fork. The new star of the show and key function is:

```rust
pub fn core_topics_to_subscribe<E: EthSpec>(
    fork_name: ForkName,
    opts: &TopicConfig,
    spec: &ChainSpec,
) -> Vec<GossipKind> {
    // ...
    if fork_name.deneb_enabled() && !fork_name.fulu_enabled() {
        // All of the Deneb blob topics are core topics
        for i in 0..spec.blob_sidecar_subnet_count(fork_name) {
            topics.push(GossipKind::BlobSidecar(i));
        }
    }
    // ...
}
```

`core_topics_to_subscribe` only returns the blob topics if `fork < Fulu`. Then, at the fork boundary, we subscribe with the new fork digest to `core_topics_to_subscribe(next_fork)`, which excludes the blob topics. I added `is_fork_non_core_topic` to carry the aggregator topics for attestations and sync committee messages over to the next fork. This approach is future-proof if those topics ever become fork-dependent.

Closes sigp#6854
We don't need to store `BehaviourAction` for `ready_requests`, which lets us avoid the `unreachable!` in sigp#6625. This PR should therefore be merged before it.
Enable blobs by default in Deneb runs of Kurtosis.
I keep being notified for PRs like sigp#7009 that don't touch the directories specified in the `CODEOWNERS` file. After reading the [docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners), not having a forward slash at the beginning of the path means:

> In this example, @octocat owns any file in an apps directory anywhere in your repository.

whereas with the slashes:

> In this example, @doctocat owns any file in the `/docs` directory in the root of your repository and any of its subdirectories.

This update makes the paths under the jxs ownership more strict.
PeerDAS has undergone multiple refactors, and blending it with the get_blobs optimization has generated technical debt. A function signature like this https://github.com/sigp/lighthouse/blob/f008b84079bbb6eb86de22bb3421dfc8263a5650/beacon_node/beacon_chain/src/beacon_chain.rs#L7171-L7178 allows at least the following combination of states:
- blobs: Some / None
- data_columns: Some / None
- data_column_recv: Some / None
- Block has data? Yes / No
- Block post-PeerDAS? Yes / No

In reality, we don't have that many possible states, only:
- `NoData`: pre-Deneb, pre-PeerDAS with 0 blobs, or post-PeerDAS with 0 blobs
- `Blobs(BlobSidecarList<E>)`: post-Deneb, pre-PeerDAS with > 0 blobs
- `DataColumns(DataColumnSidecarList<E>)`: post-PeerDAS with > 0 blobs
- `DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList<E>>)`: post-PeerDAS with > 0 blobs, but we obtained the columns via reconstruction

These are the variants of the new `AvailableBlockData` enum. So we go from 2^5 states to 4 well-defined ones. Downstream code benefits nicely from this clarity and I think it makes the whole feature much more maintainable.

Currently `is_available` returns a bool, and then we construct the available block in `make_available`. In a way, the availability condition is duplicated in both functions. Instead, this PR constructs `AvailableBlockData` in `is_available`, so the availability conditions are written once:

```rust
if let Some(block_data) = is_available(..) {
    let available_block = make_available(block_data);
}
```
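The variant list above translates almost directly into a Rust enum. Below is a minimal, self-contained sketch of that shape: the variant names come from the description, but the payload types are simplified stand-ins rather than Lighthouse's real definitions, and the `is_available` body is elided.

```rust
use tokio::sync::oneshot;

// Stand-in aliases for the real Lighthouse types, only so the sketch compiles.
type BlobSidecarList = Vec<Vec<u8>>;
type DataColumnSidecarList = Vec<Vec<u8>>;

/// The four well-defined data states of an available block, as listed above.
enum AvailableBlockData {
    /// Pre-Deneb, or a post-Deneb / post-PeerDAS block with 0 blobs.
    NoData,
    /// Post-Deneb, pre-PeerDAS block with > 0 blobs.
    Blobs(BlobSidecarList),
    /// Post-PeerDAS block with > 0 blobs.
    DataColumns(DataColumnSidecarList),
    /// Post-PeerDAS block whose columns arrive later via reconstruction.
    DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList>),
}

/// `is_available` returns the constructed data (if available) instead of a bool,
/// so the availability condition lives in one place.
fn is_available(/* ... availability inputs elided ... */) -> Option<AvailableBlockData> {
    // Availability logic elided in this sketch.
    Some(AvailableBlockData::NoData)
}

fn main() {
    match is_available() {
        Some(AvailableBlockData::NoData) => println!("available, no data"),
        Some(_) => println!("available, with blobs or columns"),
        None => println!("not yet available"),
    }
}
```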
Partly addresses sigp#6959. Use the `enable_light_client_server` field from the beacon chain config in the HTTP API. I think we can make this the single source of truth, as the network crate also has access to the beacon chain config.
…ing (sigp#7030) Related to sigp#6880, an issue that's usually observed on local devnets with a small number of nodes. When testing range sync, I usually shut down a node for some period of time and restart it. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node synced, and it may attempt to produce a block if requested by a validator. On a local devnet, nodes frequently produce blocks, so when this happens the node ends up producing a block that would revert finality and gets disconnected from its peers immediately.

### Usage

Run the Lighthouse BN with this flag to override:
```
--sync-tolerance-epoch 0
```
N/A. Derive `ssz::Encode` and `ssz::Decode` on the `SignedValidatorRegistrationData` type for use in the builder.
NA. Bumps the `ethereum_ssz` version, along with other crates that share the dep. Primarily, this gives us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (sigp/ethereum_ssz#38). The validator count has increased massively since we set it at 32 bytes, so aggregation bitfields (et al.) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
This updates to a version with the [compute_cells](https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/polynomial-commitments-sampling.md#compute_cells) method, which allows you to extend a blob without creating a proof for it.
Delete the duplicate sync tolerance epoch config in the HTTP API, which is unused. We introduced the `sync-tolerance-epoch` flag in this PR:
- sigp#7030

Then refined it in this PR:
- sigp#7044

Somewhere in the merge of `release-v7.0.0` into `unstable`, the config from the original PR, which had been deleted, came back. I think I resolved those conflicts, so my bad.
We forked `gossipsub` into the Lighthouse repo some time ago so that we could iterate quicker on implementing back pressure and IDONTWANT. Meanwhile we have pushed all our changes upstream, and since we are now the main maintainers of `rust-libp2p`, this allows us to use upstream `gossipsub` again. Nonetheless, we still use our forked repo to give us the freedom to experiment with features before submitting them upstream.
Tracing Integration - [reference](https://github.com/eth-protocol-fellows/cohort-five/blob/5bbf1859e921065bd69f8671038ed16643465b86/projects/project-ideas.md?plain=1#L297)
- [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components

cc: @macladson @eserilev
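For context, here is a minimal sketch of the kind of `tracing` setup such a migration moves toward. It is not Lighthouse's actual initialization code; the env-filter default of "info" is an assumption.

```rust
use tracing_subscriber::{fmt, EnvFilter};

// Initialize a global `tracing` subscriber, the replacement for a slog-based
// logger. Reads the filter from RUST_LOG, defaulting to "info".
fn init_tracing() {
    fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();
}

fn main() {
    init_tracing();
    tracing::info!(component = "demo", "tracing subscriber initialized");
}
```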
Currently, range sync download errors are logged only as `error: rpc_error`, which isn't helpful. The actual error is suppressed unless logged somewhere else. Log the actual error that caused the batch download to fail as part of the log line stating that the batch download failed.
- sigp#6452 (partially)

Remove dependencies on `store` and `lighthouse_network` from `eth2`. This was achieved as follows:
- depend on `enr` and `multiaddr` directly instead of using `lighthouse_network`'s reexports.
- make `lighthouse_network` responsible for converting between API and internal types.
- in two cases, remove complex internal types and use the generic `serde_json::Value` instead. This is not ideal, but should be fine for now, as it affects two internal non-spec endpoints which are meant for debugging, are unstable, and are subject to change without notice anyway.

Inspired by sigp#6679. The alternative is to move all relevant types to `eth2` or `types` instead - what do you think?
I feel it's preferable to do this explicitly by updating the revision in `Cargo.toml` rather than implicitly by letting `Cargo.lock` control the revision of the branch.
It'll help us debug the issue with rolling file appender initialization (`Failed to initialize libp2p rolling file appender`) cc: @macladson
sigp#2573 Change release page display in dark theme.

Before: ![Before](https://user-images.githubusercontent.com/574696/132262479-c1e5c904-576f-4878-8a10-6012e0c51620.png)

After: ![After](https://github.com/user-attachments/assets/e6f42090-f9eb-4da7-9567-521124ea2f10)

Others stay unchanged.
Lighthouse currently lacks support for cross-compilation targeting the `riscv64` architecture. This PR introduces initial support for cross-compiling Lighthouse to `riscv64`. The following changes were made:
- **Makefile**: Updated to support `cross` with `riscv64` as a target.
- **Cross.toml**: Added configuration specific to `riscv64`.
- **Documentation**: List `build-riscv64` in `book/src/installation_cross_compiling.md`.
`spamoor_blob` was removed in ethpandaops/ethereum-package#972. When attempting to start a local testnet, it errors with:
```
Evaluation error: fail: Invalid additional_services spamoor_blob, allowed fields: ["assertoor", "broadcaster", "tx_fuzz", "custom_flood", "forkmon", "blockscout", "dora", "full_beaconchain_explorer", "prometheus_grafana", "blobscan", "dugtrio", "blutgang", "forky", "apache", "tracoor", "spamoor"]
```
This PR changes `spamoor_blob` to `spamoor`.
closes sigp#5785

The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The following sections detail the changes.

```mermaid
flowchart TD
    subgraph "*** After ***"
        Start2([START]) --> AA[Receive request]
        AA --> COND1{Is there already an active request <br> with the same protocol?}
        COND1 --> |Yes| CC[Send error response]
        CC --> End2([END])
        %% COND1 --> |No| COND2{Request is too large?}
        %% COND2 --> |Yes| CC
        COND1 --> |No| DD[Process request]
        DD --> EE{Rate limit reached?}
        EE --> |Yes| FF[Wait until tokens are regenerated]
        FF --> EE
        EE --> |No| GG[Send response]
        GG --> End2
    end
    subgraph "*** Before ***"
        Start([START]) --> A[Receive request]
        A --> B{Rate limit reached <br> or <br> request is too large?}
        B -->|Yes| C[Send error response]
        C --> End([END])
        B -->|No| E[Process request]
        E --> F[Send response]
        F --> End
    end
```

### `Is there already an active request with the same protocol?`

This check is not performed in `Before`. It is taken from the consensus-specs PR that proposes updates regarding rate limiting and response timeouts: https://github.com/ethereum/consensus-specs/pull/3767/files

> The requester MUST NOT make more than two concurrent requests with the same ID.

The spec PR mentions the requester side. In this PR, I introduced the `ActiveRequestsLimiter` for the `responder` side to restrict more than two requests from running simultaneously on the same protocol per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.

### `Rate limit reached?` and `Wait until tokens are regenerated`

UPDATE: I moved the limiter logic to the behaviour side. sigp#5923 (comment)

~~The rate limiter is shared between the behaviour and the handler (`Arc<Mutex<RateLimiter>>`). The handler checks the rate limit and queues the response if the limit is reached. The behaviour handles pruning.~~ ~~I considered not sharing the rate limiter between the behaviour and the handler, and performing all of this either within the behaviour or the handler. However, I decided against this for the following reasons:~~
- ~~Regarding performing everything within the behaviour: the behaviour is unable to recognize the response protocol when `RPC::send_response()` is called, especially when the response is `RPCCodedResponse::Error`. Therefore, the behaviour can't rate limit responses based on the response protocol.~~
- ~~Regarding performing everything within the handler: when multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔)~~
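To make the responder-side limiting idea concrete, here is a minimal, self-contained sketch. The types and method names beyond `ActiveRequestsLimiter` are stand-ins (not Lighthouse's actual API), and the limit of two mirrors the quoted spec text rather than the exact behaviour implemented in the PR.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real libp2p / Lighthouse types.
type PeerId = u64;
type Protocol = &'static str;

/// At most this many active inbound requests per (peer, protocol),
/// following the spec's "MUST NOT make more than two concurrent requests".
const MAX_CONCURRENT: usize = 2;

#[derive(Default)]
struct ActiveRequestsLimiter {
    active: HashMap<(PeerId, Protocol), usize>,
}

impl ActiveRequestsLimiter {
    /// Returns true if the request may proceed; false if the peer already has
    /// too many active requests on this protocol (send an error + penalize).
    fn try_add(&mut self, peer: PeerId, protocol: Protocol) -> bool {
        let count = self.active.entry((peer, protocol)).or_insert(0);
        if *count >= MAX_CONCURRENT {
            return false;
        }
        *count += 1;
        true
    }

    /// Call once the response stream for this request has completed.
    fn remove(&mut self, peer: PeerId, protocol: Protocol) {
        if let Some(count) = self.active.get_mut(&(peer, protocol)) {
            *count = count.saturating_sub(1);
            if *count == 0 {
                self.active.remove(&(peer, protocol));
            }
        }
    }
}

fn main() {
    let mut limiter = ActiveRequestsLimiter::default();
    assert!(limiter.try_add(1, "blocks_by_range"));
    assert!(limiter.try_add(1, "blocks_by_range"));
    // A third concurrent request on the same (peer, protocol) is rejected.
    assert!(!limiter.try_add(1, "blocks_by_range"));
    limiter.remove(1, "blocks_by_range");
    assert!(limiter.try_add(1, "blocks_by_range"));
}
```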
sigp#6746 Add a `--presign` flag to emit the JSON output to stdout instead of publishing the exit.
One piece of information in the consolidation section of the Lighthouse book is wrong. I realised this after reading https://ethereum.org/en/roadmap/pectra/maxeb/ and taking a further look at [EIP-7251](https://eips.ethereum.org/EIPS/eip-7251), which states:

> Note: the system contract uses the EVM CALLER operation (Solidity: msg.sender) to get the address used in the consolidation request, i.e. the address that calls the system contract must match the 0x01 withdrawal credential recorded in the beacon state.

So the withdrawal credentials of the source and target validators need not be the same.
Increases default gas limit to 36M.
Changes the endpoint to get fallback health information from `/lighthouse/ui/fallback_health` to `/lighthouse/beacon/health`. This more accurately reflects that the endpoint is related to the connected beacon nodes, and also matches the `/lighthouse/beacon/update` endpoint being added in sigp#6551. Adds documentation for both fallback health and the endpoint to the Lighthouse book.
When we perform data column gossip verification, we sometimes see multiple proposer shuffling cache misses simultaneously, which results in multiple threads computing the shuffling and potentially slows down gossip verification. The proposal here is to use a `OnceCell` for each shuffling key to make sure it's only computed once. I have only implemented this in data column verification as a PoC, but this can also be applied to blob and block verification. Related issues:
- sigp#4447
- sigp#7203
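A minimal, self-contained sketch of the per-key compute-once pattern described above. The key and value types are hypothetical stand-ins, and std's `OnceLock` is used in place of the `OnceCell` mentioned; this is not Lighthouse's actual shuffling cache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-ins for Lighthouse's shuffling key and shuffling value.
type ShufflingKey = (u64, [u8; 32]); // (epoch, block root)
type Shuffling = Vec<usize>;

#[derive(Default)]
struct ShufflingCache {
    // One cell per key: the first caller computes, concurrent callers block on
    // the same cell instead of recomputing the shuffling.
    cells: Mutex<HashMap<ShufflingKey, Arc<OnceLock<Shuffling>>>>,
}

impl ShufflingCache {
    fn get_or_compute<F>(&self, key: ShufflingKey, compute: F) -> Shuffling
    where
        F: FnOnce() -> Shuffling,
    {
        // Fetch (or insert) the cell, then release the map lock so an
        // expensive computation for one key doesn't block other keys.
        let cell = {
            let mut cells = self.cells.lock().unwrap();
            cells.entry(key).or_default().clone()
        };
        cell.get_or_init(compute).clone()
    }
}

fn main() {
    let cache = ShufflingCache::default();
    let key = (42, [0u8; 32]);
    // Both calls return the same value; `compute` runs at most once per key.
    let a = cache.get_or_compute(key, || vec![1, 2, 3]);
    let b = cache.get_or_compute(key, || unreachable!("already computed"));
    assert_eq!(a, b);
}
```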
Debugging a sync issue from @pawanjay176, I'm missing some key info: instead of logging the ID of the SyncingChain we just log "Finalized" (the sync type). This looks like a typo or something that was lost in translation when refactoring.

```
Apr 17 12:12:00.707 DEBUG Syncing new finalized chain chain: Finalized, component: "range_sync"
```

This log should include more info about the new chain but just logs "Finalized":

```
Apr 17 12:12:00.810 DEBUG New chain added to sync peer_id: "16Uiu2HAmHP8QLYQJwZ4cjMUEyRgxzpkJF87qPgNecLTpUdruYbdA", sync_type: Finalized, new_chain: Finalized, component: "range_sync"
```

- Remove the Display impl and log the ID explicitly for all logs.
- Log more details when creating a new SyncingChain.
- Re-opens sigp#6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers. Issues with current unstable:
- Peer prioritization counts batch requests per peer. This accounting is broken now: data columns by range requests are not accounted for.
- Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers.
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it better than just hard-failing everything.

Changes:
- [x] Upstream peer prioritization to the network context, which knows exactly how many active requests a peer has (including columns by range)
- [x] Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable

### TODOs
- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!
- Create trait `ValidatorStore` with all functions used by the `validator_services`
- Make `validator_services` generic over `S: ValidatorStore`
- Introduce `LighthouseValidatorStore`, which has identical functionality to the old `ValidatorStore`
- Remove dependencies (especially `environment`) from `validator_services` and `beacon_node_fallback` in order to be able to cleanly use them in Anchor
Prevent running `lighthouse vc --http-port <PORT>` without `--http`. Issue: sigp#7402. Added a `requires` constraint so that `--http-port` requires `--http` when running `lighthouse vc`. Implemented a test for this issue.
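For illustration, a minimal sketch of how such a constraint is typically expressed with clap. This is not Lighthouse's actual CLI definition; everything beyond the `--http` / `--http-port` flag names is an assumption.

```rust
use clap::{Arg, ArgAction, Command};

// Build a toy CLI where --http-port is rejected unless --http is also present.
fn cli() -> Command {
    Command::new("validator_client")
        .arg(
            Arg::new("http")
                .long("http")
                .action(ArgAction::SetTrue)
                .help("Enable the HTTP API server"),
        )
        .arg(
            Arg::new("http-port")
                .long("http-port")
                .value_name("PORT")
                .requires("http") // error out if --http-port is given without --http
                .help("Port for the HTTP API server"),
        )
}

fn main() {
    // Passing `--http-port 5062` alone produces a "requires --http" error.
    let matches = cli().get_matches();
    let _ = matches;
}
```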
…ip network (sigp#7409) Don't publish data columns reconstructed from RPC columns to the gossip network, as this may result in peer downscoring if we're sending columns from past slots.
Add a default request timeout to all `BeaconNodeHttpClient` requests to ensure that no HTTP request can hang indefinitely.
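For illustration, a sketch of the general approach: build the underlying HTTP client with a default timeout so every request inherits it. This is not Lighthouse's actual code, and the 30-second value is an assumed placeholder.

```rust
use std::time::Duration;

// Build an HTTP client whose requests all fail with a timeout error instead of
// hanging forever, unless a per-request timeout overrides the default.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // assumed default value
        .build()
}

fn main() {
    let client = build_client().expect("client should build");
    let _ = client;
}
```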
This PR adds transitions to Electra ~~and Fulu~~ fork epochs in the simulator tests. ~~It also covers blob inclusion verification and data column syncing on a full node in Fulu.~~ UPDATE: Remove fulu fork from sim tests due to sigp#7199 (comment)
Closes sigp#6895. We need sync to retry custody requests when a peer's CGC updates. A higher CGC can result in a data column subnet peer count increasing from 0 to 1, allowing requests to happen. Add a new sync event, `SyncMessage::UpdatedPeerCgc`. It's sent by the router when a metadata response updates the known CGC.
Beacon logs in the simulator are printed only to stdout. The logs are usually large, so persisting them would be helpful for debugging. Added a `--log-dir` parameter to the simulators and a step to upload the logs to Artifacts. (Update) Added `--disable-stdout-logging` to disable stdout logging, making the CI page cleaner.
Updates to the latest gossipsub revision, which includes libp2p/rust-libp2p#5868. cc @jimmygchen as I think you were interested in this feature.
This pull request has merge conflicts. Could you please resolve them @Ayushdubey86? 🙏
closes #7115
Completely forgot about this, do check this @michaelsproul
ref #7178