chore: account for state availability when fetching sync committees #7435
Closed
Ayushdubey86 wants to merge 94 commits into sigp:stable from Ayushdubey86:chore-Account-for-state-availability-when-fetching-sync-committees
Conversation
Currently we have very poor unit test coverage of range sync. With the event-driven test framework we could cover much more ground and be confident when modifying the code. Add two basic cases:
- Happy path: complete a finalized sync for 2 epochs
- Post-PeerDAS case where we start without enough custody peers and later find enough

⚠️ If you have ideas for more test cases, please let me know! I'll write them.
Address misc PeerDAS TODOs that are not big enough for a dedicated PR. I'll justify each TODO in an inline comment.
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR sigp#6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is:
- `sync_peers_per_column_subnet`: tracks the health of all column subnets from your node's point of view
- `sync_peers_per_custody_column_subnet`: tracks the health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, since if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
PR sigp#6497 made some consistency checks inside the batch obsolete, but I forgot to remove the consumers of those errors. Remove the unused batch sync error condition, which was a nested `Result<_, Result<_, E>>`.
Addresses sigp#6854. PeerDAS requires unsubscribing from a gossip topic at a fork boundary. This is not possible with our current topic machinery. Instead of defining which topics have to be **added** at a given fork, we define the complete set of topics at a given fork. The new star of the show and key function is:

```rust
pub fn core_topics_to_subscribe<E: EthSpec>(
    fork_name: ForkName,
    opts: &TopicConfig,
    spec: &ChainSpec,
) -> Vec<GossipKind> {
    // ...
    if fork_name.deneb_enabled() && !fork_name.fulu_enabled() {
        // All of the Deneb blob topics are core topics
        for i in 0..spec.blob_sidecar_subnet_count(fork_name) {
            topics.push(GossipKind::BlobSidecar(i));
        }
    }
    // ...
}
```

`core_topics_to_subscribe` only returns the blob topics if `fork < Fulu`. Then, at the fork boundary, we subscribe with the new fork digest to `core_topics_to_subscribe(next_fork)`, which excludes the blob topics. I added `is_fork_non_core_topic` to carry the aggregator topics for attestations and sync committee messages over to the next fork. This approach is future-proof if those topics ever become fork-dependent.

Closes sigp#6854
We don't need to store `BehaviourAction` for `ready_requests`, which lets us avoid the `unreachable!` in sigp#6625. This PR should therefore be merged before it.
Enable blobs by default in Deneb runs of Kurtosis.
I keep being notified for PRs like sigp#7009 that don't touch the directories specified in the `CODEOWNERS` file. After reading the [docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners), not having a forward slash at the beginning of the path means:

> In this example, @octocat owns any file in an apps directory anywhere in your repository.

whereas with the slashes:

> In this example, @doctocat owns any file in the `/docs` directory in the root of your repository and any of its subdirectories.

This update makes the paths under the jxs ownership more strict.
PeerDAS has undergone multiple refactors, and blending it with the get_blobs optimization has generated technical debt. A function signature like this https://github.com/sigp/lighthouse/blob/f008b84079bbb6eb86de22bb3421dfc8263a5650/beacon_node/beacon_chain/src/beacon_chain.rs#L7171-L7178 allows at least the following combination of states:
- blobs: Some / None
- data_columns: Some / None
- data_column_recv: Some / None
- Block has data? Yes / No
- Block post-PeerDAS? Yes / No

In reality, we don't have that many possible states, only:
- `NoData`: pre-Deneb, pre-PeerDAS with 0 blobs, or post-PeerDAS with 0 blobs
- `Blobs(BlobSidecarList<E>)`: post-Deneb, pre-PeerDAS with > 0 blobs
- `DataColumns(DataColumnSidecarList<E>)`: post-PeerDAS with > 0 blobs
- `DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList<E>>)`: post-PeerDAS with > 0 blobs, but we obtained the columns via reconstruction

These are the variants of the new `AvailableBlockData` enum. So we go from 2^5 states to 4 well-defined ones. Downstream code benefits nicely from this clarity and I think it makes the whole feature much more maintainable.

Currently `is_available` returns a bool, and then we construct the available block in `make_available`. In a way, the availability condition is duplicated in both functions. Instead, this PR constructs `AvailableBlockData` in `is_available`, so the availability conditions are written once:

```rust
if let Some(block_data) = is_available(..) {
    let available_block = make_available(block_data);
}
```
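The variant list above translates almost directly into a Rust enum. Below is a minimal, self-contained sketch of that shape: the variant names come from the description, but the payload types are simplified stand-ins rather than Lighthouse's real definitions, and the `is_available` body is elided.

```rust
use tokio::sync::oneshot;

// Stand-in aliases for the real Lighthouse types, only so the sketch compiles.
type BlobSidecarList = Vec<Vec<u8>>;
type DataColumnSidecarList = Vec<Vec<u8>>;

/// The four well-defined data states of an available block, as listed above.
enum AvailableBlockData {
    /// Pre-Deneb, or a post-Deneb / post-PeerDAS block with 0 blobs.
    NoData,
    /// Post-Deneb, pre-PeerDAS block with > 0 blobs.
    Blobs(BlobSidecarList),
    /// Post-PeerDAS block with > 0 blobs.
    DataColumns(DataColumnSidecarList),
    /// Post-PeerDAS block whose columns arrive later via reconstruction.
    DataColumnsRecv(oneshot::Receiver<DataColumnSidecarList>),
}

/// `is_available` returns the constructed data (if available) instead of a bool,
/// so the availability condition lives in one place.
fn is_available(/* ... availability inputs elided ... */) -> Option<AvailableBlockData> {
    // Availability logic elided in this sketch.
    Some(AvailableBlockData::NoData)
}

fn main() {
    match is_available() {
        Some(AvailableBlockData::NoData) => println!("available, no data"),
        Some(_) => println!("available, with blobs or columns"),
        None => println!("not yet available"),
    }
}
```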
Partly addresses sigp#6959. Use the `enable_light_client_server` field from the beacon chain config in the HTTP API. I think we can make this the single source of truth, as the network crate also has access to the beacon chain config.
…ing (sigp#7030) Related to sigp#6880, an issue that's usually observed on local devnets with a small number of nodes. When testing range sync, I usually shut down a node for some period of time and restart it. However, if it's within `SYNC_TOLERANCE_EPOCHS` (8), Lighthouse would consider the node synced, and it may attempt to produce a block if requested by a validator. On a local devnet, nodes frequently produce blocks, so when this happens the node ends up producing a block that would revert finality and gets disconnected from its peers immediately.

### Usage

Run the Lighthouse BN with this flag to override:
```
--sync-tolerance-epoch 0
```
N/A. Derive `ssz::Encode` and `ssz::Decode` on the `SignedValidatorRegistrationData` type for use in the builder.
NA. Bumps the `ethereum_ssz` version, along with other crates that share the dep. Primarily, this gives us bitfields which can store 128 bytes on the stack before allocating, rather than 32 bytes (sigp/ethereum_ssz#38). The validator count has increased massively since we set it at 32 bytes, so aggregation bitfields (et al.) now require a heap allocation. This new value of 128 should get us to ~2m active validators.
This updates to a version with the [compute_cells](https://github.com/ethereum/consensus-specs/blob/dev/specs/fulu/polynomial-commitments-sampling.md#compute_cells) method, which allows you to extend a blob without creating a proof for it.
Delete the duplicate sync tolerance epoch config in the HTTP API, which is unused. We introduced the `sync-tolerance-epoch` flag in this PR:
- sigp#7030

Then refined it in this PR:
- sigp#7044

Somewhere in the merge of `release-v7.0.0` into `unstable`, the config from the original PR, which had been deleted, came back. I think I resolved those conflicts, so my bad.
We forked `gossipsub` into the Lighthouse repo some time ago so that we could iterate quicker on implementing back pressure and IDONTWANT. Meanwhile we have pushed all our changes upstream, and since we are now the main maintainers of `rust-libp2p`, this allows us to use upstream `gossipsub` again. Nonetheless, we still use our forked repo to give us the freedom to experiment with features before submitting them upstream.
Tracing Integration - [reference](https://github.com/eth-protocol-fellows/cohort-five/blob/5bbf1859e921065bd69f8671038ed16643465b86/projects/project-ideas.md?plain=1#L297)
- [x] replace slog & log with tracing throughout the codebase
- [x] implement custom crit log
- [x] make relevant changes in the formatter
- [x] replace sloggers
- [x] re-write SSE logging components

cc: @macladson @eserilev
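For context, here is a minimal sketch of the kind of `tracing` setup such a migration moves toward. It is not Lighthouse's actual initialization code; the env-filter default of "info" is an assumption.

```rust
use tracing_subscriber::{fmt, EnvFilter};

// Initialize a global `tracing` subscriber, the replacement for a slog-based
// logger. Reads the filter from RUST_LOG, defaulting to "info".
fn init_tracing() {
    fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();
}

fn main() {
    init_tracing();
    tracing::info!(component = "demo", "tracing subscriber initialized");
}
```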
Currently, range sync download errors are logged only as `error: rpc_error`, which isn't helpful. The actual error is suppressed unless logged somewhere else. Log the actual error that caused the batch download to fail as part of the log line stating that the batch download failed.
- sigp#6452 (partially)

Remove dependencies on `store` and `lighthouse_network` from `eth2`. This was achieved as follows:
- depend on `enr` and `multiaddr` directly instead of using `lighthouse_network`'s reexports.
- make `lighthouse_network` responsible for converting between API and internal types.
- in two cases, remove complex internal types and use the generic `serde_json::Value` instead. This is not ideal, but should be fine for now, as it affects two internal non-spec endpoints which are meant for debugging, are unstable, and are subject to change without notice anyway.

Inspired by sigp#6679. The alternative is to move all relevant types to `eth2` or `types` instead - what do you think?
I feel it's preferable to do this explicitly by updating the revision in `Cargo.toml` rather than implicitly by letting `Cargo.lock` control the revision of the branch.
It'll help us debug the issue with rolling file appender initialization (`Failed to initialize libp2p rolling file appender`) cc: @macladson
sigp#2573 Change release page display in dark theme.

Before: ![Before](https://user-images.githubusercontent.com/574696/132262479-c1e5c904-576f-4878-8a10-6012e0c51620.png)

After: ![After](https://github.com/user-attachments/assets/e6f42090-f9eb-4da7-9567-521124ea2f10)

Others stay unchanged.
Lighthouse currently lacks support for cross-compilation targeting the `riscv64` architecture. This PR introduces initial support for cross-compiling Lighthouse to `riscv64`. The following changes were made:
- **Makefile**: Updated to support `cross` with `riscv64` as a target.
- **Cross.toml**: Added configuration specific to `riscv64`.
- **Documentation**: List `build-riscv64` in `book/src/installation_cross_compiling.md`.
`spamoor_blob` was removed in ethpandaops/ethereum-package#972. When attempting to start a local testnet, it errors with:
```
Evaluation error: fail: Invalid additional_services spamoor_blob, allowed fields: ["assertoor", "broadcaster", "tx_fuzz", "custom_flood", "forkmon", "blockscout", "dora", "full_beaconchain_explorer", "prometheus_grafana", "blobscan", "dugtrio", "blutgang", "forky", "apache", "tracoor", "spamoor"]
```
This PR changes `spamoor_blob` to `spamoor`.
closes sigp#5785

The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The following sections detail the changes.

```mermaid
flowchart TD
    subgraph "*** After ***"
        Start2([START]) --> AA[Receive request]
        AA --> COND1{Is there already an active request <br> with the same protocol?}
        COND1 --> |Yes| CC[Send error response]
        CC --> End2([END])
        %% COND1 --> |No| COND2{Request is too large?}
        %% COND2 --> |Yes| CC
        COND1 --> |No| DD[Process request]
        DD --> EE{Rate limit reached?}
        EE --> |Yes| FF[Wait until tokens are regenerated]
        FF --> EE
        EE --> |No| GG[Send response]
        GG --> End2
    end
    subgraph "*** Before ***"
        Start([START]) --> A[Receive request]
        A --> B{Rate limit reached <br> or <br> request is too large?}
        B -->|Yes| C[Send error response]
        C --> End([END])
        B -->|No| E[Process request]
        E --> F[Send response]
        F --> End
    end
```

### `Is there already an active request with the same protocol?`

This check is not performed in `Before`. It is taken from the consensus-specs PR that proposes updates regarding rate limiting and response timeouts: https://github.com/ethereum/consensus-specs/pull/3767/files

> The requester MUST NOT make more than two concurrent requests with the same ID.

The spec PR mentions the requester side. In this PR, I introduced the `ActiveRequestsLimiter` for the `responder` side to restrict more than two requests from running simultaneously on the same protocol per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.

### `Rate limit reached?` and `Wait until tokens are regenerated`

UPDATE: I moved the limiter logic to the behaviour side. sigp#5923 (comment)

~~The rate limiter is shared between the behaviour and the handler (`Arc<Mutex<RateLimiter>>`). The handler checks the rate limit and queues the response if the limit is reached. The behaviour handles pruning.~~ ~~I considered not sharing the rate limiter between the behaviour and the handler, and performing all of this either within the behaviour or the handler. However, I decided against this for the following reasons:~~
- ~~Regarding performing everything within the behaviour: the behaviour is unable to recognize the response protocol when `RPC::send_response()` is called, especially when the response is `RPCCodedResponse::Error`. Therefore, the behaviour can't rate limit responses based on the response protocol.~~
- ~~Regarding performing everything within the handler: when multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔)~~
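To make the responder-side limiting idea concrete, here is a minimal, self-contained sketch. The types and method names beyond `ActiveRequestsLimiter` are stand-ins (not Lighthouse's actual API), and the limit of two mirrors the quoted spec text rather than the exact behaviour implemented in the PR.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real libp2p / Lighthouse types.
type PeerId = u64;
type Protocol = &'static str;

/// At most this many active inbound requests per (peer, protocol),
/// following the spec's "MUST NOT make more than two concurrent requests".
const MAX_CONCURRENT: usize = 2;

#[derive(Default)]
struct ActiveRequestsLimiter {
    active: HashMap<(PeerId, Protocol), usize>,
}

impl ActiveRequestsLimiter {
    /// Returns true if the request may proceed; false if the peer already has
    /// too many active requests on this protocol (send an error + penalize).
    fn try_add(&mut self, peer: PeerId, protocol: Protocol) -> bool {
        let count = self.active.entry((peer, protocol)).or_insert(0);
        if *count >= MAX_CONCURRENT {
            return false;
        }
        *count += 1;
        true
    }

    /// Call once the response stream for this request has completed.
    fn remove(&mut self, peer: PeerId, protocol: Protocol) {
        if let Some(count) = self.active.get_mut(&(peer, protocol)) {
            *count = count.saturating_sub(1);
            if *count == 0 {
                self.active.remove(&(peer, protocol));
            }
        }
    }
}

fn main() {
    let mut limiter = ActiveRequestsLimiter::default();
    assert!(limiter.try_add(1, "blocks_by_range"));
    assert!(limiter.try_add(1, "blocks_by_range"));
    // A third concurrent request on the same (peer, protocol) is rejected.
    assert!(!limiter.try_add(1, "blocks_by_range"));
    limiter.remove(1, "blocks_by_range");
    assert!(limiter.try_add(1, "blocks_by_range"));
}
```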
sigp#6746 Add a `--presign` flag to emit the JSON output to stdout instead of publishing the exit.
One piece of information in the consolidation section of the Lighthouse book is wrong. I realised this after reading https://ethereum.org/en/roadmap/pectra/maxeb/ and taking a further look at [EIP-7251](https://eips.ethereum.org/EIPS/eip-7251), which states:

> Note: the system contract uses the EVM CALLER operation (Solidity: msg.sender) to get the address used in the consolidation request, i.e. the address that calls the system contract must match the 0x01 withdrawal credential recorded in the beacon state.

So the withdrawal credentials of the source and target validators need not be the same.
Increases default gas limit to 36M.
Changes the endpoint to get fallback health information from `/lighthouse/ui/fallback_health` to `/lighthouse/beacon/health`. This more accurately reflects that the endpoint is related to the connected beacon nodes, and also matches the `/lighthouse/beacon/update` endpoint being added in sigp#6551. Adds documentation for both fallback health and the endpoint to the Lighthouse book.
When we perform data column gossip verification, we sometimes see multiple proposer shuffling cache misses simultaneously, which results in multiple threads computing the shuffling and potentially slows down gossip verification. The proposal here is to use a `OnceCell` for each shuffling key to make sure it's only computed once. I have only implemented this in data column verification as a PoC, but this can also be applied to blob and block verification. Related issues:
- sigp#4447
- sigp#7203
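A minimal, self-contained sketch of the per-key compute-once pattern described above. The key and value types are hypothetical stand-ins, and std's `OnceLock` is used in place of the `OnceCell` mentioned; this is not Lighthouse's actual shuffling cache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-ins for Lighthouse's shuffling key and shuffling value.
type ShufflingKey = (u64, [u8; 32]); // (epoch, block root)
type Shuffling = Vec<usize>;

#[derive(Default)]
struct ShufflingCache {
    // One cell per key: the first caller computes, concurrent callers block on
    // the same cell instead of recomputing the shuffling.
    cells: Mutex<HashMap<ShufflingKey, Arc<OnceLock<Shuffling>>>>,
}

impl ShufflingCache {
    fn get_or_compute<F>(&self, key: ShufflingKey, compute: F) -> Shuffling
    where
        F: FnOnce() -> Shuffling,
    {
        // Fetch (or insert) the cell, then release the map lock so an
        // expensive computation for one key doesn't block other keys.
        let cell = {
            let mut cells = self.cells.lock().unwrap();
            cells.entry(key).or_default().clone()
        };
        cell.get_or_init(compute).clone()
    }
}

fn main() {
    let cache = ShufflingCache::default();
    let key = (42, [0u8; 32]);
    // Both calls return the same value; `compute` runs at most once per key.
    let a = cache.get_or_compute(key, || vec![1, 2, 3]);
    let b = cache.get_or_compute(key, || unreachable!("already computed"));
    assert_eq!(a, b);
}
```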
Debugging a sync issue from @pawanjay176, I'm missing some key info: instead of logging the ID of the SyncingChain we just log "Finalized" (the sync type). This looks like a typo or something that was lost in translation when refactoring.

```
Apr 17 12:12:00.707 DEBUG Syncing new finalized chain chain: Finalized, component: "range_sync"
```

This log should include more info about the new chain but just logs "Finalized":

```
Apr 17 12:12:00.810 DEBUG New chain added to sync peer_id: "16Uiu2HAmHP8QLYQJwZ4cjMUEyRgxzpkJF87qPgNecLTpUdruYbdA", sync_type: Finalized, new_chain: Finalized, component: "range_sync"
```

- Remove the Display impl and log the ID explicitly for all logs.
- Log more details when creating a new SyncingChain.
- Re-opens sigp#6864 targeting unstable

Range sync and backfill sync still assume that each batch request is done by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers. Issues with current unstable:
- Peer prioritization counts batch requests per peer. This accounting is broken now: data columns by range requests are not accounted for.
- Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers.
- The implementation is very strict when we have no peers to request from. After PeerDAS this case is very common, and we want to be flexible and handle it better than just hard-failing everything.

Changes:
- [x] Upstream peer prioritization to the network context, which knows exactly how many active requests a peer has (including columns by range)
- [x] Upstream peer selection to the network context; `block_components_by_range_request` now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`
- [ ] Range sync and backfill sync handle `RpcRequestSendError::NoPeer` explicitly
- [ ] Range sync: leaves the batch in `AwaitingDownload` state and does nothing. **TODO**: we should have some mechanism to fail the chain if it's stale for too long - **EDIT**: Not done in this PR
- [x] Backfill sync: pauses the sync until another peer joins - **EDIT**: Same logic as unstable

### TODOs
- [ ] Add tests :)
- [x] Manually test backfill sync

Note: this touches the mainnet path!
- Create trait `ValidatorStore` with all functions used by the `validator_services`
- Make `validator_services` generic over `S: ValidatorStore`
- Introduce `LighthouseValidatorStore`, which has identical functionality to the old `ValidatorStore`
- Remove dependencies (especially `environment`) from `validator_services` and `beacon_node_fallback` in order to be able to cleanly use them in Anchor
Prevent running `lighthouse vc --http-port <PORT>` without `--http`. Issue: sigp#7402. Added a `requires` constraint so that `--http-port` requires `--http` when running `lighthouse vc`. Implemented a test for this issue.
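For illustration, a minimal sketch of how such a constraint is typically expressed with clap. This is not Lighthouse's actual CLI definition; everything beyond the `--http` / `--http-port` flag names is an assumption.

```rust
use clap::{Arg, ArgAction, Command};

// Build a toy CLI where --http-port is rejected unless --http is also present.
fn cli() -> Command {
    Command::new("validator_client")
        .arg(
            Arg::new("http")
                .long("http")
                .action(ArgAction::SetTrue)
                .help("Enable the HTTP API server"),
        )
        .arg(
            Arg::new("http-port")
                .long("http-port")
                .value_name("PORT")
                .requires("http") // error out if --http-port is given without --http
                .help("Port for the HTTP API server"),
        )
}

fn main() {
    // Passing `--http-port 5062` alone produces a "requires --http" error.
    let matches = cli().get_matches();
    let _ = matches;
}
```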
…ip network (sigp#7409) Don't publish data columns reconstructed from RPC columns to the gossip network, as this may result in peer downscoring if we're sending columns from past slots.
Add a default request timeout to all `BeaconNodeHttpClient` requests to ensure that no HTTP request can hang indefinitely.
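For illustration, a sketch of the general approach: build the underlying HTTP client with a default timeout so every request inherits it. This is not Lighthouse's actual code, and the 30-second value is an assumed placeholder.

```rust
use std::time::Duration;

// Build an HTTP client whose requests all fail with a timeout error instead of
// hanging forever, unless a per-request timeout overrides the default.
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // assumed default value
        .build()
}

fn main() {
    let client = build_client().expect("client should build");
    let _ = client;
}
```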
This PR adds transitions to Electra ~~and Fulu~~ fork epochs in the simulator tests. ~~It also covers blob inclusion verification and data column syncing on a full node in Fulu.~~ UPDATE: Remove fulu fork from sim tests due to sigp#7199 (comment)
Closes sigp#6895. We need sync to retry custody requests when a peer's CGC updates. A higher CGC can result in a data column subnet peer count increasing from 0 to 1, allowing requests to happen. Add a new sync event, `SyncMessage::UpdatedPeerCgc`. It's sent by the router when a metadata response updates the known CGC.
Beacon logs in the simulator are printed only to stdout. The logs are usually large, so persisting them would be helpful for debugging. Added a `--log-dir` parameter to the simulators and a step to upload the logs to Artifacts. (Update) Added `--disable-stdout-logging` to disable stdout logging, making the CI page cleaner.
Updates to the latest gossipsub revision, which includes libp2p/rust-libp2p#5868. cc @jimmygchen as I think you were interested in this feature.
This pull request has merge conflicts. Could you please resolve them @Ayushdubey86? 🙏
closes #7115
Completely forgot about this, do check this @michaelsproul
ref #7178