
Implement basic validator custody framework (no backfill) #7578


Open

pawanjay176 wants to merge 18 commits into base: unstable

Conversation

pawanjay176
Member

Issue Addressed

Resolves #6767

Proposed Changes

This PR implements a basic version of validator custody.

  • It introduces a new CustodyContext object which contains information about the number of validators attached to the node and the custody count they contribute to the CGC.
  • The CustodyContext is added to the da_checker and has methods for returning the current CGC and the number of columns to sample at head. Note that the logic for returning the CGC previously lived in the network globals.
  • To estimate the number of validators attached, we use the beacon_committee_subscriptions endpoint. This might overestimate the number of validators actually publishing attestations from the node in multi-BN setups. We could also use the publish_attestations endpoint at a later point to get a more conservative estimate.
  • Any time there's a change in the custody_group_count due to the addition/removal of validators, the custody context sends an event on a broadcast channel (a rough sketch follows this list). The only subscriber for the channel currently lives in the network service, which simply subscribes to more subnets. Additional subscribers can be added in sync to start a backfill once the CGC changes.
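A rough sketch of how this could look, assuming a tokio broadcast channel; the type and method names below are illustrative, not necessarily the PR's exact API:

```rust
// Illustrative sketch only (names are assumptions, not the PR's actual types).
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::broadcast;

/// Event emitted whenever the validator-derived custody count changes.
#[derive(Clone, Debug)]
pub enum CustodyCountChanged {
    /// The CGC increased; subscribers (e.g. the network service) should
    /// subscribe to additional data column subnets.
    SamplingCountIncreased { new_custody_group_count: u64 },
}

pub struct CustodyContext {
    /// Custody group count contributed by locally attached validators.
    validator_custody_count: AtomicU64,
    /// Broadcast channel for CGC changes; later, sync could also subscribe
    /// here to trigger a backfill.
    sender: broadcast::Sender<CustodyCountChanged>,
}

impl CustodyContext {
    pub fn new() -> Self {
        let (sender, _initial_receiver) = broadcast::channel(16);
        Self {
            validator_custody_count: AtomicU64::new(0),
            sender,
        }
    }

    /// Called when the set of attached validators implies a new custody count.
    /// Only increases are handled, matching the current state of the PR.
    pub fn register_validator_custody(&self, new_count: u64) {
        let old_count = self.validator_custody_count.swap(new_count, Ordering::Relaxed);
        if new_count > old_count {
            // Ignore the error case where no subscriber is listening yet.
            let _ = self.sender.send(CustodyCountChanged::SamplingCountIncreased {
                new_custody_group_count: new_count,
            });
        }
    }

    /// Used by the network service (and potentially sync) to listen for changes.
    pub fn subscribe(&self) -> broadcast::Receiver<CustodyCountChanged> {
        self.sender.subscribe()
    }
}
```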

TODO

  • Currently, the logic only handles an increase in validator count and does not handle a decrease. We should ideally unsubscribe from subnets when the cgc has decreased.
  • Add a service in the CustodyContext that emits an event once MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS passes after updating the current cgc. This event should be picked up by a subscriber which updates the enr and metadata.
  • Add more tests

@pawanjay176 pawanjay176 requested a review from jxs as a code owner June 7, 2025 07:27
@pawanjay176 pawanjay176 added the work-in-progress and fulu labels Jun 7, 2025
Member

@dknopik dknopik left a comment

Took a look at this as an opportunity to refamiliarise myself with the DAS code and spec. I hope I looked at the correct spec version x)

Also, just noting as I found no explicit TODO or code for this: I think we also need to update the ValidatorRegistrations if the effective balance changes due to beacon chain rewards, penalties, deposits and (partial) withdrawals.

.into();
debug!(?custody_context, "Persisting custody context to store");

if let Err(e) =
Member

Is it necessary to clear the custody context here?
Thinking about scenarios where the persist function below fails: then we'd have nothing in the DB.

Also I think we should probably persist this more often, so we're crash safe? Perhaps every time there's a change?

Member Author

Yeah probably not necessary to clear it, I just copied the other persist functions. I'll fix that up.

Member Author

Also I think we should probably persist this more often, so we're crash safe? Perhaps every time there's a change?

Not sure it's necessary. It's not catastrophic if the custody context doesn't get persisted. In the case of an unclean shutdown, I think we'd have bigger issues with other core beacon chain stuff not getting persisted.

Member

Removed the clearing and now persist the custody context last.

beacon_chain
    .data_availability_checker
    .custody_context()
    .sampling_count(&beacon_chain.spec),
Member

Should this be the advertised custody group count?

Member

No, we actually want to subscribe to sample_count number of subnets, but use advertised_custody_group for the metadata and ENR.
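A minimal illustration of that split, assuming the spec's rule that a node samples at least SAMPLES_PER_SLOT custody groups; the names below are illustrative, not the PR's actual API:

```rust
/// Illustrative only: the two counts a node keeps track of.
struct CustodyCounts {
    /// CGC implied by locally attached validators (may have just increased).
    current_cgc: u64,
    /// CGC we can guarantee to serve for the whole DA window; this is the value
    /// that lags behind increases and goes into the ENR and metadata.
    advertised_cgc: u64,
}

impl CustodyCounts {
    /// Number of column groups to sample (and hence subnets to subscribe to) at head.
    fn sampling_count(&self, samples_per_slot: u64) -> u64 {
        self.current_cgc.max(samples_per_slot)
    }

    /// Value written into the ENR `cgc` field and the node's metadata.
    fn advertised_count(&self) -> u64 {
        self.advertised_cgc
    }
}
```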

@jimmygchen jimmygchen changed the title Implement basic validator custody framework Implement basic validator custody framework (no backfill) Jun 10, 2025
/// Does not handle decreasing validator counts
#[derive(Default, Debug)]
struct ValidatorRegistrations {
    /// Set of all validators registered to this node along with their effective balances
Collaborator

The effective balance at what state? The spec defines that we must use the effective balance of the latest finalized state. Do we have the guarantee that this balance is the expected one, and matches the current finalized state of this DB?

.store(new_validator_custody, Ordering::Relaxed);

let updated_cgc = self.custody_group_count(spec);
// Send the message to network only if there are more columns subnets to subscribe to
Collaborator

Isn't it quite leaky for the beacon_chain crate to be aware of the network and send it messages?

Collaborator

Instead, you can move this code to the network side, where you receive the validator custody events, and call the network after register_validators in the HTTP API.
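A hedged sketch of that suggestion, reusing the illustrative CustodyCountChanged event from the earlier sketch: the network service owns the receiver and reacts to custody events itself, so the beacon_chain crate never sends messages to the network directly.

```rust
// Illustrative sketch only; subscribe_to_column_subnets stands in for whatever the
// network service actually uses to join gossipsub subnets.
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
enum CustodyCountChanged {
    SamplingCountIncreased { new_custody_group_count: u64 },
}

/// Event loop owned by the network service. The HTTP API only updates the
/// custody context after register_validators; the resulting event arrives here.
async fn custody_event_loop(mut rx: broadcast::Receiver<CustodyCountChanged>) {
    // Note: a lagged receiver returns an error and would end this loop; a real
    // implementation should handle RecvError::Lagged explicitly.
    while let Ok(event) = rx.recv().await {
        match event {
            CustodyCountChanged::SamplingCountIncreased { new_custody_group_count } => {
                subscribe_to_column_subnets(new_custody_group_count);
            }
        }
    }
}

fn subscribe_to_column_subnets(cgc: u64) {
    // Placeholder: map the CGC to data column subnets and subscribe to the new ones.
    println!("would subscribe to column subnets for cgc {cgc}");
}
```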

for custody_index in &custody_groups {
    let columns = compute_columns_for_custody_group(*custody_index, &self.spec)
        .expect("should compute custody columns for node");
    sampling_columns.extend(columns);
Collaborator

Noting that we now have two sources of truth (three items in total) that we must carefully keep synchronized:

  • NetworkGlobals::sampling_subnets
  • NetworkGlobals::sampling_columns
  • DataAvailabilityChecker::custody_context

It's fine, but having a single source of truth was the motivation for passing the cgc around.

registrations,
chain.slot().unwrap(),
&chain.spec,
);
Collaborator

In my POC I also used the builder registration, which is a much faster signal for those that use it.

/// This should trigger downstream actions like setting
/// a new cgc value in the enr and metadata fields and
/// performing any related cleanup actions.
AdvertisedCustodyCountChanged { new_custody_count: u64 },
Collaborator

Never used?

Collaborator

Some notes of mine: I came to define the AdvertisedCustodyCount as follows.

Thanks to the CGC registry updates you can construct a function CGC(time). Consider da_window and now as quantities in the same unit of time. Then let the set da_window_CGCs contain all distinct values of the CGC(time) function between now - da_window and now. Then AdvertisedCustodyCount = min(da_window_CGCs). This value represents the minimum CGC for which we can guarantee to have all expected columns in our DB over the DA window.

That is why it's important to set the registry update to an epoch in the future, to guarantee the above.
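A minimal sketch of that definition, assuming CGC updates are recorded as (effective_epoch, cgc) pairs sorted by epoch and that the window is expressed in epochs; all names are illustrative:

```rust
/// Minimum CGC that was in effect at any point during the last `da_window_epochs`
/// epochs, i.e. the largest value we can safely advertise in the ENR/metadata.
fn advertised_custody_group_count(
    // (epoch at which the CGC became effective, cgc), sorted by epoch ascending.
    cgc_updates: &[(u64, u64)],
    current_epoch: u64,
    da_window_epochs: u64,
) -> Option<u64> {
    let window_start = current_epoch.saturating_sub(da_window_epochs);
    // CGC in effect at the start of the window: the last update at or before it.
    let mut in_effect_at_start: Option<u64> = None;
    // Minimum over all updates that became effective inside the window.
    let mut min_in_window: Option<u64> = None;
    for &(epoch, cgc) in cgc_updates {
        if epoch <= window_start {
            in_effect_at_start = Some(cgc);
        } else if epoch <= current_epoch {
            // Updates scheduled for a future epoch are deliberately excluded.
            min_in_window = Some(min_in_window.map_or(cgc, |m| m.min(cgc)));
        }
    }
    match (in_effect_at_start, min_in_window) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (Some(a), None) => Some(a),
        (None, b) => b,
    }
}
```

For example, with updates [(0, 4), (100, 8), (120, 16)], a current epoch of 130 and a 50-epoch window, the result is 4: CGC 4 was still in effect at epoch 80, so we cannot advertise more than that yet.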

slot: Slot,
spec: &ChainSpec,
) {
let epoch = slot.epoch(E::slots_per_epoch());
Collaborator

The benefit of tracking the epoch in the updates is to uphold the following invariant:

A given block will have the same CGC during its lifetime.

This prevents situations where we receive a block from gossip with a CGC of 8 and, by the time we import it, it turns out we are at 16, failing the import. The columns have already been gossiped (and ignored by us because we were at CGC 8), so we would have to trigger lookup sync to fetch the remaining columns and recover. It may work but it feels messy. Instead we can schedule the update for the next epoch to prevent this edge case.
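A small sketch of the scheduling idea under that assumption (a map keyed by the epoch at which each CGC becomes effective; names are illustrative): updates only take effect from the next epoch, and a lookup picks the latest update already effective at the block's epoch, so blocks of the current epoch keep the CGC they were received with.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct CgcSchedule {
    /// effective_epoch -> cgc
    updates: BTreeMap<u64, u64>,
}

impl CgcSchedule {
    /// Record a new CGC that only takes effect from the next epoch, so blocks of the
    /// current epoch are unaffected.
    fn schedule_update(&mut self, current_epoch: u64, new_cgc: u64) {
        self.updates.insert(current_epoch + 1, new_cgc);
    }

    /// CGC that applies to a block from `epoch`: the latest update that was already
    /// effective at that epoch.
    fn cgc_at_epoch(&self, epoch: u64) -> Option<u64> {
        self.updates
            .range(..=epoch)
            .next_back()
            .map(|(_, cgc)| *cgc)
    }
}
```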
