Implement basic validator custody framework (no backfill) #7578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Conversation
Took a look at this as an opportunity to refamiliarise myself with the DAS code and spec. I hope I looked at the correct spec version x)
Also, just noting as I found no explicit TODO or code for this: I think we also need to update the ValidatorRegistrations if the effective balance changes due to beacon chain rewards, penalties, deposits and (partial) withdrawals.
```rust
        .into();
    debug!(?custody_context, "Persisting custody context to store");

    if let Err(e) =
```
Is it necessary to clear the custody context here? Thinking about scenarios where the persist function below fails; then we'd have nothing in the DB.
Also I think we should probably persist this more often, so we're crash safe? Perhaps every time there's a change?
Yeah probably not necessary to clear it, I just copied the other persist functions. I'll fix that up.
> Also I think we should probably persist this more often, so we're crash safe? Perhaps every time there's a change?

Not sure it's necessary. It's not catastrophic if the custody context doesn't get persisted. In the case of an unclean shutdown, I think we might have bigger issues with other core beacon chain stuff not getting persisted.
Removed the clearing, and now persist the custody context last.
```rust
beacon_chain
    .data_availability_checker
    .custody_context()
    .sampling_count(&beacon_chain.spec),
```
Should this be the advertised custody group count?
No, we actually want to subscribe to `sample_count` number of subnets, but use `advertised_custody_group` for metadata and ENR.
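A minimal sketch of that distinction, with a hypothetical struct layout (not the PR's actual API): the node joins subnets based on the sampling count, but advertises only the count it can back with stored columns.

```rust
/// Hypothetical sketch: two counts a node tracks under validator custody.
struct CustodyContext {
    /// Drives gossip data column subnet subscriptions at head.
    sampling_count: u64,
    /// Published in the ENR and metadata; raised only once the corresponding
    /// columns are actually available in the DB.
    advertised_custody_group_count: u64,
}

fn main() {
    let ctx = CustodyContext {
        sampling_count: 16,
        advertised_custody_group_count: 8,
    };
    // Subscribe to subnets for 16 groups immediately, but keep telling peers
    // we custody 8 until the extra columns exist for the whole DA window.
    assert_eq!(ctx.sampling_count, 16);
    assert_eq!(ctx.advertised_custody_group_count, 8);
}
```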
… necessary and the node could just advertise immediately to allow the node to serve whatever it has stored.
```rust
/// Does not handle decreasing validator counts.
#[derive(Default, Debug)]
struct ValidatorRegistrations {
    /// Set of all validators that are registered to this node, along with their effective balance
```
The effective balance at what state? The spec defines that we must use the effective balance of the latest finalized state. Do we have the guarantee that this balance is the expected one, and matches the current finalized state of this DB?
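For context on why the balance matters: the spec derives the validator custody requirement from the total effective balance of the node's validators. A sketch of that computation, using the Fulu spec constants as I understand them at the time of writing (verify against the spec version this PR targets; the balances would come from the latest finalized state, per the point above):

```rust
// Spec constants (Gwei / counts) as assumed here; double-check the spec.
const BALANCE_PER_ADDITIONAL_CUSTODY_GROUP: u64 = 32_000_000_000;
const VALIDATOR_CUSTODY_REQUIREMENT: u64 = 8;
const NUMBER_OF_CUSTODY_GROUPS: u64 = 128;

/// Custody groups required for a node, given the effective balances of all
/// validators attached to it.
fn validators_custody_requirement(effective_balances: &[u64]) -> u64 {
    let total_node_balance: u64 = effective_balances.iter().sum();
    let count = total_node_balance / BALANCE_PER_ADDITIONAL_CUSTODY_GROUP;
    count.clamp(VALIDATOR_CUSTODY_REQUIREMENT, NUMBER_OF_CUSTODY_GROUPS)
}

fn main() {
    // Two 32 ETH validators still yield the floor of 8 groups.
    assert_eq!(validators_custody_requirement(&[32_000_000_000; 2]), 8);
    // Twenty validators push the requirement above the floor: 20 groups.
    assert_eq!(validators_custody_requirement(&[32_000_000_000; 20]), 20);
}
```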
```rust
    .store(new_validator_custody, Ordering::Relaxed);

let updated_cgc = self.custody_group_count(spec);
// Send the message to the network only if there are more column subnets to subscribe to
```
Isn't it quite leaky for the `beacon_chain` crate to be aware of the network and send it messages?
Instead, you can move this code to the network side, where you receive the validator custody events, and call the network after `register_validators` in the HTTP API.
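A sketch of that layering, with hypothetical stand-in types (the custody-unit accounting is deliberately simplified to one group per validator): the HTTP handler updates the custody context and, if the CGC grew, notifies the network itself, so `beacon_chain` never holds a network channel.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc;

/// Hypothetical network message; the PR's actual type may differ.
enum NetworkMessage {
    SubscribeNewDataColumnSubnets,
}

#[derive(Default)]
struct CustodyContext {
    validator_custody_units: AtomicU64,
}

impl CustodyContext {
    fn custody_group_count(&self) -> u64 {
        // A floor of 4 stands in for the default custody requirement.
        self.validator_custody_units.load(Ordering::Relaxed).max(4)
    }
    fn register_validators(&self, indices: &[u64]) {
        self.validator_custody_units
            .fetch_add(indices.len() as u64, Ordering::Relaxed);
    }
}

/// HTTP-API-side handler: beacon chain types stay network-agnostic, and the
/// layer that already owns a network channel decides whether to subscribe.
fn register_validators_handler(
    ctx: &CustodyContext,
    indices: &[u64],
    network_send: &mpsc::Sender<NetworkMessage>,
) {
    let old_cgc = ctx.custody_group_count();
    ctx.register_validators(indices);
    if ctx.custody_group_count() > old_cgc {
        let _ = network_send.send(NetworkMessage::SubscribeNewDataColumnSubnets);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let ctx = CustodyContext::default();
    register_validators_handler(&ctx, &[0, 1, 2, 3, 4, 5, 6, 7], &tx);
    // Eight registered validators raised the CGC, so a message was sent.
    assert!(rx.try_recv().is_ok());
}
```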
```rust
for custody_index in &custody_groups {
    let columns = compute_columns_for_custody_group(*custody_index, &self.spec)
        .expect("should compute custody columns for node");
    sampling_columns.extend(columns);
```
Noting that now we have two sources of truth (3 items total) that we must carefully synchronize:
- `NetworkGlobals::sampling_subnets`
- `NetworkGlobals::sampling_columns`
- `DataAvailabilityChecker::custody_context`

It's fine, but having a single source of truth was the motivation for passing the `cgc` around.
```rust
    registrations,
    chain.slot().unwrap(),
    &chain.spec,
);
```
In my POC I also used the builder registration, which is a much faster signal for those that use it.
```rust
/// This should trigger downstream actions like setting
/// a new cgc value in the enr and metadata fields and
/// performing any related cleanup actions.
AdvertisedCustodyCountChanged { new_custody_count: u64 },
```
Never used?
Some notes of mine: I came to define the `AdvertisedCustodyCount` as follows.

Thanks to the CGC registry updates you can construct a function `CGC(time)`. Consider `da_window` and `now` in units of time. Then consider the set `da_window_CGCs` to contain all distinct values of the `CGC(time)` function between `now - da_window` and `now`. Then `AdvertisedCustodyCount = min(da_window_CGCs)`. This value represents the minimum CGC value that we can guarantee to have all expected columns for in our DB over the DA window.

That is why it's important to set the registry update to an epoch in the future, to guarantee the above.
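A sketch of that definition under the stated assumptions (names and the `(epoch, cgc)` history representation are hypothetical): each update is in force from its effective epoch until the next update, and the advertised count is the minimum CGC in force anywhere inside the DA window.

```rust
/// Advertised custody count = min CGC over the DA window.
/// `cgc_updates` holds (epoch the update took effect, cgc), ascending by epoch.
fn advertised_custody_count(
    cgc_updates: &[(u64, u64)],
    current_epoch: u64,
    da_window_epochs: u64,
) -> u64 {
    let window_start = current_epoch.saturating_sub(da_window_epochs);
    cgc_updates
        .iter()
        .enumerate()
        // Update i is in force until the next update's epoch, so it overlaps
        // the window iff the next update starts after window_start (the last
        // update is always in force now, hence always included).
        .filter(|(i, _)| {
            cgc_updates
                .get(*i + 1)
                .map_or(true, |(next_epoch, _)| *next_epoch > window_start)
        })
        .map(|(_, (_, cgc))| *cgc)
        .min()
        .unwrap_or(0)
}

fn main() {
    // CGC was 8 from epoch 0, raised to 16 at epoch 100 and 32 at epoch 110.
    let updates = [(0, 8), (100, 16), (110, 32)];
    // At epoch 120 with a 5-epoch DA window, only the epoch-110 value overlaps
    // the window, so we can safely advertise 32...
    assert_eq!(advertised_custody_count(&updates, 120, 5), 32);
    // ...but at epoch 112 the window reaches back to epoch 107, where the CGC
    // was still 16, so the advertised count is min(16, 32) = 16.
    assert_eq!(advertised_custody_count(&updates, 112, 5), 16);
}
```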
```rust
    slot: Slot,
    spec: &ChainSpec,
) {
    let epoch = slot.epoch(E::slots_per_epoch());
```
The benefit of tracking the epoch in the updates is to uphold the following invariant:

> A given block will have the same CGC during its lifetime

This prevents situations where we receive a block from gossip with a CGC of 8, and at the moment of importing it turns out we are at 16, failing the import. But the columns have already been gossiped (and were ignored by us because we were at CGC 8). In that case we have to trigger lookup sync to fetch the remaining columns and recover. It may work, but it feels messy. Instead we can schedule the update for the next epoch to prevent this edge case.
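A sketch of that scheduling idea, with hypothetical types: increases take effect only at the next epoch, so a lookup by epoch returns one stable CGC for every block of the current epoch.

```rust
/// Hypothetical registry of scheduled CGC changes, ascending by epoch.
#[derive(Default)]
struct CgcSchedule {
    /// (effective_epoch, cgc) pairs.
    updates: Vec<(u64, u64)>,
}

impl CgcSchedule {
    /// Schedule an increase to take effect at the *next* epoch, never the
    /// current one, so blocks of the current epoch import with the old CGC.
    fn schedule_increase(&mut self, current_epoch: u64, new_cgc: u64) {
        self.updates.push((current_epoch + 1, new_cgc));
    }

    /// CGC in force for a block at `epoch`: the last update at or before it.
    fn cgc_at(&self, epoch: u64, default: u64) -> u64 {
        self.updates
            .iter()
            .take_while(|(effective, _)| *effective <= epoch)
            .last()
            .map_or(default, |(_, cgc)| *cgc)
    }
}

fn main() {
    let mut schedule = CgcSchedule::default();
    // Validators registered during epoch 10 raise the CGC to 16...
    schedule.schedule_increase(10, 16);
    // ...but epoch-10 blocks still import with the old CGC of 8, matching the
    // columns we actually subscribed to when they were gossiped.
    assert_eq!(schedule.cgc_at(10, 8), 8);
    assert_eq!(schedule.cgc_at(11, 8), 16);
}
```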
Issue Addressed

Resolves #6767

Proposed Changes

This PR implements a basic version of validator custody.

- Adds a `CustodyContext` object which contains info regarding the number of validators attached to a node and the custody count they contribute to the cgc.
- `CustodyContext` is added in the da_checker and has methods for returning the current cgc and the number of columns to sample at head. Note that the logic for returning the cgc existed previously in the network globals.
- The number of validators attached to the node is derived from the `beacon_committee_subscriptions` endpoint. This might overestimate the number of validators actually publishing attestations from the node in the case of multi BN setups. We could also potentially use the `publish_attestations` endpoint to get a more conservative estimate at a later point.
- On a change in `custody_group_count` due to addition/removal of validators, the custody context should send an event on a broadcast channel. The only subscriber for the channel exists in the network service, which simply subscribes to more subnets (see the sketch after this description). There can be additional subscribers in sync that will start a backfill once the cgc changes.

TODO

- Add logic to `CustodyContext` that emits an event once `MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS` passes after updating the current cgc. This event should be picked up by a subscriber which updates the enr and metadata.
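A sketch of the broadcast-channel wiring the description mentions, assuming a tokio broadcast channel; the event type, variant, and subscriber roles here are hypothetical, not the PR's actual definitions.

```rust
use tokio::sync::broadcast;

/// Hypothetical event emitted by the custody context when the cgc changes.
#[derive(Clone, Debug)]
enum CustodyCountChanged {
    SamplingCountChanged { new_count: u64 },
}

#[tokio::main]
async fn main() {
    let (tx, mut network_rx) = broadcast::channel::<CustodyCountChanged>(16);
    // Sync (or any other interested service) holds its own receiver and can,
    // e.g., kick off a backfill for the newly custodied columns.
    let mut sync_rx = tx.subscribe();

    // CustodyContext side: registering validators raised the cgc.
    tx.send(CustodyCountChanged::SamplingCountChanged { new_count: 16 })
        .expect("at least one receiver is alive");

    // Network service side: subscribe to the additional data column subnets.
    if let Ok(CustodyCountChanged::SamplingCountChanged { new_count }) =
        network_rx.recv().await
    {
        println!("subscribing to subnets for {new_count} custody groups");
    }
    // Sync side: receives an independent copy of the same event.
    assert!(sync_rx.recv().await.is_ok());
}
```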