Skip to content

Conversation

@imabdulbasit
Copy link
Contributor

@imabdulbasit imabdulbasit commented Nov 21, 2025

Currently the MembershipCoordinator will trigger a catchup request if the stake table is not present in memory. We do store the stake tables in add_epoch_root(). This PR adds fetching from storage if the stake table is not found in memory before triggering a catchup request. It adds a new trait EpochStateStorage in hotshot which contains all the storage methods related to membership including the DRB store/load. This allows us to remove alll the DRB callbacks from the MembershipCoordinator as well.

This also removes membership.reload_stake(RECENT_STAKE_TABLES_LIMIT).await; because we already reload stake tables when loading the consensus data at startup


@imabdulbasit imabdulbasit force-pushed the ab/load-membership-from-storage branch from 3d22457 to c1e0659 Compare November 21, 2025 16:48
@imabdulbasit imabdulbasit reopened this Nov 21, 2025
@imabdulbasit imabdulbasit changed the title load membership from storage before catchup Load membership from storage in MembershipCoordinator before triggering catchup Nov 21, 2025
@imabdulbasit imabdulbasit marked this pull request as ready for review November 21, 2025 18:47
Copy link
Contributor

@ss-es ss-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review on the hotshot side, I'll keep looking later

Comment on lines +101 to +103
pub async fn compute_drb_result<TYPES: NodeType>(
drb_input: DrbInput,
store_drb_progress: StoreDrbProgressFn,
load_drb_progress: LoadDrbProgressFn,
storage: Arc<dyn EpochStateStorage<TYPES>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not the biggest deal, but personally I think this is a downgrade in code structure -- this is the core DRB logic and now we need a full TestStorage implementation to test it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is only used in epoch coordinator, and coordinator already has storage now so I don't see the point of having callbacks?

#[async_trait]
pub trait Storage<TYPES: NodeType>: Send + Sync + Clone + 'static {
pub trait Storage<TYPES: NodeType>:
Send + Sync + Clone + 'static + EpochStateStorage<TYPES>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if Storage requires EpochStateStorage, why do we have a separate trait? why not just combine them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storage is not Object Safe. Passing storage to Coordinator would require a generic parameter and I wanted to avoid that

&self,
epoch: TYPES::Epoch,
need_randomized: bool,
) -> Result<bool> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear from the name + return type what the semantics of this function are. I think this could really use a comment explaining what it does and what it returns

also I really think Result<bool> is a bad return type, it's very hard to reason about. I think it would be better to make a custom enum or something here for the return type of insert_from_storage

async fn insert_from_storage(
&self,
epoch: TYPES::Epoch,
need_randomized: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an enum would make help clarity here, e.g. NeedsRandomized and SkipRandomized

Comment on lines 185 to 193
if self
.insert_from_storage(epoch, false)
.await
.inspect_err(|e| {
let _ = warn!("loading from storage for epoch {epoch:?} failed: {e}");
})?
{
return Ok(ret_val);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. if we error from storage we should continue, not return

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also say it's not actually important that insert_from_storage is told whether we need_randomized. It should always query the DRB and load it if it exists. We can check the result afterwards, since has_randomized_stake_table() costs nothing to call.

imo, it would be clearer if:

  • insert_from_storage should be called catchup_from_storage, returning ()
  • after self.catchup_from_storage, perform the membership.read().await.has_stake_table or has_randomized_stake_table() check again to decide whether to return early

Comment on lines 144 to 152
if self
.insert_from_storage(epoch, true)
.await
.inspect_err(|e| {
let _ = warn!("loading from storage for {epoch:?} failed: {e}");
})?
{
return Ok(ret_val);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants