
[store] Introduce ChainStoreUpdateAdapter and EpochStoreUpdateAdapter and use in epoch_sync #12866

Open
wants to merge 2 commits into base: master

Conversation

shreyan-gupta
Contributor

This PR helps make progress on a couple of things:

  • We introduce a basic set of functions in ChainStoreUpdateAdapter
  • We introduce the EpochStoreUpdateAdapter
  • We use these two adapters in the epoch_sync code
  • We set up the basis for work to simplify the chain store update

...
We set up the basis for work to simplify the chain store update... (how??)

This is because, in my long-standing draft PR, we try to establish that chain store update does not need to depend on the "read your own write" philosophy.

Unfortunately, the only place where this was causing a hindrance was the epoch sync code, due to some initialization complications and some unnecessary state sync code coming into play.

With the now-simplified basic setup of the store after epoch_sync, we no longer take the complex dependency on the chain store and the chain store update's finalize function!!!
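
To make the direction concrete, here is a tiny illustrative toy of the adapter idea (names and signatures are simplified stand-ins, not necessarily the exact API introduced in this PR): the adapter borrows a plain StoreUpdate and exposes typed setters, so epoch_sync can stage its writes directly and commit them once, without going through ChainStoreUpdate and its finalize step.

// Illustrative toy only; the real adapters wrap nearcore's StoreUpdate and DBCol.
struct StoreUpdate {
    ops: Vec<(&'static str, Vec<u8>, Vec<u8>)>, // (column, key, value)
}

struct ChainStoreUpdateAdapter<'a> {
    store_update: &'a mut StoreUpdate,
}

impl<'a> ChainStoreUpdateAdapter<'a> {
    fn set_block_header_only(&mut self, hash: &[u8], header_bytes: &[u8]) {
        // In the real code this serializes the header into the block header column.
        self.store_update.ops.push(("BlockHeader", hash.to_vec(), header_bytes.to_vec()));
    }

    fn set_final_head(&mut self, head_bytes: &[u8]) {
        // Direct counterpart of the old save_final_head, written straight to the store.
        self.store_update.ops.push(("BlockMisc", b"FINAL_HEAD".to_vec(), head_bytes.to_vec()));
    }
}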


codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 79.23077% with 27 lines in your changes missing coverage. Please review.

Project coverage is 70.36%. Comparing base (6e4b731) to head (f750de1).
Report is 2 commits behind head on master.

Files with missing lines                 Patch %   Lines
core/store/src/adapter/epoch_store.rs    34.37%    21 Missing ⚠️
chain/client/src/sync/epoch.rs           90.00%    0 Missing and 3 partials ⚠️
core/store/src/adapter/chain_store.rs    95.16%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12866      +/-   ##
==========================================
+ Coverage   69.88%   70.36%   +0.48%     
==========================================
  Files         853      853              
  Lines      173553   174098     +545     
  Branches   173553   174098     +545     
==========================================
+ Hits       121281   122508    +1227     
+ Misses      47091    46381     -710     
- Partials     5181     5209      +28     
Flag Coverage Δ
backward-compatibility 0.35% <0.00%> (?)
db-migration 0.35% <0.00%> (?)
genesis-check 1.41% <0.00%> (?)
linux 70.06% <79.23%> (+0.17%) ⬆️
linux-nightly 70.00% <79.23%> (?)
pytests 1.72% <0.00%> (?)
sanity-checks 1.52% <0.00%> (?)
unittests 70.19% <79.23%> (+0.31%) ⬆️
upgradability 0.35% <0.00%> (?)

Flags with carried forward coverage won't be shown.



update.merge(store_update);
update.commit()?;
Contributor Author

While the change above looks like a simple rewrite, it is definitely NOT trivial!! There are a couple of things happening here that are different.

We are no longer using chain_store_update but instead using the standard store_update.

Originally we relied on functions like save_block_header_no_update_tree, force_save_header_head and save_final_head in chain store update, along with update.commit()?, which internally calls update.finalize().

(!!) With the new code, we are directly making changes to the store layer instead of relying on the weird, implicit behavior of update.finalize().

We also need to save the new blocks into the store, i.e. call chain_store_update.commit()?, before actually reading from the store right below, since we are no longer relying on the non-committed, temporary values in the update.get_block_header(...) part of the code.
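
Schematically (continuing the toy adapter sketched in the PR description above; the actual calls in this PR differ in detail), the important change is the ordering: stage writes on a plain store update, commit, and only then read back from the store.

use std::collections::HashMap;

// Minimal toy store that only sees data once the update is committed,
// which is exactly why the new code commits before reading.
struct Store {
    data: HashMap<(&'static str, Vec<u8>), Vec<u8>>,
}

impl Store {
    fn commit(&mut self, update: StoreUpdate) {
        for (col, key, value) in update.ops {
            self.data.insert((col, key), value);
        }
    }

    fn get(&self, col: &'static str, key: &[u8]) -> Option<Vec<u8>> {
        self.data.get(&(col, key.to_vec())).cloned()
    }
}

fn save_headers_then_read(store: &mut Store, headers: &[(Vec<u8>, Vec<u8>)]) -> Option<Vec<u8>> {
    let mut update = StoreUpdate { ops: Vec::new() };
    {
        let mut chain = ChainStoreUpdateAdapter { store_update: &mut update };
        for (hash, bytes) in headers {
            chain.set_block_header_only(hash, bytes);
        }
    }
    // Commit first: with no ChainStoreUpdate caching "your own writes",
    // reads only ever see data that has actually been persisted.
    store.commit(update);
    let (last_hash, _) = headers.last()?;
    store.get("BlockHeader", last_hash)
}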

I had to print out all the DB ops from the old code and the new code and compare them to check that we are doing things properly. This has been tested manually.

One tiny thing to note: as part of the finalize behavior in the old chain store update code, we were calling crate::state_sync::update_sync_hashes(...) on all the blocks we were updating, but this isn't required as part of epoch sync. cc @marcelo-gonzalez.

One of those crazy things that's a byproduct of using the heavy hammer of chain store update... :(

Member

update_sync_hashes

Could you explain why? We likely want to sync state just after we finish epoch sync, so we need to make sure update_sync_hashes is called for some blocks; I just don't understand which ones.

Contributor Author

I ran the original code in debug mode, and we call the function below for last_header:

fn on_new_epoch(store_update: &mut StoreUpdate, header: &BlockHeader) -> Result<(), Error> {
    let num_new_chunks = vec![0u8; header.chunk_mask().len()];
    store_update.set_ser(DBCol::StateSyncNewChunks, header.hash().as_ref(), &num_new_chunks)?;
    Ok(())
}

Here, update_sync_hashes only triggers at the epoch boundary, which happens in epoch sync at last_header, i.e. proof.current_epoch.first_block_header_in_epoch.

Here, the key we are adding is DBCol::StateSyncNewChunks, and the column description says something like this:

    /// Stores a mapping from a block's hash to a list (indexed by ShardIndex like a header's chunk_mask())
    /// of the number of new chunks up to that block after the first block in the epoch. Used in calculating
    /// the right "sync_hash" for state sync after the StateSyncHashUpdate protocol feature is enabled.
    /// - *Rows*: `CryptoHash`
    /// - *Column type*: `Vec<u8>`
    StateSyncNewChunks,

This makes me believe it's not super important and would eventually be updated during the next block's processing via a different code path anyway.

However, in case further convincing is required: after epoch sync we do block sync for the current epoch anyway, so it doesn't make sense to think about state_sync related things for that epoch; they would get reset when the next epoch is processed.

@marcelo-gonzalez could you please check if this argument makes sense?

Contributor Author

I had to print out all the DB ops from the old code and the new code and compare them to check if we are doing things properly. This is tested well manually.

Just to reiterate: even though this code path is different, I manually tested it by looking at the actual store_update DB ops, and the only difference was the one DBCol::StateSyncNewChunks entry described above.

Contributor
@marcelo-gonzalez Feb 4, 2025

Yeah, so in the previous code where we call save_block_header_no_update_tree on those headers in epoch sync, we first call it on proof.current_epoch.first_block_header_in_epoch, and like the name says, it's the first block in an epoch. So update_sync_hashes() will just set all zeros in the new-chunks column for that block hash, since it sees it's the first block in an epoch.

Then every other block header we're calling save_block_header_no_update_tree() on is in the previous epoch, and none of the other headers in that epoch have been saved before all this. This has us return early from update_sync_hashes() without doing anything because of this check here. Note the comment about epoch sync there.

So tl;dr, only proof.current_epoch.first_block_header_in_epoch actually does something when we call update_sync_hashes(), and the others can safely be skipped. Actually, there are several other comments in that file about epoch sync possibly meaning that some stuff has not been processed before. It would definitely simplify things if we could make this assumption: if update_sync_hashes() is called on some header and its prev header is in the same epoch, then it has already been called on that prev header. Right now, if that isn't the case, we just silently return and assume it's because of epoch sync, but it would be good to be able to WARN in that case that something is wrong.
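
For anyone following along, here is a rough toy model of that control flow (simplified types and names, not the real update_sync_hashes): the first block of an epoch seeds all-zero counters, a block whose prev header was never recorded returns early and silently, and everything else carries the counters forward.

use std::collections::HashMap;

// Toy model only: maps block hash -> per-shard new-chunk counters.
struct Header {
    hash: u64,
    prev_hash: u64,
    is_first_in_epoch: bool,
    num_shards: usize,
}

fn update_sync_hashes_toy(new_chunks: &mut HashMap<u64, Vec<u8>>, header: &Header) {
    if header.is_first_in_epoch {
        // First block of an epoch: seed all-zero per-shard counters.
        new_chunks.insert(header.hash, vec![0u8; header.num_shards]);
        return;
    }
    // Prev header never recorded (e.g. headers saved out of band by epoch sync):
    // silently return, which is the early exit discussed above. This is the spot
    // where a WARN would help if the cause turns out not to be epoch sync.
    let Some(prev) = new_chunks.get(&header.prev_hash).cloned() else {
        return;
    };
    // Otherwise carry the counters forward (the real code also bumps them per
    // shard based on the block's chunk mask).
    new_chunks.insert(header.hash, prev);
}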

Contributor

So if I understand correctly, this PR does make a behavior change where update_sync_hashes() will no longer be called on proof.current_epoch.first_block_header_in_epoch, but that's OK because that block is not going to be in the current epoch: we header sync until we get to the current epoch, and then update_sync_hashes() works as normal once we reach the new epoch.

Contributor Author

^ Yes, that was my reasoning, and I wanted your confirmation!

self.store_update.set_ser(DBCol::BlockMisc, FINAL_HEAD_KEY, final_head).unwrap();
}

/// This function is normally clubbed with set_block_header_only
Member

But then we can just unite both methods.

Contributor Author

Yes, I was thinking of doing that, but decided to keep the primitives for now. I have plans to eventually integrate functions like these into a helper like update_block or similar; I'll leave them as is for the moment.

I can follow up in a different PR.
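
As a sketch of that follow-up (the name is hypothetical, and it builds on the illustrative adapter toy near the top of this page, not the PR's actual API), the two primitives could be folded into one helper so callers cannot pair them incorrectly:

impl<'a> ChainStoreUpdateAdapter<'a> {
    // Hypothetical combined helper: writes the header plus the bookkeeping
    // that is "normally clubbed" with it, in a single call.
    fn update_block_header(&mut self, hash: &[u8], header_bytes: &[u8]) {
        self.set_block_header_only(hash, header_bytes);
        // e.g. also advance the header head here, so the two writes can
        // never get out of sync.
        self.store_update.ops.push(("BlockMisc", b"HEADER_HEAD".to_vec(), hash.to_vec()));
    }
}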



Contributor
@staffik left a comment

Looks like save_block_header_no_update_tree, force_save_header_head and save_final_head can be removed now.

Contributor
@robin-near left a comment

Thanks for taking the care to check everything, @shreyan-gupta!

I love the direction this is going. When writing the epoch sync's apply_proof code, I didn't really know what I was doing. I was basically populating whatever fields I found necessary, using whatever methods I could find that would populate those fields, and verifying that the resulting DB state appeared to be consistent (no crashes, the chain progresses, it can help bootstrap another epoch-synced node, etc.). I wasn't even aware of the sync hash behavior; I'm very glad we're moving away from that kind of implicit behavior.

We have Python-based integration tests that exercise epoch sync, but if you could try epoch syncing a mainnet node (it should be easy to do now that epoch sync has been released), this would be good to go for sure.

Self { store_update: StoreUpdateHolder::Reference(store_update) }
}

/// USE THIS FUNCTION WITH CARE; proceed only if you know what you're doing.
Contributor

nit: down the road, nobody is going to "know what they're doing", and it will not be clear what "use with care" means. I suggest warning specifically about what can go wrong if you use it directly. (You already mentioned one thing, but maybe there are more, and maybe what you mentioned can be explained a bit more.)
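
For example (purely suggested wording; the concrete hazards depend on what the function actually does, which isn't shown in this excerpt, so the bullet points below are placeholders to illustrate the shape of the warning):

/// <one-line summary of what the function does>
///
/// Caution: this bypasses the higher-level adapter methods. In particular
/// (illustrative hazards, to be replaced with the real ones):
/// - writes staged here are not visible to reads until the owning
///   `StoreUpdate` is committed, and
/// - dropping the owning update without committing silently discards them.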
