Hierarchical state diffs in hot DB #6750
base: unstable
d16f311 to 168d9f0
I have tested this on the Holesky testnet, as mentioned in #6775. Regular checkpoint sync works, and so does syncing from a checkpoint more than 24 hours old. Some findings:
(I truncated the first line of this log; it runs until slot 3498015.)
The beacon node would not start after a restart:
Another point worth noting is that the logs seem to loop. Similar lines appear in the log file so often that it takes only a few seconds to minutes to fill a 200 MB log file. These are the repeated lines:
(The first line of the above log is truncated.)
The beacon node would not start after a restart:
(Slot 3504704 is the anchor slot from the database info.)
If this error occurs and I switch back to v6.0.1, let it sync for a few minutes, and then switch back to this PR's binary, it can work (the upgrade succeeds) and the database schema shows V23:
However, I notice the
c79ddb3 to 8c9a1b2
This PR is amazing @dapplion, so you don't have to fix the conflicts yet. Please take your time and enjoy the day 💮 ❤️
Co-authored-by: Michael Sproul <[email protected]>
Should address outstanding comments from the OOM PR too:
I've just resolved the merge conflicts after the merge of
This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏
* remove from db
* remove dependent code
* clippy happy
* fixing clippy issues
slot: summary.slot,
latest_block_root: summary.latest_block_root,
epoch_boundary_state_root: state_summaries_dag
    .ancestor_state_root_at_slot(state_root, epoch_boundary_state_slot)
TODO: check that this handles the genesis state/slot
TODO: remove
Related to:
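The TODO above asks whether the ancestor lookup handles the genesis state/slot. A minimal sketch of how such a walk could behave, assuming a hypothetical summary map in which each state records only its slot and the root of the state one slot earlier (`None` at genesis or the split state). The function name mirrors `ancestor_state_root_at_slot` from the snippet, but the body is an illustration, not the PR's code, and `Hash256` is a local `u64` stand-in.

```rust
use std::collections::HashMap;

// Local stand-ins for the real Lighthouse types (assumption for this sketch).
type Hash256 = u64;
type Slot = u64;

/// Hypothetical minimal summary: the slot of a state and the root of the
/// state one slot earlier (`None` for genesis or the split state).
struct Summary {
    slot: Slot,
    previous_state_root: Option<Hash256>,
}

/// Follow `previous_state_root` links back from `state_root` until reaching
/// the state at `target_slot`. Errors if the chain is broken or too short.
fn ancestor_state_root_at_slot(
    summaries: &HashMap<Hash256, Summary>,
    state_root: Hash256,
    target_slot: Slot,
) -> Result<Hash256, String> {
    let mut current = state_root;
    loop {
        let summary = summaries
            .get(&current)
            .ok_or_else(|| format!("missing summary for state {current}"))?;
        // Check the slot before looking at the parent, so a request for the
        // genesis slot succeeds when we land on the genesis state itself.
        if summary.slot == target_slot {
            return Ok(current);
        }
        if summary.slot < target_slot {
            return Err(format!("no state at slot {target_slot}; reached slot {}", summary.slot));
        }
        // Genesis (or the split state) has no parent: stop with an error.
        current = summary
            .previous_state_root
            .ok_or_else(|| format!("no parent below slot {}", summary.slot))?;
    }
}
```

Checking the slot before following the parent link means a lookup for slot 0 terminates at the genesis state instead of reading a missing parent.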
.and_then(|parent_block_summaries| {
    parent_block_summaries.get(&previous_slot)
})
.map_or(Hash256::ZERO, |(parent_state_root, _)| **parent_state_root)
Rather than 0x0, we could use an enum here. I think we would also need this inside the HotStateSummary.
Another option would be to distinguish nodes that we don't expect to have a parent state (the split state) from nodes that should have one (everything else).
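A minimal sketch of the two suggestions above: an enum instead of the 0x0 sentinel, and a separate case for the split state so that a missing parent anywhere else becomes an error. `PreviousStateRoot` and `classify_previous_state_root` are hypothetical names, not from the PR, and `Hash256` is a local 32-byte stand-in.

```rust
/// Local stand-in for Lighthouse's 32-byte `Hash256` (assumption for this sketch).
type Hash256 = [u8; 32];

/// Hypothetical replacement for storing `Hash256::ZERO` as "no parent".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreviousStateRoot {
    /// The split state: no parent state is expected in the hot DB.
    SplitState,
    /// Any other state: the root of the state one slot earlier.
    Known(Hash256),
}

/// Classify a node. A non-split state without a parent is reported as an
/// error instead of silently being stored as 0x0.
fn classify_previous_state_root(
    is_split_state: bool,
    parent_state_root: Option<Hash256>,
) -> Result<PreviousStateRoot, String> {
    if is_split_state {
        return Ok(PreviousStateRoot::SplitState);
    }
    parent_state_root
        .map(PreviousStateRoot::Known)
        .ok_or_else(|| "non-split state is missing its parent state root".to_string())
}
```

Keeping the error case in a `Result` rather than a third enum variant means a broken DAG surfaces where the summary is built, not later when a zero root is dereferenced.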
Changed the
pub struct DAGStateSummary {
    ...
    pub previous_state_root: Result<Hash256, String>,
}
Idea: if migration fails because the DAG of states is broken, could we just drop all unfinalized data? Basically, drop the entire fork choice and prune the blocks from the DB. Then we have an infallible migration. Assuming we have the best possible code, it's better to make a few unlucky users re-sync than to leave them stuck. We have seen that the chance of the DB having a broken DAG is low, and not everyone will update at the same time.
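A minimal sketch of the fallback proposed above: attempt the migration, and if the state DAG turns out to be broken, drop unfinalized data instead of failing. `MigrationError`, `MigrationOutcome` and `migrate_with_fallback` are hypothetical names; the two closures stand in for the real migration and pruning routines, which are not shown here.

```rust
/// Hypothetical error from building the DAG of hot state summaries.
#[derive(Debug)]
enum MigrationError {
    BrokenStateDag(String),
}

/// Hypothetical result of the schema migration.
#[derive(Debug, PartialEq, Eq)]
enum MigrationOutcome {
    Migrated { summaries_copied: usize },
    DroppedUnfinalizedData,
}

/// Run `try_migrate`; on a broken DAG, call `drop_unfinalized` (e.g. reset
/// fork choice and prune unfinalized blocks) so the migration never fails.
fn migrate_with_fallback<F, G>(try_migrate: F, drop_unfinalized: G) -> MigrationOutcome
where
    F: FnOnce() -> Result<usize, MigrationError>,
    G: FnOnce(),
{
    match try_migrate() {
        Ok(summaries_copied) => MigrationOutcome::Migrated { summaries_copied },
        Err(MigrationError::BrokenStateDag(reason)) => {
            eprintln!("state DAG is broken ({reason}); dropping unfinalized data");
            drop_unfinalized();
            MigrationOutcome::DroppedUnfinalizedData
        }
    }
}
```

The node then has to sync the unfinalized portion again, which is the trade-off the comment accepts in exchange for never getting stuck.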
More thoughts: during the migration we need to copy the summaries to the new DB column in the new format. The new format requires adding:
We don't need to copy ALL hot summaries, only those needed to preserve the invariant:
IDEA: We don't need to copy all summaries. We can ignore the summaries for advanced states and for non-descendants of the finalized block. We can also over-copy, and future pruning rounds will get rid of the extras. Where do we get previous_state_root from?
Filter to reduce chance of bugs
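A sketch of the descendant half of the filtering idea above: collect every block that descends from (or is) the finalized block, then copy only the summaries whose latest block is in that set. The helpers `descendants_of` and `summaries_to_copy`, the child-to-parent map used as input, and the `u64` stand-in for `Hash256` are assumptions for illustration. Advanced states would need a second filter, along the lines of the `is_advanced_state` question that follows.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Local stand-in for Lighthouse's `Hash256` (assumption for this sketch).
type Hash256 = u64;

/// Return every block root that is `finalized_root` or descends from it,
/// given a map from each block root to its parent root.
fn descendants_of(finalized_root: Hash256, parents: &HashMap<Hash256, Hash256>) -> HashSet<Hash256> {
    // Invert parent links into child lists so we can walk downwards.
    let mut children: HashMap<Hash256, Vec<Hash256>> = HashMap::new();
    for (&child, &parent) in parents {
        children.entry(parent).or_default().push(child);
    }
    // Breadth-first search from the finalized block.
    let mut keep = HashSet::from([finalized_root]);
    let mut queue = VecDeque::from([finalized_root]);
    while let Some(root) = queue.pop_front() {
        for &child in children.get(&root).into_iter().flatten() {
            if keep.insert(child) {
                queue.push_back(child);
            }
        }
    }
    keep
}

/// Keep the state roots of summaries `(state_root, latest_block_root)` whose
/// latest block descends from the finalized block.
fn summaries_to_copy(summaries: &[(Hash256, Hash256)], keep: &HashSet<Hash256>) -> Vec<Hash256> {
    summaries
        .iter()
        .filter(|(_, block_root)| keep.contains(block_root))
        .map(|(state_root, _)| *state_root)
        .collect()
}
```

Over-copying is harmless here, as noted above, so the filter only needs to be conservative: it must never drop a summary that is still needed.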
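Is this a correct way of detecting advanced states?
use std::collections::HashMap;
// Stand-ins for the real Lighthouse types, so the snippet compiles on its own.
type Hash256 = [u8; 32];
type Slot = u64;
struct Block {
    slot: Slot,
    highest_child: Option<Slot>, // slot of the block's highest-slot child, if any
}
struct StateSummary {
    slot: Slot,
    latest_block_root: Hash256,
}
// `blocks` replaces the hypothetical `get_block` lookup; indexing panics if the block is missing.
fn is_advanced_state(state_summary: &StateSummary, blocks: &HashMap<Hash256, Block>) -> bool {
    let block = &blocks[&state_summary.latest_block_root];
    match block.highest_child {
        Some(highest_child) => state_summary.slot > highest_child,
        None => state_summary.slot > block.slot,
    }
}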
This is nice because it lets us avoid recomputing diffs: we can just copy them.
Proposed Changes
This PR implements #5978 (tree-states), but for the hot DB. It allows Lighthouse to massively reduce its disk footprint during periods of non-finality, and its overall I/O in all cases.
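For readers new to hierarchical diffs, a simplified model of the layering idea: each layer repeats every 2^e slots, the coarsest layer stores full snapshots, each finer layer stores a diff against the most recent boundary of the next-coarser layer, and any other slot is rebuilt by replaying blocks. The `StorageStrategy` enum and `storage_strategy` function are hypothetical, and the exponents in the usage note are example values; this is a sketch of the concept, not Lighthouse's actual code.

```rust
/// How a state at a given slot would be stored under a hierarchical-diff
/// layout (simplified model for illustration).
#[derive(Debug, PartialEq, Eq)]
enum StorageStrategy {
    /// Full state snapshot.
    Snapshot,
    /// Diff against the state at the given slot (a coarser-layer boundary).
    DiffFrom(u64),
    /// Nothing stored: replay blocks on top of the state at the given slot.
    ReplayFrom(u64),
}

/// `exponents` must be non-empty and strictly increasing; layer `i` repeats
/// every `2^exponents[i]` slots, and the last layer holds snapshots.
fn storage_strategy(slot: u64, exponents: &[u32]) -> StorageStrategy {
    let last = exponents.len() - 1;
    // Check the coarsest layer first so each slot lands in its highest layer.
    for i in (0..exponents.len()).rev() {
        let period = 1u64 << exponents[i];
        if slot % period == 0 {
            if i == last {
                return StorageStrategy::Snapshot;
            }
            // Base is the most recent boundary of the next-coarser layer.
            let parent_period = 1u64 << exponents[i + 1];
            return StorageStrategy::DiffFrom(slot / parent_period * parent_period);
        }
    }
    // Not on any layer boundary: replay from the finest layer's last boundary.
    let finest = 1u64 << exponents[0];
    StorageStrategy::ReplayFrom(slot / finest * finest)
}
```

For example, with exponents `[5, 9, 11, 13, 16, 18, 21]`, slot 33 replays from slot 32, slot 64 is a diff from slot 0, and slot 2^21 is a snapshot. Because every diff points at a coarser layer, rebuilding any state touches at most one snapshot plus one diff per layer.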
Closes #6580
Conga into #6744
TODOs
TODO Good optimization for future PRs
TODO Maybe things good to have
NOTTODO
descendants_of_checkpoint to filter only the state summaries that descend from the finalized checkpoint
Additional Info
Original WIP PR with lots of context and discussion
Some resources to get up to speed with HDiff concepts: