
floresta-chain: new optimized chainstore #251


Open · wants to merge 1 commit into master from new-chainstore

Conversation

Davidson-Souza (Collaborator)

The current chainstore is based on `kv`, but it has a few problems:

  • When we flush, we get a huge heap spike
  • We incur a 2-3x storage overhead on headers
  • Retrieving headers during IBD gets slow if we flush early

This commit introduces a bare-bones, ad hoc store that consists of two parts:

  • An open-addressing, file-backed and memory-mapped hash map that keeps the block_hash -> block_height relation
  • A flat file that contains the serialized block headers, in ascending height order

To recover a header given the block height, we simply use pointer arithmetic inside the flat file (see the sketch below). If we only have the block hash, we use the map first, then find the header inside the flat file. This has a few advantages: we don't need explicit flushes (the OS flushes the maps at fixed intervals), flushes are asynchronous (the OS performs them), we get caching for free (mmap-ed pages stay in memory while we need them), and our cache reacts to system constraints, because the kernel always knows how much memory we still have.
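A minimal sketch of that lookup path. Everything here is illustrative, not the PR's actual code: `FlatStoreSketch`, `headers_mmap` and `index` are made-up names, and the PR's real on-disk record is 116 bytes (header + height + hash), while this sketch assumes bare 80-byte headers for simplicity.

```rust
use bitcoin::block::Header;
use bitcoin::consensus::deserialize;
use bitcoin::BlockHash;

const HEADER_SIZE: usize = 80; // serialized block header size (sketch only)

struct FlatStoreSketch {
    headers_mmap: memmap2::Mmap,                      // flat file of headers, ascending by height
    index: std::collections::HashMap<BlockHash, u32>, // stands in for the mmap-backed hash map
}

impl FlatStoreSketch {
    /// Height -> header is pure pointer arithmetic into the flat file.
    fn header_by_height(&self, height: u32) -> Option<Header> {
        let start = height as usize * HEADER_SIZE;
        let bytes = self.headers_mmap.get(start..start + HEADER_SIZE)?;
        deserialize(bytes).ok()
    }

    /// Hash -> header goes through the map first, then the flat file.
    fn header_by_hash(&self, hash: &BlockHash) -> Option<Header> {
        let height = *self.index.get(hash)?;
        self.header_by_height(height)
    }
}
```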

JoseSK999 (Contributor)

My understanding is that kv buckets also don't need explicit flushes. The data saved in a bucket is flushed to disk by the OS periodically, and it's kept in the inner sled pagecache (so we should have fast access), which is configured with a capacity of 100MB in KvChainStore::new.
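For reference, `kv` is backed by sled, which does flush asynchronously in the background. A sketch of the relevant knobs using sled's own API; only the 100 MB cache figure comes from the discussion above, the flush interval is an illustrative value, not Floresta's actual setting:

```rust
// Sketch: configuring a sled-backed store's in-memory page cache and
// periodic background flushing.
fn open_db(path: &str) -> sled::Result<sled::Db> {
    sled::Config::new()
        .path(path)
        .cache_capacity(100 * 1024 * 1024) // page cache size, in bytes
        .flush_every_ms(Some(500))         // periodic async flush (illustrative interval)
        .open()
}
```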

Davidson-Souza (Collaborator, Author)

That was my understanding too, but for some reason, even before using the cache (second part of #169), it didn't really flush on its own, and if we had an unclean shutdown, we would lose our progress (or a good part of it). It would also become increasingly CPU- and IO-heavy as we made progress; I suspect that's due to the block locator getting bigger as our chain grows.
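For context on the locator point: a block locator is a list of block hashes that is dense near the tip and exponentially sparse toward genesis, so it grows roughly logarithmically with chain height. A sketch of the standard construction, over heights only (illustrative, not Floresta's code):

```rust
// Standard locator spacing: the last ~10 blocks one by one, then the step
// doubles every entry, ending at genesis. At height 800,000 this is only
// ~30 entries, but a locator is rebuilt for every getheaders request.
fn locator_heights(tip: u32) -> Vec<u32> {
    let mut heights = Vec::new();
    let (mut height, mut step) = (tip as i64, 1i64);
    while height > 0 {
        heights.push(height as u32);
        if heights.len() >= 10 {
            step *= 2; // switch from dense to exponential spacing
        }
        height -= step;
    }
    heights.push(0); // always include genesis
    heights
}
```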

But the biggest problem for me, the one I couldn't find an alternative for, was the heap spike. It would always crash on my phone, before or after #169. With this PR, it runs fine!

JoseSK999 (Contributor) commented Oct 8, 2024

> It would also become increasingly CPU- and IO-heavy as we made progress; I suspect that's due to the block locator getting bigger as our chain grows.

Isn't that the expected behavior of a node in IBD, as it moves from old empty blocks to more recent ones?

Also, was the OS not flushing on its own on desktop or on mobile?

Davidson-Souza (Collaborator, Author)

> Isn't that the expected behavior of a node in IBD, as it moves from old empty blocks to more recent ones?

Not this much, at least not for headers. They are small and have a constant size.

> Also, was the OS not flushing on its own on desktop or on mobile?

Both. At least on my setup.

@Davidson-Souza Davidson-Souza force-pushed the new-chainstore branch 7 times, most recently from 4f43d68 to 9961e8c Compare January 2, 2025 22:05
@Davidson-Souza Davidson-Souza changed the title [WIP] floresta-chain: new optimized chainstore floresta-chain: new optimized chainstore Jan 3, 2025
@Davidson-Souza Davidson-Souza marked this pull request as ready for review January 3, 2025 00:26
jaoleal (Contributor) left a comment

Nice changes, here's my superficial review... I still haven't finished.

You're right, this needs a lot of testing and review. It looks like a nice job!

@Davidson-Souza Davidson-Souza force-pushed the new-chainstore branch 2 times, most recently from 9f33428 to 66f70e6 Compare February 11, 2025 16:19
JoseSK999 (Contributor) left a comment

Some initial review, looks good so far! I still have to check flat_chain_store.rs

JoseSK999 (Contributor) left a comment

Mainly nits, but I think I have found an issue with the update_block_index method implementation, as explained below.

@Davidson-Souza Davidson-Souza force-pushed the new-chainstore branch 3 times, most recently from 5175e30 to 0597417 Compare March 5, 2025 17:54
lucad70 (Contributor) commented Mar 21, 2025

I have started a review aiming to measure the heap usage on both master and this PR. However, the tool I am using stops recording as soon as I send a kill signal to florestad (I am using Mac's Instruments; for reference, access this link).

This is a snapshot showing both builds (this PR and master):
[Screenshot: pr_analysis_start_of_node]

I wanted to give more detailed info on shutdown; as I figure out how to do it, here are a few reviewing points:

  • On both builds there is a spike at startup (I'll also look into figuring out why)
  • This PR also improves heap usage overall during startup and regular operation. Check the total bytes column to see the decrease from 5.59 GiB to 5.02 GiB in heap allocations, roughly a 10% improvement overall.

Davidson-Souza (Collaborator, Author)

> I have started a review aiming to measure the heap usage on both master and this PR. However, the tool I am using stops recording as soon as I send a kill signal to florestad (I am using Mac's Instruments; for reference, access this link).
>
> This is a snapshot showing both builds (this PR and master):
> [Screenshot: pr_analysis_start_of_node]
>
> I wanted to give more detailed info on shutdown; as I figure out how to do it, here are a few reviewing points:
>
>   • On both builds there is a spike at startup (I'll also look into figuring out why)
>   • This PR also improves heap usage overall during startup and regular operation. Check the total bytes column to see the decrease from 5.59 GiB to 5.02 GiB in heap allocations, roughly a 10% improvement overall.

Hey @lucad70. Are you building floresta with the --experimental-db feature?

lucad70 (Contributor) commented Mar 24, 2025

> Hey @lucad70. Are you building floresta with the --experimental-db feature?

I am not. Should I be building it with the --experimental-db feature? Maybe I missed it while reading the documentation.

lucad70 (Contributor) commented Mar 24, 2025

> > Hey @lucad70. Are you building floresta with the --experimental-db feature?
>
> I am not. Should I be building it with the --experimental-db feature? Maybe I missed it while reading the documentation.

Okay, I have re-read here and am now testing on my machine. I am also fixing a few mistakes I made while profiling:

  • I used the debug build; I should use release instead
  • I did not use the experimental-db feature
  • I did only one run for each attempt
  • I did not make sure my computer had no other programs running

I am now addressing these issues and will return with more helpful data ASAP.

JoseSK999 (Contributor) left a comment

A few additional suggestions. I still have to review some things more deeply.

lucad70 (Contributor) commented Mar 27, 2025

I have been profiling for a few days now. After acquiring more data, I used ChatGPT to write a simple report. I will also post a few screenshots showing the heap usage profile for analysis as well. Here is the report:

Heap Allocation Analysis for Chainstore Modification

Introduction

This report presents a comparative heap allocation analysis between the current and proposed chainstore implementations. The profiling was performed using Instruments on MacOS, with five runs per test scenario (excluding the most discrepant run). The analysis evaluates heap allocations over different time spans: 2 minutes, 12 minutes, and 6 hours.

The two versions under comparison are:

  • Normal (KV): The existing chainstore implementation.
  • DB: The updated implementation.

The key memory metrics analyzed include:

  • Persistent: Memory allocations still present at the end of execution.
  • # Persistent: Count of persistent allocations.
  • # Transient: Count of transient allocations (allocated and freed within the run).
  • Total Bytes: Total memory allocated during the execution.
  • # Total: Total number of allocations performed.

2-Minute Execution

Normal Implementation

| Run | Persistent (MiB) | # Persistent | # Transient | Total Bytes (GiB) | # Total |
|---|---|---|---|---|---|
| 1 | 209.54 | 1,462,945 | 33,548,373 | 5.59 | 35,011,318 |
| 2 | 207.61 | 1,448,843 | 31,204,050 | 5.21 | 32,652,893 |
| 3 | 207.55 | 1,451,203 | 32,536,063 | 5.42 | 33,987,266 |
| 4 | 207.82 | 1,450,247 | 30,959,519 | 5.18 | 32,409,766 |
| 5 | 207.59 | 1,448,941 | 31,527,780 | 5.26 | 32,976,721 |
| Average | 208.02 | 1,452,435.8 | 31,955,157 | 5.33 | 33,407,592.8 |

DB Implementation

| Run | Persistent (MiB) | # Persistent | # Transient | Total Bytes (MiB) | # Total |
|---|---|---|---|---|---|
| 1 | 268.74 | 660,943 | 153,274 | 579.88 | 814,217 |
| 2 | 206.48 | 485,529 | 154,065 | 420.3 | 639,594 |
| 3 | 206.45 | 485,527 | 159,169 | 421.03 | 644,696 |
| 4 | 206.46 | 485,531 | 152,335 | 419.92 | 637,866 |
| 5 | 272.28 | 679,161 | 161,917 | 598.83 | 841,078 |
| Average | 232.08 | 559,338.2 | 156,152 | 487.99 | 715,490.2 |

Key Observations

  • The DB implementation has significantly fewer persistent allocations (~559k vs. ~1.45M).
  • The transient allocations in DB are ~156k, whereas in Normal they exceed 31M.
  • Total allocated memory in DB is significantly lower (487.99 MiB vs. 5.33 GiB).
  • The DB approach allocates roughly 11x fewer total bytes in this timeframe.

12-Minute Execution

Normal Implementation

| Run | Persistent (MiB) | # Persistent | # Transient | Total Bytes (GiB) | # Total |
|---|---|---|---|---|---|
| 1 | 284.27 | 1,451,352 | 280,032,364 | 37.61 | 281,483,716 |
| 2 | 289.79 | 1,502,171 | 287,634,425 | 38.35 | 289,136,596 |
| 3 | 280.76 | 1,449,592 | 269,995,056 | 36.31 | 271,444,648 |
| 4 | 284.36 | 1,452,638 | 279,944,774 | 37.86 | 281,397,412 |
| 5 | 290.19 | 1,495,210 | 263,835,709 | 35.85 | 265,330,919 |
| Average | 285.87 | 1,470,192.6 | 276,288,465.6 | 37.20 | 277,758,658.2 |

DB Implementation

| Run | Persistent (MiB) | # Persistent | # Transient | Total Bytes (MiB) | # Total |
|---|---|---|---|---|---|
| 1 | 206.52 | 485,542 | 572,618 | 488.46 | 1,058,160 |
| 2 | 272.34 | 679,178 | 529,953 | 658.13 | 1,209,131 |
| 3 | 206.59 | 485,566 | 185,185 | 425.83 | 670,751 |
| 4 | 206.52 | 485,547 | 548,792 | 483.87 | 1,034,339 |
| 5 | 206.55 | 485,576 | 517,334 | 478.63 | 1,002,910 |
| Average | 219.70 | 524,281.8 | 470,776.4 | 506.98 | 995,058.2 |

Key Observations

  • DB’s persistent allocations remain much lower (~524k vs. ~1.47M).
  • The transient allocations in DB (470k) are significantly lower than in Normal (~276M).
  • Total bytes allocated in DB are lower (506.98 MiB vs. 37.2 GiB).
  • DB's transient allocations are ~588x lower than Normal.

6-Hour Execution

DB Implementation Only

| Run | Persistent (MiB) | # Persistent | # Transient | Total Bytes (GiB) | # Total |
|---|---|---|---|---|---|
| 1 | 206.61 | 485,609 | 3,714,025 | 1.28 | 4,199,634 |

Key Observations

  • The DB implementation sustained only ~3.7M transient allocations over 6 hours.
  • The total bytes allocated in this period were only 1.28 GiB, significantly better than Normal.

Overall Findings & Conclusion

  1. DB implementation significantly reduces persistent and transient allocations, preventing heap spikes.
  2. Total bytes allocated in DB are much lower (1.28 GiB over 6 hours vs. 5.33 GiB in 2 min for Normal).
  3. The transient allocations in DB are orders of magnitude lower (~470k vs. ~276M in 12 min).
  4. Normal implementation shows excessive heap usage, leading to inefficiencies.

Recommendation

The DB implementation is vastly superior in terms of heap efficiency. It reduces memory footprint, transient allocations, and overall heap pressure, making it the preferred approach for optimizing chainstore performance.


Here are a few screenshots:
[Screenshot: DB heap profile]
[Screenshot: Normal heap profile]

JoseSK999 (Contributor) left a comment

A couple of other suggestions; I'm closer to a final review.

Davidson-Souza (Collaborator, Author)

Updated with this diff:

```diff
diff --git a/crates/floresta-chain/src/pruned_utreexo/flat_chain_store.rs b/crates/floresta-chain/src/pruned_utreexo/flat_chain_store.rs
index be41a88..d7ec29c 100644
--- a/crates/floresta-chain/src/pruned_utreexo/flat_chain_store.rs
+++ b/crates/floresta-chain/src/pruned_utreexo/flat_chain_store.rs
@@ -22,7 +22,7 @@
 //!
 //! We want to keep the load factor for the hash map as low as possible, while also avoiding
 //! re-hashing. So we pick a fairly big initial space to it, say 10 million buckets. Each bucket is
-//! 4 bytes long, so we have 40 MiB of map. Each [DiskBlockHeader] is 116 bytes long (80 bytes for
+//! 4 bytes long, so we have 40 MiB of map. Each [DiskBlockHeaderAndHash] is 116 bytes long (80 bytes for
 //! header + 4 for height + 32 for hash), so the maximum size for it, assuming 2.5 million headers
 //! (good for the next ~30 years), is 330 MiB. The smallest device I know that can run floresta
 //! has ~250 MiB of RAM, so we could  fit almost everything in memory. Given that newer blocks are
@@ -99,7 +99,6 @@ const UTREEXO_ACC_SIZE: usize = 32 * 64 + 8;
 
 #[derive(Clone)]
 /// Configuration for our flat chain store. See each field for more information
-
 pub struct FlatChainStoreConfig {
     /// The index map size, in buckets
     ///
@@ -137,7 +136,7 @@ pub struct FlatChainStoreConfig {
     /// The size of the fork headers file map, in headers
     ///
     /// This store keeps headers that are not in our main chain, but may be needed sometime. The
-    /// default value is having space for 1000 blocks.
+    /// default value is having space for 10,000 blocks.
     ///
     /// When computing the size in bytes, we will round the number to the nearest power of 2, minus
     /// 1. This lets us do some optimizations like use & instead of %, and use << instead of *.
@@ -346,7 +345,7 @@ pub struct FlatChainStore {
 /// and easy to read. It holds the memory map for the index, and the size of the index
 /// in buckets. It also keeps track of how many buckets are occupied, so we can re-hash
 /// the map when it gets full.
-pub struct BlockIndex {
+struct BlockIndex {
     /// The actual memory map for the index
     index_map: MmapMut,
 
@@ -367,7 +366,7 @@ impl BlockIndex {
     /// This function should only be called by [FlatChainStore::new], and it should never be called
     /// directly. It creates a new block index, given a memory map for the index, the size of the
     /// index in buckets, and how many buckets are occupied.
-    pub(super) fn new(index_map: MmapMut, index_size: usize, occupied: usize) -> Self {
+    fn new(index_map: MmapMut, index_size: usize, occupied: usize) -> Self {
         Self {
             index_map,
             index_size,
@@ -380,7 +379,7 @@ impl BlockIndex {
     /// If we have enough changes that we don't want to lose, we should flush the index map to
     /// disk. This will make sure that the index map is persisted, and we can recover it in case
     /// of a crash.
-    pub(super) fn flush(&self) -> Result<(), FlatChainstoreError> {
+    fn flush(&self) -> Result<(), FlatChainstoreError> {
         self.index_map.flush()?;
 
         Ok(())
@@ -732,7 +731,7 @@ impl FlatChainStore {
             return Err(FlatChainstoreError::NotFound);
         }
 
-        Ok(metadata.acc[0..size].iter().copied().collect())
+        Ok(metadata.acc[0..size].to_vec())
     }
 
     unsafe fn do_save_height(&self, best_block: BestChain) -> Result<(), FlatChainstoreError> {
@@ -868,7 +867,6 @@ impl FlatChainStore {
     }
 
     #[inline(always)]
-    #[must_use]
     #[doc(hidden)]
     fn get_cache_mut(&self) -> Result<MutexGuard<CacheType>, PoisonError<MutexGuard<CacheType>>> {
         self.cache.lock()
diff --git a/florestad/src/florestad.rs b/florestad/src/florestad.rs
index 0d5748c..6a1edfa 100644
--- a/florestad/src/florestad.rs
+++ b/florestad/src/florestad.rs
@@ -706,7 +706,7 @@ impl Florestad {
     #[cfg(feature = "experimental-db")]
     fn load_chain_store(data_dir: String) -> ChainStore {
         let config = FlatChainStoreConfig::new(data_dir + "/chaindata");
-        ChainStore::new(config.clone()).expect("failure while creating chainstate")
+        ChainStore::new(config).expect("failure while creating chainstate")
     }
 
     #[cfg(not(feature = "experimental-db"))]
@@ -715,7 +715,7 @@ impl Florestad {
         network: Network,
         assume_valid: Option<bitcoin::BlockHash>,
     ) -> ChainState<ChainStore<'static>> {
-        let db = ChainStore::new(data_dir.to_string()).expect("Could not read db");
+        let db = ChainStore::new(data_dir.clone()).expect("Could not read db");
         let assume_valid =
             assume_valid.map_or(AssumeValidArg::Hardcoded, AssumeValidArg::UserInput);
 
@@ -723,7 +723,7 @@ impl Florestad {
             Ok(chainstate) => chainstate,
             Err(err) => match err {
                 BlockchainError::ChainNotInitialized => {
-                    let db = ChainStore::new(data_dir.to_string()).expect("Could not read db");
+                    let db = ChainStore::new(data_dir).expect("Could not read db");
 
                     ChainState::<ChainStore>::new(db, network.into(), assume_valid)
                 }
@@ -746,7 +746,7 @@ impl Florestad {
             Ok(chainstate) => chainstate,
             Err(err) => match err {
                 BlockchainError::ChainNotInitialized => {
-                    let db = Self::load_chain_store(data_dir.clone());
+                    let db = Self::load_chain_store(data_dir);
 
                     ChainState::<ChainStore>::new(db, network.into(), assume_valid)
                }
```
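As an aside, the "use & instead of %" optimization mentioned in the doc comment above works because the map capacity is kept at a power of two. A sketch of the idea (illustrative function, not the PR's code):

```rust
// With a power-of-two capacity, the modulo used to pick a bucket reduces
// to a single bitwise AND against capacity - 1.
fn bucket_for(hash: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    hash & (capacity - 1) // equivalent to hash % capacity
}
```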

JoseSK999 (Contributor) left a comment

Mainly changes for readability and clarity, and some suggested comments that I believe are important. My next review should be the last one.

JoseSK999 (Contributor) left a comment

We need to update, persist and correctly load the BlockIndex occupied count.
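One possible shape for that fix, as a sketch: keep the occupied count inside the memory-mapped metadata so it persists and reloads with everything else. Field and function names here are hypothetical, not the PR's final code.

```rust
use memmap2::MmapMut;

#[repr(C)]
struct MetadataSketch {
    occupied: u64, // number of index buckets in use; persisted with the mmap
    // ... other persisted fields
}

/// Bump the persisted counter whenever a new bucket is taken, so a restart
/// reloads the real occupancy instead of assuming zero.
fn bump_occupied(metadata_map: &mut MmapMut) {
    let meta = unsafe { &mut *(metadata_map.as_mut_ptr() as *mut MetadataSketch) };
    meta.occupied += 1;
}
```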

Davidson-Souza (Collaborator, Author) commented Apr 17, 2025

@JoseSK999 I think the problem here is the API for ChainStore. It should mark methods like save_header as taking a mutable reference to self; otherwise we end up in a situation where users can always call them in a thread-unsafe way. If we want to expose those types and allow users to implement their own chainstate, we must make sure this is not possible. For this PR, though, I think we can just leave it as is for now.
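A sketch of the API shape being suggested (simplified trait, not the actual ChainStore definition):

```rust
trait ChainStoreSketch {
    type Error;

    // Taking `&mut self` makes the borrow checker reject unsynchronized
    // concurrent writers at compile time...
    fn save_header(&mut self, header: &[u8]) -> Result<(), Self::Error>;

    // ...while reads can remain shared.
    fn get_header_by_height(&self, height: u32) -> Result<Option<Vec<u8>>, Self::Error>;
}
```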

For the map size, I'm using the fields inside metadata as you suggested. I'm also initializing the Metadata struct when we create a new chainstore.

The current chainstore is based on `kv`, but it has a few problems:
  - When we flush, we get a huge heap spike
  - We incur a 2-3x storage overhead on headers
  - Retrieving headers during IBD gets slow if we flush early

This commit introduces a bare-bones, ad hoc store that consists of
three parts:
  - An open-addressing, file-backed and memory-mapped hash map to keep
    the relation block_hash -> block_height
  - A flat file that contains block headers serialized, in ascending
    order
  - An LRU cache to avoid going through the map every time

To recover a header, given the block height, we simply use pointer
arithmetic inside the flat file. If we need to get from the block hash,
we use the map first, then find the header inside the flat file. This
has the advantage of not needing explicit flushes (the OS will flush it
at fixed intervals), flushes are async (the OS will do them), we get
caching for free (mmap-ed pages will stay in memory if we need them)
and our cache can react to system constraints, because the kernel
always knows how much memory we still have.
Davidson-Souza (Collaborator, Author)

Rebased
