
feat: file-system based block storage #1825


Draft

yarikbratashchuk wants to merge 10 commits into v0.38.x-celestia from yarik/filebased-blockstore

Conversation

yarikbratashchuk
Contributor

@yarikbratashchuk yarikbratashchuk commented May 8, 2025

WIP

Closes #1812

@yarikbratashchuk yarikbratashchuk self-assigned this May 8, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in core/app May 8, 2025
@yarikbratashchuk yarikbratashchuk changed the title from "feat: Implement FileBlockStore for file-system based block storage" to "feat: file-system based block storage" May 8, 2025
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch 4 times, most recently from f75a9ce to 7390d94 on May 9, 2025 09:33
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 7390d94 to f31baa7 on May 12, 2025 14:21
@evan-forbes
Member

to move conversations from the sprint here:
We saw a larger-than-expected decrease in performance. The follow-up is to get a better idea of where, why, and by how much; then we can decide whether we want to move forward with this feature.

@adlerjohn this PR isn't ready for review, but is it okay for the performance to be 5-10x lower? If not, what overhead would be acceptable to move forward with the feature?

ref #1815

@adlerjohn
Member

If this PR is coupled with #1815, and recent blocks use the current blockstore while ancient blocks are migrated to the file-based store, then the performance degradation shouldn't matter much since it's just for ancient blocks? Not sure though.
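
A minimal sketch of what such a tiered store could look like, with all names hypothetical and assuming both stores share a read interface:

```go
package store

// Import path assumes upstream cometbft; adjust for the fork.
import "github.com/cometbft/cometbft/types"

// BlockSource is an assumed interface covering the read path both stores share.
type BlockSource interface {
	LoadBlock(height int64) *types.Block
}

// TieredBlockStore serves recent heights from the current DB-backed store
// and migrated (ancient) heights from the file-based store.
type TieredBlockStore struct {
	recent  BlockSource // current KV/DB-backed store
	ancient BlockSource // file-based store for migrated blocks
	cutoff  int64       // heights <= cutoff have been migrated
}

func (t *TieredBlockStore) LoadBlock(height int64) *types.Block {
	if height <= t.cutoff {
		return t.ancient.LoadBlock(height)
	}
	return t.recent.LoadBlock(height)
}
```

With that split, the slower file-based path is only hit for heights behind the cutoff, which is the case described above.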

@yarikbratashchuk
Contributor Author

Benchmark comparison against the existing KV store implementation:

| Operation | Implementation | Ops/sec | Time/op (ns) | Memory/op (B) | Allocs/op |
|---|---|---:|---:|---:|---:|
| SaveBlock | File | 2,756 | 47,716 | 19,234 | 297 |
| SaveBlock | DB | 23,280 | 45,117 | 16,309 | 257 |
| SaveBlockWithExtendedCommit | File | 3,188 | 438,360 | 19,242 | 297 |
| SaveBlockWithExtendedCommit | DB | 28,812 | 42,884 | 16,311 | 257 |
| LoadBlock | File | 20,575 | 55,799 | 6,445 | 73 |
| LoadBlock | DB | 88,410 | 12,127 | 5,578 | 67 |
| LoadBlockMeta | File | 41,584 | 24,547 | 2,394 | 26 |
| LoadBlockMeta | DB | 305,709 | 3,299 | 1,432 | 20 |
| LoadBlockPart | File | 97,412 | 15,113 | 1,711 | 14 |
| LoadBlockPart | DB | 780,604 | 1,680 | 736 | 8 |
| LoadBlockCommit | File | 42,385,448 | 28.45 | 0 | 0 |
| LoadBlockCommit | DB | 5,927,785 | 193.8 | 128 | 1 |
| LoadBlockExtendedCommit | File | 43,881,190 | 26.45 | 0 | 0 |
| LoadBlockExtendedCommit | DB | 5,420,204 | 184.5 | 240 | 2 |
| PruneBlocks | File | 3,772,339 | 350.2 | 80 | 2 |
| PruneBlocks | DB | 4,488,566 | 417.7 | 81 | 2 |
| DeleteLatestBlock | File | 7,120,083 | 152.8 | 40 | 2 |
| DeleteLatestBlock | DB | 263,006 | 4,026 | 1,262 | 24 |

@adlerjohn
Member

Oh wow, interesting stats. First, we definitely shouldn't merge this PR as-is.

On to the results. It looks like saving and loading blocks takes longer, but deleting blocks is much faster. That second part is a good sign! I'd guess that with some optimizations, the time to save and load blocks can be brought down significantly. There's almost certainly some low-hanging, suboptimal code there.

@tzdybal
Member

tzdybal commented May 14, 2025

As we can assume SSD operation, we should probably make writes more parallel (each file in a separate goroutine?) - another bonus is parallelization of the serialization process for better CPU utilization. Depending on the number of files and their sizes, this might result in better performance.
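
A minimal sketch of the goroutine-per-file idea, assuming parts are already serialized to byte slices (the helper name and file layout are illustrative, not this PR's actual format):

```go
package store

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sync/errgroup"
)

// savePartsParallel writes each block part to its own file concurrently.
func savePartsParallel(dir string, parts [][]byte) error {
	var g errgroup.Group
	for i, p := range parts {
		i, p := i, p // capture loop variables (pre-Go 1.22)
		g.Go(func() error {
			path := filepath.Join(dir, fmt.Sprintf("part_%d", i))
			return os.WriteFile(path, p, 0o644)
		})
	}
	return g.Wait() // returns the first error, if any
}
```

If serialization is also moved inside the goroutines, the marshalling work spreads across cores, which is the CPU-utilization bonus mentioned above.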

@tzdybal
Member

tzdybal commented May 14, 2025

Out of curiosity I ran the benchmark - results below:

```
BenchmarkBlockStore_SaveBlock
BenchmarkBlockStore_SaveBlock/file
BenchmarkBlockStore_SaveBlock/file-16                                 9576      128822 ns/op
BenchmarkBlockStore_SaveBlock/db
BenchmarkBlockStore_SaveBlock/db-16                                  29248       41971 ns/op
BenchmarkBlockStore_SaveBlockWithExtendedCommit
BenchmarkBlockStore_SaveBlockWithExtendedCommit/file
BenchmarkBlockStore_SaveBlockWithExtendedCommit/file-16               9808      122564 ns/op
BenchmarkBlockStore_SaveBlockWithExtendedCommit/db
BenchmarkBlockStore_SaveBlockWithExtendedCommit/db-16                28830       41520 ns/op
BenchmarkBlockStore_LoadBlock
BenchmarkBlockStore_LoadBlock/file
BenchmarkBlockStore_LoadBlock/file-16                                67908       17675 ns/op
BenchmarkBlockStore_LoadBlock/db
BenchmarkBlockStore_LoadBlock/db-16                                 208260        5553 ns/op
BenchmarkBlockStore_LoadBlockMeta
BenchmarkBlockStore_LoadBlockMeta/file
BenchmarkBlockStore_LoadBlockMeta/file-16                           151254        7661 ns/op
BenchmarkBlockStore_LoadBlockMeta/db
BenchmarkBlockStore_LoadBlockMeta/db-16                             763065        1546 ns/op
BenchmarkBlockStore_LoadBlockPart
BenchmarkBlockStore_LoadBlockPart/file
BenchmarkBlockStore_LoadBlockPart/file-16                           190332        6447 ns/op
BenchmarkBlockStore_LoadBlockPart/db
BenchmarkBlockStore_LoadBlockPart/db-16                            1606755       737.9 ns/op
BenchmarkBlockStore_LoadBlockCommit
BenchmarkBlockStore_LoadBlockCommit/file
BenchmarkBlockStore_LoadBlockCommit/file-16                       27788437       42.28 ns/op
BenchmarkBlockStore_LoadBlockCommit/db
BenchmarkBlockStore_LoadBlockCommit/db-16                         11688376       102.8 ns/op
BenchmarkBlockStore_LoadBlockExtendedCommit
BenchmarkBlockStore_LoadBlockExtendedCommit/file
BenchmarkBlockStore_LoadBlockExtendedCommit/file-16               27693708       41.11 ns/op
BenchmarkBlockStore_LoadBlockExtendedCommit/db
BenchmarkBlockStore_LoadBlockExtendedCommit/db-16                  8139055       148.1 ns/op
BenchmarkBlockStore_PruneBlocks
BenchmarkBlockStore_PruneBlocks/file
BenchmarkBlockStore_PruneBlocks/file-16                            4628271       242.3 ns/op
BenchmarkBlockStore_PruneBlocks/db
BenchmarkBlockStore_PruneBlocks/db-16                              4835184       250.0 ns/op
BenchmarkBlockStore_DeleteLatestBlock
BenchmarkBlockStore_DeleteLatestBlock/file
BenchmarkBlockStore_DeleteLatestBlock/file-16                      9638091       123.4 ns/op
BenchmarkBlockStore_DeleteLatestBlock/db
BenchmarkBlockStore_DeleteLatestBlock/db-16                         587707        2064 ns/op
PASS
```

I would add: NVMe, XFS.

Differences are smaller (~3x vs ~10x for SaveBlock), but still significant.

Review comment on the benchmark setup (excerpt):

```go
// setupBlockStore creates a new block store for benchmarking
func setupBlockStore(b *testing.B, storeType string) (sm.State, interface{}, func()) {
	config := test.ResetTestRoot("block_store_bench")
	stateStore := sm.NewStore(dbm.NewMemDB(), sm.StoreOptions{
```

For this test a file-based DB should be used; otherwise we're comparing saving to files vs saving to memory.
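
A sketch of that suggestion: back the comparison store with cometbft-db's on-disk GoLevelDB backend instead of a MemDB, so both sides of the benchmark hit the filesystem (config and b come from the setup above):

```go
// Error handling kept minimal; DiscardABCIResponses shown with its default.
db, err := dbm.NewDB("blockstore_bench", dbm.GoLevelDBBackend, config.DBDir())
if err != nil {
	b.Fatal(err)
}
stateStore := sm.NewStore(db, sm.StoreOptions{DiscardABCIResponses: false})
```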

@Wondertan
Member

We should not expect reads and writes to be faster with file-based storage. Opening an FD takes a while. This is one of the reasons DBs exist and prefer to open a single/limited set of files, like SQLite does.

@tzdybal
Member

tzdybal commented May 14, 2025

I agree that we shouldn't expect files to be much faster than a DB - there are overheads from the OS and filesystem, and from the DB engine (which also needs to use the OS and filesystem underneath).
To some extent the benchmark tests the implementation of some fancy trees in the DB/KV engine vs. other fancy trees in the filesystem 🌲

Very good read from SQLite - it also reminds us to keep internal fragmentation in mind.

@tzdybal
Member

tzdybal commented May 14, 2025

And one more concern - it seems that there is a lot of redundancy (this is especially important for #1815):

  • block parts are the same thing as the entire block, only chunked
  • the commit is already included in the block (and block parts)
  • block meta is also quite redundant

For reading speed (if we would like to replace the KV-based store with a file store) the redundancy might make sense; for archiving, it should be minimized. From a performance perspective it would be interesting to see how it would behave if only the raw block were stored.
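
A minimal sketch of the raw-block-only direction, deriving parts on demand via the types API instead of persisting them separately (the helper is illustrative; whether recomputation is cheap enough is exactly what a benchmark would have to show):

```go
// loadBlockPart rebuilds the part set from a block loaded from the raw file
// and returns the requested part, instead of reading a stored part file.
func loadBlockPart(block *types.Block, index int) (*types.Part, error) {
	ps, err := block.MakePartSet(types.BlockPartSizeBytes)
	if err != nil {
		return nil, err
	}
	return ps.GetPart(index), nil
}
```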

@adlerjohn
Member

> For reading speed (if we would like to replace the KV-based store with a file store) the redundancy might make sense; for archiving, it should be minimized. From a performance perspective it would be interesting to see how it would behave if only the raw block were stored.

Part of #1815 is the intuition that recent and non-recent blocks might have different performance needs.

@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from d001305 to 4824028 on May 14, 2025 15:16
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 4824028 to 3b68255 on May 14, 2025 16:44
@evan-forbes
Member

evan-forbes commented May 14, 2025

> Differences are smaller (~3x vs ~10x for SaveBlock), but still significant.

3x is more in line with what I would expect, and seems doable for an "ancient blockstore".

we could make an even larger tradeoff if we compressed the blocks

nice digging @tzdybal
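
A minimal sketch of the compression idea using the standard library's gzip (a real implementation might prefer zstd or snappy; the helper is illustrative and blockBz is the already-serialized block):

```go
package store

import (
	"compress/gzip"
	"os"
)

// writeCompressed gzips the serialized block on its way to disk, trading
// CPU time for smaller files.
func writeCompressed(path string, blockBz []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	if _, err := zw.Write(blockBz); err != nil {
		return err
	}
	return zw.Close() // flushes any buffered compressed data
}
```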

@yarikbratashchuk
Contributor Author

@tzdybal I'm still getting some crazy SaveBlock times. What would you suggest for improvement?

@yarikbratashchuk yarikbratashchuk requested a review from tzdybal May 15, 2025 18:54
@tzdybal
Member

tzdybal left a comment

I started experimenting by increasing the block size, setting the number of txs to 500, 1000, etc. This seemingly made the results almost even for both implementations. But the results are also skewed by the generation of test data - its cost is quite significant. It's worth adding b.StopTimer() before the createTestingBlock call and b.StartTimer() after it.

Also, the transactions are very small, which makes the resulting blocks also very small. They probably won't even exercise some of the fancy logic related to buffering. Another quick change - using Tx%01000d as the format makes blocks way bigger, and this again changes the results significantly. But this is also very specific data - its entropy is very low and its compression ratio is very high (~20x), which may impact the results.

In general, I think test blocks should be created in a way that simulates something real - we can look at the average block size on mainnet, and the data should probably be randomized to some extent (to introduce more entropy and make compression less effective).

To make sure that the time measurement is correct, you also need to ensure that closing the DB is included in the measurement - otherwise data might still be in buffers, not flushed to disk, which impacts the results.

I would also opt for testing multiple blocks per benchmark "op" - for example (see the sketch after this list):

  1. Start Timer
  2. Save 1000 blocks.
  3. Close DB to flush/sync DB to disk.
  4. Stop timer.
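
A minimal sketch of that shape, assuming hypothetical helpers newFileBlockStore, createTestingBlock (with randomized txs), and seenCommitFor:

```go
func BenchmarkSaveBlocks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		bs := newFileBlockStore(b) // fresh store per benchmark op
		blocks := make([]*types.Block, 1000)
		for h := range blocks {
			blocks[h] = createTestingBlock(int64(h + 1))
		}
		b.StartTimer()

		for _, blk := range blocks {
			ps, err := blk.MakePartSet(types.BlockPartSizeBytes)
			if err != nil {
				b.Fatal(err)
			}
			bs.SaveBlock(blk, ps, seenCommitFor(blk))
		}
		// Close inside the timed region so buffered writes are flushed/synced.
		if err := bs.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```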

@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 2720e86 to 9021048 on May 16, 2025 14:29
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues:

Use file-based block store (#1812)
5 participants