
feat: file-system based block storage #1825


Draft

yarikbratashchuk wants to merge 10 commits into v0.38.x-celestia from yarik/filebased-blockstore

Conversation

yarikbratashchuk
Contributor

@yarikbratashchuk yarikbratashchuk commented May 8, 2025

WIP

Closes #1812

@yarikbratashchuk yarikbratashchuk self-assigned this May 8, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in core/app May 8, 2025
@yarikbratashchuk yarikbratashchuk changed the title from "feat: Implement FileBlockStore for file-system based block storage" to "feat: file-system based block storage" May 8, 2025
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch 4 times, most recently from f75a9ce to 7390d94 on May 9, 2025 09:33
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 7390d94 to f31baa7 on May 12, 2025 14:21
@evan-forbes
Member

to move conversations from the sprint here:
We saw a larger-than-expected decrease in performance. The follow-up is to get a better idea of where, why, and by how much; then we can decide whether we want to move forward with this feature.

@adlerjohn this PR isn't ready for review, but is it okay for the performance to be 5-10x lower? If not, what overhead would be acceptable to move forward with the feature?

ref #1815

@adlerjohn
Member

If this PR is coupled with #1815, and recent blocks use the current blockstore while ancient blocks are migrated to the file-based store, then the performance degradation shouldn't matter much since it's just for ancient blocks? Not sure though.
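
A minimal sketch of what such a tiered store could look like, with all names hypothetical and assuming both stores share a read interface:

```go
package store

// Import path assumes upstream cometbft; adjust for the fork.
import "github.com/cometbft/cometbft/types"

// BlockSource is an assumed interface covering the read path both stores share.
type BlockSource interface {
	LoadBlock(height int64) *types.Block
}

// TieredBlockStore serves recent heights from the current DB-backed store
// and migrated (ancient) heights from the file-based store.
type TieredBlockStore struct {
	recent  BlockSource // current KV/DB-backed store
	ancient BlockSource // file-based store for migrated blocks
	cutoff  int64       // heights <= cutoff have been migrated
}

func (t *TieredBlockStore) LoadBlock(height int64) *types.Block {
	if height <= t.cutoff {
		return t.ancient.LoadBlock(height)
	}
	return t.recent.LoadBlock(height)
}
```

With that split, the slower file-based path is only hit for heights behind the cutoff, which is the case described above.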

@yarikbratashchuk
Contributor Author

Benchmark comparison against the existing KV store implementation:

| Operation | Implementation | Ops/sec | Time/op (ns) | Memory/op (B) | Allocs/op |
|---|---|---:|---:|---:|---:|
| SaveBlock | File | 2,756 | 47,716 | 19,234 | 297 |
| SaveBlock | DB | 23,280 | 45,117 | 16,309 | 257 |
| SaveBlockWithExtendedCommit | File | 3,188 | 438,360 | 19,242 | 297 |
| SaveBlockWithExtendedCommit | DB | 28,812 | 42,884 | 16,311 | 257 |
| LoadBlock | File | 20,575 | 55,799 | 6,445 | 73 |
| LoadBlock | DB | 88,410 | 12,127 | 5,578 | 67 |
| LoadBlockMeta | File | 41,584 | 24,547 | 2,394 | 26 |
| LoadBlockMeta | DB | 305,709 | 3,299 | 1,432 | 20 |
| LoadBlockPart | File | 97,412 | 15,113 | 1,711 | 14 |
| LoadBlockPart | DB | 780,604 | 1,680 | 736 | 8 |
| LoadBlockCommit | File | 42,385,448 | 28.45 | 0 | 0 |
| LoadBlockCommit | DB | 5,927,785 | 193.8 | 128 | 1 |
| LoadBlockExtendedCommit | File | 43,881,190 | 26.45 | 0 | 0 |
| LoadBlockExtendedCommit | DB | 5,420,204 | 184.5 | 240 | 2 |
| PruneBlocks | File | 3,772,339 | 350.2 | 80 | 2 |
| PruneBlocks | DB | 4,488,566 | 417.7 | 81 | 2 |
| DeleteLatestBlock | File | 7,120,083 | 152.8 | 40 | 2 |
| DeleteLatestBlock | DB | 263,006 | 4,026 | 1,262 | 24 |

@adlerjohn
Member

Oh wow, interesting stats. First, we definitely shouldn't merge this PR as-is.

On to the results. It looks like saving and loading blocks takes longer, but deleting blocks is much faster. That second part is a good sign! I'd guess that with some optimizations, the time to save and load blocks can be brought down significantly. There's almost certainly some low-hanging, suboptimal code there.

@tzdybal
Member

tzdybal commented May 14, 2025

As we can assume SSD operation, we should probably make writes more parallel (each file in a separate goroutine?) - another bonus is parallelization of the serialization process for better CPU utilization. Depending on the number of files and their sizes, this might result in better performance.
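
A minimal sketch of the goroutine-per-file idea, assuming parts are already serialized to byte slices (the helper name and file layout are illustrative, not this PR's actual format):

```go
package store

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sync/errgroup"
)

// savePartsParallel writes each block part to its own file concurrently.
func savePartsParallel(dir string, parts [][]byte) error {
	var g errgroup.Group
	for i, p := range parts {
		i, p := i, p // capture loop variables (pre-Go 1.22)
		g.Go(func() error {
			path := filepath.Join(dir, fmt.Sprintf("part_%d", i))
			return os.WriteFile(path, p, 0o644)
		})
	}
	return g.Wait() // returns the first error, if any
}
```

If serialization is also moved inside the goroutines, the marshalling work spreads across cores, which is the CPU-utilization bonus mentioned above.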

@tzdybal
Member

tzdybal commented May 14, 2025

Out of curiosity I ran the benchmark - results below:

```
BenchmarkBlockStore_SaveBlock
BenchmarkBlockStore_SaveBlock/file
BenchmarkBlockStore_SaveBlock/file-16                                 9576      128822 ns/op
BenchmarkBlockStore_SaveBlock/db
BenchmarkBlockStore_SaveBlock/db-16                                  29248       41971 ns/op
BenchmarkBlockStore_SaveBlockWithExtendedCommit
BenchmarkBlockStore_SaveBlockWithExtendedCommit/file
BenchmarkBlockStore_SaveBlockWithExtendedCommit/file-16               9808      122564 ns/op
BenchmarkBlockStore_SaveBlockWithExtendedCommit/db
BenchmarkBlockStore_SaveBlockWithExtendedCommit/db-16                28830       41520 ns/op
BenchmarkBlockStore_LoadBlock
BenchmarkBlockStore_LoadBlock/file
BenchmarkBlockStore_LoadBlock/file-16                                67908       17675 ns/op
BenchmarkBlockStore_LoadBlock/db
BenchmarkBlockStore_LoadBlock/db-16                                 208260        5553 ns/op
BenchmarkBlockStore_LoadBlockMeta
BenchmarkBlockStore_LoadBlockMeta/file
BenchmarkBlockStore_LoadBlockMeta/file-16                           151254        7661 ns/op
BenchmarkBlockStore_LoadBlockMeta/db
BenchmarkBlockStore_LoadBlockMeta/db-16                             763065        1546 ns/op
BenchmarkBlockStore_LoadBlockPart
BenchmarkBlockStore_LoadBlockPart/file
BenchmarkBlockStore_LoadBlockPart/file-16                           190332        6447 ns/op
BenchmarkBlockStore_LoadBlockPart/db
BenchmarkBlockStore_LoadBlockPart/db-16                            1606755       737.9 ns/op
BenchmarkBlockStore_LoadBlockCommit
BenchmarkBlockStore_LoadBlockCommit/file
BenchmarkBlockStore_LoadBlockCommit/file-16                       27788437       42.28 ns/op
BenchmarkBlockStore_LoadBlockCommit/db
BenchmarkBlockStore_LoadBlockCommit/db-16                         11688376       102.8 ns/op
BenchmarkBlockStore_LoadBlockExtendedCommit
BenchmarkBlockStore_LoadBlockExtendedCommit/file
BenchmarkBlockStore_LoadBlockExtendedCommit/file-16               27693708       41.11 ns/op
BenchmarkBlockStore_LoadBlockExtendedCommit/db
BenchmarkBlockStore_LoadBlockExtendedCommit/db-16                  8139055       148.1 ns/op
BenchmarkBlockStore_PruneBlocks
BenchmarkBlockStore_PruneBlocks/file
BenchmarkBlockStore_PruneBlocks/file-16                            4628271       242.3 ns/op
BenchmarkBlockStore_PruneBlocks/db
BenchmarkBlockStore_PruneBlocks/db-16                              4835184       250.0 ns/op
BenchmarkBlockStore_DeleteLatestBlock
BenchmarkBlockStore_DeleteLatestBlock/file
BenchmarkBlockStore_DeleteLatestBlock/file-16                      9638091       123.4 ns/op
BenchmarkBlockStore_DeleteLatestBlock/db
BenchmarkBlockStore_DeleteLatestBlock/db-16                         587707        2064 ns/op
PASS
```

I would add: NVMe, XFS.

Differences are smaller (~3x vs ~10x for SaveBlock), but still significant.

Review comment on the benchmark setup (excerpt):

```go
// setupBlockStore creates a new block store for benchmarking
func setupBlockStore(b *testing.B, storeType string) (sm.State, interface{}, func()) {
	config := test.ResetTestRoot("block_store_bench")
	stateStore := sm.NewStore(dbm.NewMemDB(), sm.StoreOptions{
```

For this test a file-based DB should be used; otherwise we're comparing saving to files vs saving to memory.
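
A sketch of that suggestion: back the comparison store with cometbft-db's on-disk GoLevelDB backend instead of a MemDB, so both sides of the benchmark hit the filesystem (config and b come from the setup above):

```go
// Error handling kept minimal; DiscardABCIResponses shown with its default.
db, err := dbm.NewDB("blockstore_bench", dbm.GoLevelDBBackend, config.DBDir())
if err != nil {
	b.Fatal(err)
}
stateStore := sm.NewStore(db, sm.StoreOptions{DiscardABCIResponses: false})
```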

@Wondertan
Member

We should not expect reads and writes to be faster with file-based storage. Opening an FD takes a while. This is one of the reasons DBs exist and prefer to open a single/limited set of files, like SQLite does.

@tzdybal
Member

tzdybal commented May 14, 2025

I agree that we shouldn't expect files to be much faster than a DB - there are overheads from the OS and filesystem, and from the DB engine (which also needs to use the OS and filesystem underneath).
To some extent the benchmark tests the implementation of some fancy trees in the DB/KV engine vs. other fancy trees in the filesystem 🌲

Very good read from SQLite - it also reminds us to keep internal fragmentation in mind.

@tzdybal
Member

tzdybal commented May 14, 2025

And one more concern - it seems that there is a lot of redundancy (this is especially important for #1815):

  • block parts are the same thing as the entire block, only chunked
  • the commit is already included in the block (and block parts)
  • block meta is also quite redundant

For reading speed (if we would like to replace the KV-based store with a file store) the redundancy might make sense; for archiving, it should be minimized. From a performance perspective it would be interesting to see how it would behave if only the raw block were stored.
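
A minimal sketch of the raw-block-only direction, deriving parts on demand via the types API instead of persisting them separately (the helper is illustrative; whether recomputation is cheap enough is exactly what a benchmark would have to show):

```go
// loadBlockPart rebuilds the part set from a block loaded from the raw file
// and returns the requested part, instead of reading a stored part file.
func loadBlockPart(block *types.Block, index int) (*types.Part, error) {
	ps, err := block.MakePartSet(types.BlockPartSizeBytes)
	if err != nil {
		return nil, err
	}
	return ps.GetPart(index), nil
}
```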

@adlerjohn
Member

> For reading speed (if we would like to replace the KV-based store with a file store) the redundancy might make sense; for archiving, it should be minimized. From a performance perspective it would be interesting to see how it would behave if only the raw block were stored.

Part of #1815 is the intuition that recent and non-recent blocks might have different performance needs.

@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from d001305 to 4824028 on May 14, 2025 15:16
@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 4824028 to 3b68255 on May 14, 2025 16:44
@evan-forbes
Member

evan-forbes commented May 14, 2025

> Differences are smaller (~3x vs ~10x for SaveBlock), but still significant.

3x is more in line with what I would expect, and seems doable for an "ancient blockstore".

we could make an even larger tradeoff if we compressed the blocks

nice digging @tzdybal
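
A minimal sketch of the compression idea using the standard library's gzip (a real implementation might prefer zstd or snappy; the helper is illustrative and blockBz is the already-serialized block):

```go
package store

import (
	"compress/gzip"
	"os"
)

// writeCompressed gzips the serialized block on its way to disk, trading
// CPU time for smaller files.
func writeCompressed(path string, blockBz []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	if _, err := zw.Write(blockBz); err != nil {
		return err
	}
	return zw.Close() // flushes any buffered compressed data
}
```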

@yarikbratashchuk
Contributor Author

@tzdybal I'm still getting some crazy SaveBlock times. What would you suggest for improvement?

@yarikbratashchuk yarikbratashchuk requested a review from tzdybal May 15, 2025 18:54
@tzdybal
Member

tzdybal left a comment

I started experimenting by increasing the block size, setting the number of txs to 500, 1000, etc. This seemingly made the results almost even for both implementations. But the results are also skewed by the generation of test data - its cost is quite significant. It's worth adding b.StopTimer() before the createTestingBlock call and b.StartTimer() after it.

Also, the transactions are very small, which makes the resulting blocks also very small. They probably won't even exercise some of the fancy logic related to buffering. Another quick change - using Tx%01000d as the format makes blocks way bigger, and this again changes the results significantly. But this is also very specific data - its entropy is very low and its compression ratio is very high (~20x), which may impact the results.

In general, I think test blocks should be created in a way that simulates something real - we can look at the average block size on mainnet, and the data should probably be randomized to some extent (to introduce more entropy and make compression less effective).

To make sure that the time measurement is correct, you also need to ensure that closing the DB is included in the measurement - otherwise data might still be in buffers, not flushed to disk, which impacts the results.

I would also opt for testing multiple blocks per benchmark "op" - for example (see the sketch after this list):

  1. Start Timer
  2. Save 1000 blocks.
  3. Close DB to flush/sync DB to disk.
  4. Stop timer.
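
A minimal sketch of that shape, assuming hypothetical helpers newFileBlockStore, createTestingBlock (with randomized txs), and seenCommitFor:

```go
func BenchmarkSaveBlocks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		bs := newFileBlockStore(b) // fresh store per benchmark op
		blocks := make([]*types.Block, 1000)
		for h := range blocks {
			blocks[h] = createTestingBlock(int64(h + 1))
		}
		b.StartTimer()

		for _, blk := range blocks {
			ps, err := blk.MakePartSet(types.BlockPartSizeBytes)
			if err != nil {
				b.Fatal(err)
			}
			bs.SaveBlock(blk, ps, seenCommitFor(blk))
		}
		// Close inside the timed region so buffered writes are flushed/synced.
		if err := bs.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```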

@yarikbratashchuk yarikbratashchuk force-pushed the yarik/filebased-blockstore branch from 2720e86 to 9021048 on May 16, 2025 14:29
Projects
Status: Needs Triage
Development

Successfully merging this pull request may close these issues:

Use file-based block store (#1812)
5 participants