48 changes: 48 additions & 0 deletions cmd/nitro/nitro.go
@@ -16,6 +16,7 @@ import (
"path/filepath"
"reflect"
"strings"
"sync"
"syscall"
"time"

@@ -36,6 +37,7 @@ import (
_ "github.com/ethereum/go-ethereum/eth/tracers/js"
_ "github.com/ethereum/go-ethereum/eth/tracers/native"
"github.com/ethereum/go-ethereum/ethclient"
"github.com/ethereum/go-ethereum/ethdb"
"github.com/ethereum/go-ethereum/graphql"
"github.com/ethereum/go-ethereum/log"
"github.com/ethereum/go-ethereum/metrics"
@@ -641,6 +643,11 @@ func mainImpl() int {
deferFuncs = []func(){func() { currentNode.StopAndWait() }}
}

// Live db snapshot creation is only supported on archive nodes
if nodeConfig.Execution.Caching.Archive {
go liveDBSnapshotter(ctx, chainDb, arbDb, execNode.ExecEngine.CreateBlocksMutex(), func() string { return liveNodeConfig.Get().SnapshotDir })
Contributor:

I think that:

  • we should use the stopwaiter pattern for liveDBSnapshotter (see the sketch below)
  • it might be nice to have a config option to disable (not start) the snapshotter, e.g. to be extra safe if we are running a sequencer
  • we should also be able to support full nodes (non-archive); I describe this in more detail in the comment on liveDBSnapshotter
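A rough sketch of what the stopwaiter-based version could look like (untested; the type name and field layout are hypothetical, assuming nitro's util/stopwaiter API of StopWaiter with Start/LaunchThread/StopAndWait, and imports mirroring the ones already added to nitro.go):

type LiveDBSnapshotter struct {
	stopwaiter.StopWaiter
	chainDb     ethdb.Database
	arbDb       ethdb.Database
	snapshotDir func() string
}

func (s *LiveDBSnapshotter) Start(ctxIn context.Context) {
	s.StopWaiter.Start(ctxIn, s)
	sigusr2 := make(chan os.Signal, 1)
	signal.Notify(sigusr2, syscall.SIGUSR2)
	s.LaunchThread(func(ctx context.Context) {
		for {
			select {
			case <-ctx.Done():
				return
			case <-sigusr2:
				// same snapshot logic as liveDBSnapshotter in this PR
			}
		}
	})
}

// and in mainImpl, instead of the bare goroutine:
//   snapshotter.Start(ctx)
//   deferFuncs = append(deferFuncs, func() { snapshotter.StopAndWait() })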

Member:

Want a config option to enable the snapshotter, off by default.
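For illustration, a minimal sketch of such a gate (the option name and field are hypothetical; they follow the koanf pattern used for snapshot-dir later in this file):

// in NodeConfig:  SnapshotCreationEnable bool `koanf:"snapshot-creation-enable"`
// in NodeConfigAddOptions:
f.Bool("snapshot-creation-enable", NodeConfigDefault.SnapshotCreationEnable, "enable the live database snapshotter (triggered by SIGUSR2), off by default")

// in mainImpl, gating the goroutine start:
if nodeConfig.SnapshotCreationEnable && nodeConfig.Execution.Caching.Archive {
	go liveDBSnapshotter(ctx, chainDb, arbDb, execNode.ExecEngine.CreateBlocksMutex(), func() string { return liveNodeConfig.Get().SnapshotDir })
}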

}

sigint := make(chan os.Signal, 1)
signal.Notify(sigint, os.Interrupt, syscall.SIGTERM)

@@ -674,6 +681,43 @@ func mainImpl() int {
return 0
}

func liveDBSnapshotter(ctx context.Context, chainDb, arbDb ethdb.Database, createBlocksMutex *sync.Mutex, snapshotDirGetter func() string) {
sigusr2 := make(chan os.Signal, 1)
signal.Notify(sigusr2, syscall.SIGUSR2)

for {
select {
case <-ctx.Done():
return
case <-sigusr2:
log.Info("Live databases snapshot creation triggered by SIGUSR2")
}

snapshotDir := snapshotDirGetter()
if snapshotDir == "" {
log.Error("Aborting live databases snapshot creation as destination directory is empty, try updating --snapshot-dir in the config file")
continue
}

createBlocksMutex.Lock()
@magicxyyz (Contributor), Jan 8, 2025:

I think that if we rearrange the order of things a bit and patch geth a little, we could also support a non-archive node (which I believe would be the main use case for db snapshotting).

  1. Instead of triggering the snapshot here, we could schedule a snapshot after the next block is created. We could call e.g. execNode.ScheduleDBSnapshot, so we wouldn't need access to createBlocksMutex or any other internals of ExecutionNode (that will probably be especially important for the execution split).
  2. In ExecutionEngine.appendBlock, if a snapshot was scheduled, we could trigger it after s.bc.WriteBlockAndSetHeadWithTime.

To support full nodes (non-archive) we need to make sure that the state for the block written with WriteBlockAndSetHeadWithTime is committed to disk. To do that we need to force-commit the state. It could be done e.g. with the ForceTriedbCommitHook hook that I added in the snap sync draft: https://github.com/OffchainLabs/go-ethereum/pull/280/files#diff-53d5f4b8a536ec2a8c8c92bf70b8268f1d77ad77e9f316e6f68a2bcae5303215

The hook would be set to a function created in gethexec scope that has access to the ExecutionEngine, something like:

hook := func() bool {
    return execEngine.shouldForceCommitState()
}

func (e *ExecutionEngine) shouldForceCommitState() bool {
    return e.forceCommitState
}

func (e *ExecutionEngine) ScheduleDBSnapshot() {
    e.dbSnapshotScheduled.Store(true)
}

func (e *ExecutionEngine) appendBlock() error {
...
    snapshotScheduled := e.dbSnapshotScheduled.Load()
    if snapshotScheduled {
        e.forceCommitState = true
    }
    status, err := e.bc.WriteBlockAndSetHeadWithTime(...)
    if err != nil {
        return err
    }
    ...
    if snapshotScheduled {
        e.forceCommitState = false
        chainDb.CreateDBSnapshot(snapshotDir)
    }
...
}

Setting the hook can be done similarly to the SnapHelper draft PR: https://github.com/OffchainLabs/nitro/pull/2122/files#diff-19d6494fe5ff01c95bfdd1e4af6d31d75207d21743af80f57f0cf93848a32e3e

Having written that, I am no longer sure it's as straightforward as I thought when starting this comment 😓 but it should be doable :)

Contributor:

If we want to go that way, I can split out a simplified ForceTriedbCommitHook from my draft PRs so it can be merged earlier and used here.

Contributor Author:

Thank you, this seems like a great idea; I will see if it works out for snapshotting a full node.
Sorry for the delay!

@magicxyyz (Contributor), Apr 11, 2025:

Thinking about it now, since you're already acquiring the block creation mutex, you should be able to commit state similarly to how ExecutionEngine.Maintenance does it - it calls the newly added go-ethereum/core.BlockChain.FlushTrieDB.

FlushTrieDB doesn't commit the state root for the head block and the snapshot root, but rather the root for the oldest block in BlockChain.triegc (the next one to be garbage collected). A node started from such a snapshot might be able to recover, but that would need a bit more attention.

What might be the best option is to add a new method to core.BlockChain that persists the same roots as when the blockchain is stopped and "ensures that the entirety of the state snapshot is journaled to disk" by calling bc.snaps.Journal: https://github.com/OffchainLabs/go-ethereum/blob/e6c8bea35d519098cf7cc9c0d3765aef9ab72cbb/core/blockchain.go#L1202-L1257
The live snapshot should then have a database state similar to the one after stopping the blockchain, but without actually stopping it :)
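Roughly, such a method could look like this (hypothetical name; an untested sketch that mirrors the tail of the linked Stop code for the hashdb scheme, except that it keeps the snapshot tree alive instead of releasing it; exact identifiers may differ between geth versions):

// would live in core/blockchain.go
func (bc *BlockChain) FlushStateForLiveSnapshot() error {
	// Journal the snapshot diff layers to disk, as Stop does, but don't Release the tree.
	var snapBase common.Hash
	if bc.snaps != nil {
		var err error
		if snapBase, err = bc.snaps.Journal(bc.CurrentBlock().Root); err != nil {
			return err
		}
	}
	if bc.cacheConfig.TrieDirtyDisabled {
		return nil // archive mode already persists every root
	}
	// Commit the same recent roots that Stop flushes on shutdown.
	for _, offset := range []uint64{0, 1, TriesInMemory - 1} {
		if number := bc.CurrentBlock().Number.Uint64(); number > offset {
			recent := bc.GetBlockByNumber(number - offset)
			if err := bc.triedb.Commit(recent.Root(), true); err != nil {
				return err
			}
		}
	}
	if snapBase != (common.Hash{}) {
		return bc.triedb.Commit(snapBase, true)
	}
	return nil
}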

What do you think?

Contributor Author:

This makes sense to me. I'm currently testing your last idea, i.e. mimicking what the blockchain does during a stop without actually clearing the underlying data the way a stop would.

@magicxyyz (Contributor), Apr 11, 2025:

IMO you don't have to worry about the side effects of committing tries from the triedb (hashdb) to disk:

  • the recently written trie nodes will be cached in the cleans cache
  • the hashdb referencing/dereferencing mechanism should work just fine with the commits

The main area to be careful with and explore seems to be the operations around snapshots, e.g. bc.snaps.Journal, as the method is meant to be used during shutdown, according to its comment:

// Journal commits an entire diff hierarchy to disk into a single journal entry.
// This is meant to be used during shutdown to persist the snapshot without
// flattening everything down (bad for reorgs).
//
// The method returns the root hash of the base layer that needs to be persisted
// to disk as a trie too to allow continuing any pending generation op.
func (t *Tree) Journal(root common.Hash) (common.Hash, error) {

Contributor Author:

The approach worked for database sizes from a thousand blocks up to about 3.5 million! Marking the PR as ready for review and for further testing with up-to-date arb1 and arbsepolia databases.

@magicxyyz (Contributor), Apr 16, 2025:

One thing that I think needs looking into more is the unclean shutdown log line, e.g. from an arb1 node started from the snapshot:

WARN [04-14|13:31:56.376] Unclean shutdown detected                booted=2025-04-14T13:28:39+0530 age=3m17s

I think that might be because on a normal shutdown the latest unclean shutdown marker is removed from the db, and since we don't stop the node, the copy still has the marker. The boot time from the marker seems to match the time when the original node was started.

Here's how the shutdown tracker works: https://github.com/OffchainLabs/go-ethereum/blob/master/internal/shutdowncheck/shutdown_tracker.go

  • MarkStartup is called on startup - it checks whether there are unclean shutdown markers from previous boots and pushes the current boot timestamp onto the markers list
  • Stop is called after the node is fully stopped - it pops the latest timestamp from the markers

We probably need to somehow rawdb.PopUncleanShutdownMarker from the snapshot.
That might be tricky. The simplest solution would be to pop the marker and push it back after the live snapshot (see the sketch below). That requires some thought about what happens if the node crashes while doing the live snapshot.

My initial thinking is that if we first save everything needed to disk and then pop the marker, we should be OK, unless crashing during a pebble Checkpoint corrupts the db. We might need to figure out another way of cleaning up the marker if that's not safe enough.
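A rough sketch of that pop-then-push idea (untested; the helper name is hypothetical, rawdb.PopUncleanShutdownMarker/PushUncleanShutdownMarker are the existing go-ethereum helpers, and note that the push records a fresh timestamp rather than restoring the original one):

func snapshotWithoutUncleanMarker(chainDb ethdb.Database, snapshotDir string) error {
	// Drop our own boot marker so the copied database looks cleanly shut down.
	rawdb.PopUncleanShutdownMarker(chainDb)
	snapErr := chainDb.CreateDBSnapshot(snapshotDir)
	// Re-add a marker for the still-running node; if we crash between the pop and
	// this push, the next boot will be missing one unclean-shutdown record.
	if _, _, err := rawdb.PushUncleanShutdownMarker(chainDb); err != nil {
		log.Warn("Failed to restore unclean shutdown marker after live snapshot", "err", err)
	}
	return snapErr
}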

Contributor Author:

This seems very dicey; as you pointed out, a node crashing during checkpointing would definitely want to have this marker, so we can't pop it before snapshotting just to keep it out of our copied database.

log.Info("Beginning snapshot creation for l2chaindata, ancient and wasm databases")
err := chainDb.CreateDBSnapshot(snapshotDir)
createBlocksMutex.Unlock()
if err != nil {
log.Error("Snapshot creation for l2chaindata, ancient and wasm databases failed", "err", err)
continue
}
log.Info("Live snapshot of l2chaindata, ancient and wasm databases were successfully created")

log.Info("Beginning snapshot creation for arbitrumdata database")
if err := arbDb.CreateDBSnapshot(snapshotDir); err != nil {
log.Error("Snapshot creation for arbitrumdata database failed", "err", err)
} else {
log.Info("Live snapshot of arbitrumdata database was successfully created")
}
}
}

type NodeConfig struct {
Conf genericconf.ConfConfig `koanf:"conf" reload:"hot"`
Node arbnode.Config `koanf:"node" reload:"hot"`
@@ -697,6 +741,7 @@ type NodeConfig struct {
Init conf.InitConfig `koanf:"init"`
Rpc genericconf.RpcConfig `koanf:"rpc"`
BlocksReExecutor blocksreexecutor.Config `koanf:"blocks-reexecutor"`
SnapshotDir string `koanf:"snapshot-dir" reload:"hot"`
}

var NodeConfigDefault = NodeConfig{
@@ -722,6 +767,7 @@ var NodeConfigDefault = NodeConfig{
PProf: false,
PprofCfg: genericconf.PProfDefault,
BlocksReExecutor: blocksreexecutor.DefaultConfig,
SnapshotDir: "",
}

func NodeConfigAddOptions(f *flag.FlagSet) {
@@ -748,6 +794,8 @@ func NodeConfigAddOptions(f *flag.FlagSet) {
conf.InitConfigAddOptions("init", f)
genericconf.RpcConfigAddOptions("rpc", f)
blocksreexecutor.ConfigAddOptions("blocks-reexecutor", f)

f.String("snapshot-dir", NodeConfigDefault.SnapshotDir, "directory in which snapshot of databases would be stored")
}

func (c *NodeConfig) ResolveDirectoryNames() error {
4 changes: 4 additions & 0 deletions cmd/replay/db.go
@@ -18,6 +18,10 @@ import (

type PreimageDb struct{}

func (db PreimageDb) CreateDBSnapshot(dir string) error {
return errors.New("createDBSnapshot method is not supported by PreimageDb")
}

func (db PreimageDb) Has(key []byte) (bool, error) {
if len(key) != 32 {
return false, nil
13 changes: 11 additions & 2 deletions execution/gethexec/api.go
@@ -11,6 +11,7 @@ import (
"math/big"
"sync"
"sync/atomic"
"syscall"
"time"

"github.com/ethereum/go-ethereum/arbitrum"
@@ -40,10 +41,18 @@ type ArbDebugAPI struct {
blockchain *core.BlockChain
blockRangeBound uint64
timeoutQueueBound uint64
isArchiveNode bool
}

func NewArbDebugAPI(blockchain *core.BlockChain, blockRangeBound uint64, timeoutQueueBound uint64) *ArbDebugAPI {
return &ArbDebugAPI{blockchain, blockRangeBound, timeoutQueueBound}
func NewArbDebugAPI(blockchain *core.BlockChain, blockRangeBound uint64, timeoutQueueBound uint64, isArchiveNode bool) *ArbDebugAPI {
return &ArbDebugAPI{blockchain, blockRangeBound, timeoutQueueBound, isArchiveNode}
}

func (api *ArbDebugAPI) CreateDBSnapshot(ctx context.Context) error {
if !api.isArchiveNode {
return errors.New("live database snapshot creation is not available for non-archive nodes")
}
return syscall.Kill(syscall.Getpid(), syscall.SIGUSR2)
}
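For reference, a rough client-side sketch of triggering a live snapshot through this API (assumptions: the arbdebug namespace this API is registered under in node.go, and the node's default local HTTP endpoint; the call just sends SIGUSR2 to the node process, which wakes liveDBSnapshotter):

package main

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Endpoint is an assumption; use whatever HTTP/WS/IPC endpoint the node exposes.
	client, err := rpc.Dial("http://localhost:8547")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The snapshot itself is written asynchronously by liveDBSnapshotter into --snapshot-dir.
	if err := client.CallContext(context.Background(), nil, "arbdebug_createDBSnapshot"); err != nil {
		log.Fatal(err)
	}
}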

type PricingModelHistory struct {
4 changes: 4 additions & 0 deletions execution/gethexec/executionengine.go
@@ -112,6 +112,10 @@ func NewExecutionEngine(bc *core.BlockChain) (*ExecutionEngine, error) {
}, nil
}

func (s *ExecutionEngine) CreateBlocksMutex() *sync.Mutex {
return &s.createBlocksMutex
}

func (s *ExecutionEngine) backlogCallDataUnits() uint64 {
s.cachedL1PriceData.mutex.RLock()
defer s.cachedL1PriceData.mutex.RUnlock()
1 change: 1 addition & 0 deletions execution/gethexec/node.go
Expand Up @@ -277,6 +277,7 @@ func CreateExecutionNode(
l2BlockChain,
config.RPC.ArbDebug.BlockRangeBound,
config.RPC.ArbDebug.TimeoutQueueBound,
config.Caching.Archive,
),
Public: false,
})