feat(cloud-archive): configurable state snapshot frequency #15633
Codecov Report

```
@@            Coverage Diff             @@
##           master   #15633      +/-   ##
==========================================
+ Coverage   69.40%   69.42%   +0.02%
==========================================
  Files         938      938
  Lines      213330   213455     +125
  Branches   213330   213455     +125
==========================================
+ Hits       148066   148199     +133
+ Misses      59374    59356      -18
- Partials     5890     5900      +10
==========================================
```
Pull request overview
Adds a configurable epoch-based cadence for state snapshots in cloud archival writing to reduce storage costs while keeping reader bootstrap functional.
Changes:
- Adds `snapshot_every_n_epochs` (default 10) to `CloudArchivalWriterConfig` and threads it into runtime state snapshot configuration.
- Enforces snapshot cadence in both local snapshot production (`Chain::should_make_snapshot`) and cloud dump/upload (`StateDumper`).
- Adds `find_snapshot_at_or_before(height, shard_id)` to help readers locate the nearest available snapshot; updates/extends test-loop coverage.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test-loop-tests/src/utils/cloud_archival.rs | Plumbs snapshot cadence into writer setup and snapshot assertions. |
| test-loop-tests/src/tests/cloud_archival.rs | Adds custom-frequency test and passes cadence through harness assertions. |
| test-loop-tests/src/setup/setup.rs | Passes snapshot cadence into NightshadeRuntime::test_with_trie_config. |
| nearcore/src/state_sync.rs | Skips cloud state dumps for epochs not matching cadence. |
| nearcore/src/config_validate.rs | Validates snapshot_every_n_epochs != 0 and adds a regression test. |
| nearcore/src/config.rs | Uses snapshot cadence when enabling state snapshots for cloud archival writers. |
| integration-tests/src/env/nightshade_setup.rs | Updates runtime constructor call with new snapshot cadence parameter. |
| core/store/src/trie/state_snapshot.rs | Extends StateSnapshotConfig to store snapshot cadence and exposes it. |
| core/chain-configs/src/client_config.rs | Adds snapshot_every_n_epochs to CloudArchivalWriterConfig with default 10. |
| chain/client/src/archive/cloud_archival_reader.rs | Adds find_snapshot_at_or_before helper for locating snapshots. |
| chain/chain/src/runtime/test_utils.rs | Updates test_with_trie_config to build snapshot config with cadence. |
| chain/chain/src/chain.rs | Gates local snapshot creation on epoch_height % snapshot_every_n_epochs == 0. |
| chain/jsonrpc/openapi/src/main.rs | Bumps OpenAPI version string. |
| chain/jsonrpc/openapi/openrpc.json | Adds schema field for snapshot_every_n_epochs. |
| chain/jsonrpc/openapi/openapi.json | Adds schema field for snapshot_every_n_epochs. |
```rust
            }
            CurrentDump::None => {}
        };
        if !self.should_dump_epoch(sync_header.epoch_id())? {
```
When an epoch is skipped by cadence, check_head returns early without updating current_dump. This means the dumper will re-evaluate should_dump_epoch on every tick for the same epoch (and keep CurrentDump::None/old Done), causing unnecessary repeated epoch-info lookups. Consider marking current_dump as Done(*sync_header.epoch_id()) (or tracking “skipped” epochs) when cadence disallows dumping, so the loop stays quiet until the epoch changes.
Suggested change:

```diff
  if !self.should_dump_epoch(sync_header.epoch_id())? {
+     self.current_dump = CurrentDump::Done(*sync_header.epoch_id());
```
```rust
    }
    // Snapshots for the most recent epoch have not been uploaded yet.
    let expected_snapshots = HashSet::from_iter(1..final_epoch_height);
    // With a cadence > 1, only epoch heights that are multiples of the cadence carry a snapshot.
```
expected_snapshots uses h % snapshot_every_n_epochs, which will panic if snapshot_every_n_epochs is ever 0. Even if production config validation rejects 0, this helper is callable from tests and accepts an arbitrary u64. Consider adding an assert!(snapshot_every_n_epochs > 0) (or treating 0 as 1) at the start of the function to make failures clearer.
Suggested change:

```diff
  // With a cadence > 1, only epoch heights that are multiples of the cadence carry a snapshot.
+ assert!(
+     snapshot_every_n_epochs > 0,
+     "snapshot_every_n_epochs must be greater than 0"
+ );
```
```rust
    tracing::info!(epoch_height, ?epoch_id, "probing for state snapshot");

    if cloud_storage.get_state_header(epoch_height, epoch_id, shard_id).is_ok() {
        return Ok(Some((epoch_height, epoch_id)));
    }
```
find_snapshot_at_or_before treats any error from cloud_storage.get_state_header(...) as “snapshot missing” by checking only is_ok(). This can hide real retrieval/deserialization/network failures and cause the search to walk back (or even return None) instead of surfacing the underlying error. Consider distinguishing “not found” from other failures (e.g., add an existence check / get_state_header_if_exists API, or use a directory listing/HEAD request for the header object) and propagate non-NotFound errors.
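A minimal sketch of the suggested distinction, using a hypothetical `StorageError` enum and a stand-in `get_state_header`; the real cloud-storage error type and the helper's signature in nearcore may differ:

```rust
// Hypothetical error type for illustration; the real cloud-storage error
// in nearcore may differ.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum StorageError {
    NotFound,
    Io(String),
}

// Stand-in for `cloud_storage.get_state_header(...)`: pretend only epoch
// heights that are multiples of 10 have a snapshot header.
fn get_state_header(epoch_height: u64) -> Result<(), StorageError> {
    if epoch_height != 0 && epoch_height % 10 == 0 { Ok(()) } else { Err(StorageError::NotFound) }
}

/// Walks epoch heights backward; only `NotFound` continues the search,
/// while any other error is propagated instead of being swallowed.
fn find_snapshot_at_or_before(start: u64) -> Result<Option<u64>, StorageError> {
    let mut epoch_height = start;
    loop {
        match get_state_header(epoch_height) {
            Ok(()) => return Ok(Some(epoch_height)),
            Err(StorageError::NotFound) if epoch_height > 1 => epoch_height -= 1,
            Err(StorageError::NotFound) => return Ok(None),
            Err(other) => return Err(other), // surface real retrieval failures
        }
    }
}

fn main() {
    // With cadence 10, a search starting at height 25 walks back to 20.
    assert_eq!(find_snapshot_at_or_before(25), Ok(Some(20)));
    // No snapshot at or below height 9 in this stand-in setup.
    assert_eq!(find_snapshot_at_or_before(9), Ok(None));
}
```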
```json
"default": 10,
"description": "Cadence of state snapshots, in epochs. Higher values reduce bucket cost at\nthe expense of potentially longer delta replay during reader bootstrap.",
"format": "uint64",
"minimum": 0,
```
The OpenRPC schema advertises snapshot_every_n_epochs with minimum: 0, but runtime config validation rejects 0 (cloud_archival_writer.snapshot_every_n_epochs must be > 0). Update the generated schema to reflect the actual constraint (minimum 1) so clients/tools don’t assume 0 is valid.
Suggested change:

```diff
- "minimum": 0,
+ "minimum": 1,
```
```json
"default": 10,
"description": "Cadence of state snapshots, in epochs. Higher values reduce bucket cost at\nthe expense of potentially longer delta replay during reader bootstrap.",
"format": "uint64",
"minimum": 0,
```
Same schema issue as in openrpc.json: snapshot_every_n_epochs shows minimum: 0 but config validation requires it to be > 0. Please update the OpenAPI schema to minimum: 1 (or equivalent) to match the actual accepted config values.
Suggested change:

```diff
- "minimum": 0,
+ "minimum": 1,
```
```rust
pub fn enabled_with_frequency(
    hot_store_path: impl AsRef<Path>,
    snapshot_every_n_epochs: u64,
) -> Self {
    // Assumptions:
    // * RocksDB checkpoints are taken instantly and for free, because the filesystem supports hard links.
    // * The best place for checkpoints is within the `hot_store_path`, because that directory is often a separate disk.
    Self::Enabled {
        state_snapshots_dir: hot_store_path.as_ref().join(Self::STATE_SNAPSHOT_DIR),
        snapshot_every_n_epochs,
    }
}

pub fn state_snapshots_dir(&self) -> Option<&Path> {
    match self {
        StateSnapshotConfig::Disabled => None,
        StateSnapshotConfig::Enabled { state_snapshots_dir, .. } => Some(state_snapshots_dir),
    }
}

/// Configured snapshot cadence in epochs. Defaults to `1`.
pub fn snapshot_frequency(&self) -> u64 {
    if let StateSnapshotConfig::Enabled { snapshot_every_n_epochs, .. } = self {
        return *snapshot_every_n_epochs;
    }
    1
}
```
StateSnapshotConfig::enabled_with_frequency accepts snapshot_every_n_epochs without validation, and snapshot_frequency() can therefore return 0. Callers like Chain::should_make_snapshot and StateDumper::should_dump_epoch now do epoch_height % snapshot_frequency, which would panic on 0. Consider enforcing snapshot_every_n_epochs >= 1 here (assert/return error) or clamping 0 to 1 in snapshot_frequency() to make the config robust even when constructed programmatically.
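The clamping variant suggested above could look like the following sketch; `StateSnapshotConfig` here is a simplified stand-in for the real type, kept only to show the behavior:

```rust
// Simplified stand-in for the real `StateSnapshotConfig`; only the
// clamp-zero-to-one behavior suggested above is illustrated.
enum StateSnapshotConfig {
    Disabled,
    Enabled { snapshot_every_n_epochs: u64 },
}

impl StateSnapshotConfig {
    /// Returns the cadence, clamping a misconfigured 0 to 1 so that callers
    /// computing `epoch_height % cadence` can never divide by zero.
    fn snapshot_frequency(&self) -> u64 {
        match self {
            StateSnapshotConfig::Disabled => 1,
            StateSnapshotConfig::Enabled { snapshot_every_n_epochs } => {
                (*snapshot_every_n_epochs).max(1)
            }
        }
    }
}

fn main() {
    let bad = StateSnapshotConfig::Enabled { snapshot_every_n_epochs: 0 };
    assert_eq!(bad.snapshot_frequency(), 1); // 0 is clamped, no later panic
    assert_eq!(StateSnapshotConfig::Disabled.snapshot_frequency(), 1);
}
```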
Adds `snapshot_every_n_epochs` to `CloudArchivalWriterConfig` (default 10) so a cloud archival writer can take state snapshots every N epochs instead of every epoch. State-part uploads dominate bucket cost for a dedicated writer/dumper; a 10-epoch cadence (~70h on mainnet) cuts that cost proportionally. Reader bootstrap still works: it applies per-block deltas on top of the nearest snapshot, so longer gaps just mean more delta replay.

The cadence is enforced in two places: `Chain::should_make_snapshot` (local snapshot production) and `StateDumper` (cloud uploads). Non-writer nodes see cadence 1, so both gates are no-ops for validators.

The reader gets a `find_snapshot_at_or_before(height, shard_id)` helper that walks epochs backward until it finds a snapshot, returning the epoch height and id.

Tests:
- `test_cloud_archival_custom_snapshot_frequency`: verifies snapshots land only on expected epochs and `find_snapshot_at_or_before` walks back correctly past skipped epochs.
- `test_cloud_archival_writer_snapshot_frequency_nonzero`: validator rejects `snapshot_every_n_epochs = 0`.
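The gate enforced in both places reduces to a modulus check on the epoch height, as noted in the per-file summary (`epoch_height % snapshot_every_n_epochs == 0`). A minimal self-contained sketch, assuming the cadence has already been validated as nonzero:

```rust
/// Minimal sketch of the cadence gate: a snapshot (or cloud dump) happens
/// only when the epoch height is a multiple of the configured cadence.
/// Assumes `snapshot_every_n_epochs` has been validated to be > 0.
fn should_snapshot(epoch_height: u64, snapshot_every_n_epochs: u64) -> bool {
    epoch_height % snapshot_every_n_epochs == 0
}

fn main() {
    // With the default cadence of 10, only every tenth epoch carries a snapshot.
    let snapshotted: Vec<u64> = (1..=30).filter(|&h| should_snapshot(h, 10)).collect();
    assert_eq!(snapshotted, vec![10, 20, 30]);
    // Non-writer nodes see cadence 1, so the gate is a no-op for them.
    assert!((1..=30).all(|h| should_snapshot(h, 1)));
}
```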