Skip to content

feat(cloud-archive): configurable state snapshot frequency#15633

Open
staffik wants to merge 2 commits intomasterfrom
stafik/cloud-archive/configurable-snapshot-frequency
Open

feat(cloud-archive): configurable state snapshot frequency#15633
staffik wants to merge 2 commits intomasterfrom
stafik/cloud-archive/configurable-snapshot-frequency

Conversation

@staffik
Copy link
Copy Markdown
Contributor

@staffik staffik commented Apr 24, 2026

Adds snapshot_every_n_epochs to CloudArchivalWriterConfig (default 10) so a cloud archival writer can take state snapshots every N epochs instead of every epoch. State-part uploads dominate bucket cost for a dedicated writer/dumper; a 10-epoch cadence (~70h on mainnet) cuts that cost proportionally. Reader bootstrap still works — it applies per-block deltas on top of the nearest snapshot, so longer gaps just mean more delta replay.

The cadence is enforced at two places: Chain::should_make_snapshot (local snapshot production) and StateDumper (cloud uploads). Non-writer nodes see cadence 1 so both gates are no-ops for validators.

Reader gets a find_snapshot_at_or_before(height, shard_id) helper that walks epochs backward until it finds a snapshot, returning the epoch height and id.

Tests

  • test_cloud_archival_custom_snapshot_frequency — verifies snapshots land only on expected epochs and find_snapshot_at_or_before walks back correctly past skipped epochs.
  • test_cloud_archival_writer_snapshot_frequency_nonzero — validator rejects snapshot_every_n_epochs = 0.
  • 11 existing testloop cloud archival tests pass unchanged.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 84.73282% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.42%. Comparing base (64e5f75) to head (d29ef46).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
chain/client/src/archive/cloud_archival_reader.rs 75.75% 2 Missing and 6 partials ⚠️
core/store/src/trie/state_snapshot.rs 69.23% 4 Missing ⚠️
nearcore/src/config.rs 50.00% 4 Missing ⚠️
nearcore/src/state_sync.rs 83.33% 0 Missing and 3 partials ⚠️
chain/chain/src/chain.rs 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #15633      +/-   ##
==========================================
+ Coverage   69.40%   69.42%   +0.02%     
==========================================
  Files         938      938              
  Lines      213330   213455     +125     
  Branches   213330   213455     +125     
==========================================
+ Hits       148066   148199     +133     
+ Misses      59374    59356      -18     
- Partials     5890     5900      +10     
Flag Coverage Δ
pytests-nightly 1.14% <0.00%> (-0.01%) ⬇️
unittests 68.84% <84.73%> (+0.03%) ⬆️
unittests-nightly 68.96% <84.73%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@staffik staffik requested review from Trisfald and Wiezzel April 24, 2026 09:20
@staffik staffik marked this pull request as ready for review April 24, 2026 09:20
@staffik staffik requested a review from frol as a code owner April 24, 2026 09:20
Copilot AI review requested due to automatic review settings April 24, 2026 09:20
@staffik staffik requested a review from a team as a code owner April 24, 2026 09:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable epoch-based cadence for state snapshots in cloud archival writing to reduce storage costs while keeping reader bootstrap functional.

Changes:

  • Adds snapshot_every_n_epochs (default 10) to CloudArchivalWriterConfig and threads it into runtime state snapshot configuration.
  • Enforces snapshot cadence in both local snapshot production (Chain::should_make_snapshot) and cloud dump/upload (StateDumper).
  • Adds find_snapshot_at_or_before(height, shard_id) to help readers locate the nearest available snapshot; updates/extends test-loop coverage.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test-loop-tests/src/utils/cloud_archival.rs Plumbs snapshot cadence into writer setup and snapshot assertions.
test-loop-tests/src/tests/cloud_archival.rs Adds custom-frequency test and passes cadence through harness assertions.
test-loop-tests/src/setup/setup.rs Passes snapshot cadence into NightshadeRuntime::test_with_trie_config.
nearcore/src/state_sync.rs Skips cloud state dumps for epochs not matching cadence.
nearcore/src/config_validate.rs Validates snapshot_every_n_epochs != 0 and adds a regression test.
nearcore/src/config.rs Uses snapshot cadence when enabling state snapshots for cloud archival writers.
integration-tests/src/env/nightshade_setup.rs Updates runtime constructor call with new snapshot cadence parameter.
core/store/src/trie/state_snapshot.rs Extends StateSnapshotConfig to store snapshot cadence and exposes it.
core/chain-configs/src/client_config.rs Adds snapshot_every_n_epochs to CloudArchivalWriterConfig with default 10.
chain/client/src/archive/cloud_archival_reader.rs Adds find_snapshot_at_or_before helper for locating snapshots.
chain/chain/src/runtime/test_utils.rs Updates test_with_trie_config to build snapshot config with cadence.
chain/chain/src/chain.rs Gates local snapshot creation on epoch_height % snapshot_every_n_epochs == 0.
chain/jsonrpc/openapi/src/main.rs Bumps OpenAPI version string.
chain/jsonrpc/openapi/openrpc.json Adds schema field for snapshot_every_n_epochs.
chain/jsonrpc/openapi/openapi.json Adds schema field for snapshot_every_n_epochs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
CurrentDump::None => {}
};
if !self.should_dump_epoch(sync_header.epoch_id())? {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an epoch is skipped by cadence, check_head returns early without updating current_dump. This means the dumper will re-evaluate should_dump_epoch on every tick for the same epoch (and keep CurrentDump::None/old Done), causing unnecessary repeated epoch-info lookups. Consider marking current_dump as Done(*sync_header.epoch_id()) (or tracking “skipped” epochs) when cadence disallows dumping, so the loop stays quiet until the epoch changes.

Suggested change
if !self.should_dump_epoch(sync_header.epoch_id())? {
if !self.should_dump_epoch(sync_header.epoch_id())? {
self.current_dump = CurrentDump::Done(*sync_header.epoch_id());

Copilot uses AI. Check for mistakes.
}
// Snapshots for the most recent epoch have not been uploaded yet.
let expected_snapshots = HashSet::from_iter(1..final_epoch_height);
// With a cadence > 1, only epoch heights that are multiples of the cadence carry a snapshot.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected_snapshots uses h % snapshot_every_n_epochs, which will panic if snapshot_every_n_epochs is ever 0. Even if production config validation rejects 0, this helper is callable from tests and accepts an arbitrary u64. Consider adding an assert!(snapshot_every_n_epochs > 0) (or treating 0 as 1) at the start of the function to make failures clearer.

Suggested change
// With a cadence > 1, only epoch heights that are multiples of the cadence carry a snapshot.
// With a cadence > 1, only epoch heights that are multiples of the cadence carry a snapshot.
assert!(
snapshot_every_n_epochs > 0,
"snapshot_every_n_epochs must be greater than 0"
);

Copilot uses AI. Check for mistakes.
Comment on lines +174 to +178
tracing::info!(epoch_height, ?epoch_id, "probing for state snapshot");

if cloud_storage.get_state_header(epoch_height, epoch_id, shard_id).is_ok() {
return Ok(Some((epoch_height, epoch_id)));
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_snapshot_at_or_before treats any error from cloud_storage.get_state_header(...) as “snapshot missing” by checking only is_ok(). This can hide real retrieval/deserialization/network failures and cause the search to walk back (or even return None) instead of surfacing the underlying error. Consider distinguishing “not found” from other failures (e.g., add an existence check / get_state_header_if_exists API, or use a directory listing/HEAD request for the header object) and propagate non-NotFound errors.

Copilot uses AI. Check for mistakes.
"default": 10,
"description": "Cadence of state snapshots, in epochs. Higher values reduce bucket cost at\nthe expense of potentially longer delta replay during reader bootstrap.",
"format": "uint64",
"minimum": 0,
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OpenRPC schema advertises snapshot_every_n_epochs with minimum: 0, but runtime config validation rejects 0 (cloud_archival_writer.snapshot_every_n_epochs must be > 0). Update the generated schema to reflect the actual constraint (minimum 1) so clients/tools don’t assume 0 is valid.

Suggested change
"minimum": 0,
"minimum": 1,

Copilot uses AI. Check for mistakes.
"default": 10,
"description": "Cadence of state snapshots, in epochs. Higher values reduce bucket cost at\nthe expense of potentially longer delta replay during reader bootstrap.",
"format": "uint64",
"minimum": 0,
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same schema issue as in openrpc.json: snapshot_every_n_epochs shows minimum: 0 but config validation requires it to be > 0. Please update the OpenAPI schema to minimum: 1 (or equivalent) to match the actual accepted config values.

Suggested change
"minimum": 0,
"minimum": 1,

Copilot uses AI. Check for mistakes.
Comment on lines +150 to 176
pub fn enabled_with_frequency(
hot_store_path: impl AsRef<Path>,
snapshot_every_n_epochs: u64,
) -> Self {
// Assumptions:
// * RocksDB checkpoints are taken instantly and for free, because the filesystem supports hard links.
// * The best place for checkpoints is within the `hot_store_path`, because that directory is often a separate disk.
Self::Enabled {
state_snapshots_dir: hot_store_path.as_ref().join(Self::STATE_SNAPSHOT_DIR),
snapshot_every_n_epochs,
}
}

pub fn state_snapshots_dir(&self) -> Option<&Path> {
match self {
StateSnapshotConfig::Disabled => None,
StateSnapshotConfig::Enabled { state_snapshots_dir } => Some(state_snapshots_dir),
StateSnapshotConfig::Enabled { state_snapshots_dir, .. } => Some(state_snapshots_dir),
}
}

/// Configured snapshot cadence in epochs. Defaults to `1`.
pub fn snapshot_frequency(&self) -> u64 {
if let StateSnapshotConfig::Enabled { snapshot_every_n_epochs, .. } = self {
return *snapshot_every_n_epochs;
}
1
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StateSnapshotConfig::enabled_with_frequency accepts snapshot_every_n_epochs without validation, and snapshot_frequency() can therefore return 0. Callers like Chain::should_make_snapshot and StateDumper::should_dump_epoch now do epoch_height % snapshot_frequency, which would panic on 0. Consider enforcing snapshot_every_n_epochs >= 1 here (assert/return error) or clamping 0 to 1 in snapshot_frequency() to make the config robust even when constructed programmatically.

Copilot uses AI. Check for mistakes.
@staffik staffik force-pushed the stafik/cloud-archive/configurable-snapshot-frequency branch from 5e43ca9 to d29ef46 Compare April 24, 2026 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants