db/snaptype: add cache-busting hash particle to snapshot filenames#19150

Open
anacrolix wants to merge 9 commits into main from anacrolix/cache-busting-snapshots

@anacrolix
Contributor

@anacrolix anacrolix commented Feb 13, 2026

Summary

  • Adds optional cache-busting hash particle to snapshot filenames, placed as a dot-separated component before the file extension (e.g. v1.0-000000-001000-headers.abc123def0.seg)
  • Computes a truncated SHA256 hash of .seg file content after compression and renames the file to include it, ensuring snapshot filenames change when content changes and preventing stale BitTorrent downloads
  • Fully backward compatible: filenames without the particle parse identically to before
  • Includes FileInfo.Hash field, WithHash()/As() hash preservation, construction helpers, and hash-tolerant glob masks
  • Applies content hash after Compress() in all snapshot generation paths: dumpRange, ExtractRange, merge, caplin beacon/blob/state dumps
  • Updates DirtySegment to carry the hash through FileName() and FileInfo()
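
The naming scheme in the first bullet can be illustrated with a minimal sketch. `withHashParticle` is a hypothetical helper written for this example, not the PR's `WithHash()`; it only assumes that the particle goes immediately before the final extension, as in the example filename above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// withHashParticle is a hypothetical helper (not the PR's WithHash()).
// It inserts hash as a dot-separated component before the final extension.
// An empty hash leaves the name unchanged, mirroring the "optional" particle.
func withHashParticle(name, hash string) string {
	if hash == "" {
		return name
	}
	ext := filepath.Ext(name) // e.g. ".seg"
	return strings.TrimSuffix(name, ext) + "." + hash + ext
}

func main() {
	fmt.Println(withHashParticle("v1.0-000000-001000-headers.seg", "abc123def0"))
	fmt.Println(withHashParticle("v1.0-000000-001000-headers.seg", ""))
}
```

Because the particle is an extra dot-separated component, a name built without a hash is byte-for-byte the old format.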

@anacrolix
Contributor Author

#19150

@anacrolix anacrolix self-assigned this Mar 4, 2026
@anacrolix anacrolix force-pushed the anacrolix/cache-busting-snapshots branch from 9a059b8 to 5468163 Compare March 6, 2026 00:44
@anacrolix
Contributor Author

Manually dispatched two additional workflow runs to validate the snapshot filename changes against real-world data:

  • Manifest Check — builds the downloader binary and runs manifest-verify against all 6 production webseed servers (mainnet, bor-mainnet, gnosis, chiado, sepolia, amoy), directly parsing real snapshot filenames from the listings. This verifies the filename parsing changes are compatible with existing production filenames.

  • QA Snapshot Download — runs Erigon against mainnet and exercises the full snapshot pipeline: filename discovery, torrent download, file opening, and indexing with real .seg files.

These don't trigger automatically on this PR since it targets a non-release branch and go.mod wasn't changed.

@anacrolix
Contributor Author

The 3 failing checks (gnosis-rpc-integ-tests, mainnet-rpc-integ-tests, mainnet-rpc-integ-tests-remote) are unrelated to this PR. They are RPC response comparison tests running against a live reference node and are failing intermittently across multiple unrelated PRs right now (e.g. alex/etl_mmap_34, alex/histoy_table_format_change_34). Re-running until green.

@anacrolix anacrolix marked this pull request as ready for review March 6, 2026 01:49
…enames

Support an optional hash particle in snapshot filenames for cache busting.
The particle sits as a dot-separated component before the file extension:

  V2: v1.0-000000-001000-headers.abc123def0.seg
  V3: v12.13-accounts.100-164.abc123def0.efi

Filenames without the particle parse identically to before. The existing
extension-stripping loop in ParseFileName already handled extra dot
components; this change captures the first one as FileInfo.Hash.

Adds WithHash/As hash preservation, construction helpers, and
hash-tolerant glob masks.
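
A rough sketch of the parsing idea described in this commit message, for illustration only. `splitHash` is a hypothetical function and not the PR's ParseFileName; it assumes the real extension has already been identified and that dotted components containing "-" belong to the name proper, not to the suffix chain.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitHash is a hypothetical stand-in for the extension-stripping loop.
// It removes the final extension, then strips trailing dot components that
// contain no "-" and records the first one stripped as the hash particle.
func splitHash(name string) (base, hash, ext string) {
	ext = filepath.Ext(name)
	base = strings.TrimSuffix(name, ext)
	for e := filepath.Ext(base); e != "" && !strings.Contains(e, "-"); e = filepath.Ext(base) {
		base = strings.TrimSuffix(base, e)
		if hash == "" {
			hash = e[1:] // strip leading dot
		}
	}
	return base, hash, ext
}

func main() {
	for _, name := range []string{
		"v1.0-000000-001000-headers.abc123def0.seg",
		"v12.13-accounts.100-164.abc123def0.efi",
		"v1.0-000000-001000-headers.seg", // no particle
	} {
		base, hash, ext := splitHash(name)
		fmt.Printf("%s -> base=%q hash=%q ext=%q\n", name, base, hash, ext)
	}
}
```

For a name without the particle the loop body never runs, so the base and extension come out exactly as they would have before the change.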
…fter compression

Compute a truncated SHA256 hash of .seg file content after compression
and rename the file to include it as a cache-busting particle. This
ensures snapshot filenames change when content changes, preventing stale
BitTorrent downloads.

Changes:
- Add ApplyContentHash/computeFileHash helpers in db/snaptype/files.go
- Add hash field to DirtySegment, update FileName() and FileInfo()
- Call ApplyContentHash after Compress() in all snapshot generation
  paths: dumpRange, ExtractRange, merge, caplin beacon/blob/state dumps
- Update merge() to return updated FileInfo for correct error cleanup
- Fix FileInfo.As() to strip hash (content-specific per type)
- Fix ReplaceVersionWithMask to match optional hash in glob patterns
@anacrolix anacrolix force-pushed the anacrolix/cache-busting-snapshots branch from 89a48cc to 7c21bf8 Compare March 27, 2026 04:47
Contributor

Copilot AI left a comment

Pull request overview

This PR adds an optional cache-busting content-hash particle to snapshot filenames and propagates it through snapshot generation and snapshot-sync code paths to avoid stale BitTorrent artifacts when snapshot content changes.

Changes:

  • Added FileInfo.Hash, WithHash(), and helpers/masks to construct and match hashed snapshot filenames.
  • Applied content hashing after Compress() across multiple snapshot generation paths (ExtractRange, dumpRange, merge, Caplin dumps) and carried the hash through DirtySegment.
  • Updated tests to validate parsing/round-tripping of hashed filenames and to tolerate hashed outputs in merge tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| db/version/file_version.go | Expands version-masked patterns to also match optional hash particles. |
| db/snaptype/type.go | Applies content hash after compression in ExtractRange before index building. |
| db/snaptype/files.go | Adds hash-aware filename helpers, parsing into FileInfo.Hash, and file hashing/rename logic. |
| db/snaptype/files_test.go | Adds unit tests for hash parsing, naming helpers, masks, and ApplyContentHash. |
| db/snapshotsync/snapshots.go | Extends DirtySegment to carry and emit hashed snapshot filenames. |
| db/snapshotsync/snapshots_test.go | Updates merge test to locate merged output via a hash-tolerant glob mask. |
| db/snapshotsync/merger.go | Applies content hash after merge compression and propagates into DirtySegment. |
| db/snapshotsync/freezeblocks/caplin_snapshots.go | Applies content hash after compression for Caplin block/blob dumps. |
| db/snapshotsync/freezeblocks/block_snapshots.go | Applies content hash after compression for execution-layer snapshot dumps. |
| db/snapshotsync/caplin_state_snapshots.go | Applies content hash after compression for Caplin state dumps. |

Comment thread db/snaptype/files.go
Comment thread db/snapshotsync/merger.go
Comment thread db/version/file_version.go Outdated
Comment thread db/snaptype/files.go
anacrolix and others added 3 commits April 17, 2026 12:30
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Comment thread db/snaptype/files.go
Comment on lines 181 to 186
	for ext := filepath.Ext(croppedFileName); ext != "" && !strings.Contains(ext, "-"); ext = filepath.Ext(croppedFileName) {
		croppedFileName = strings.TrimSuffix(croppedFileName, ext)
		if res.Hash == "" {
			res.Hash = ext[1:] // strip leading dot
		}
	}

Copilot AI Apr 29, 2026

The hash-extraction loop sets res.Hash to the first stripped dotted suffix unconditionally. This will mis-classify non-hash suffixes (e.g. .torrent in ...ef.torrent.tmp..., or .seg/.idx when parsing .torrent* wrappers) as the content hash. A more robust approach is to only accept a hash particle when the suffix looks like a hex digest (and/or when it is not a known snapshot extension like .seg/.idx/.efi).
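
One way the suggested check could look, as a sketch only. `looksLikeHash` is a hypothetical predicate, and the fixed length of 10 lowercase hex characters is an assumption based on the example particle `abc123def0`; the PR may use a different length or rule.

```go
package main

import "fmt"

// hashHexLen is an assumption: the length of the example particle "abc123def0".
const hashHexLen = 10

// looksLikeHash (hypothetical) accepts a suffix as a hash particle only when
// it has the expected length and consists solely of lowercase hex digits.
// The length check also rejects short extensions made only of hex letters.
func looksLikeHash(s string) bool {
	if len(s) != hashHexLen {
		return false
	}
	for _, r := range s {
		isDigit := r >= '0' && r <= '9'
		isHexLetter := r >= 'a' && r <= 'f'
		if !isDigit && !isHexLetter {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"abc123def0", "torrent", "seg", "idx", "ef"} {
		fmt.Printf("%-10s %v\n", s, looksLikeHash(s))
	}
}
```

In the loop above, the assignment to `res.Hash` would then be guarded by such a predicate, so wrapper suffixes like `.torrent` are stripped without being recorded as the hash.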

Comment thread db/snaptype/type.go
Comment on lines +628 to +632
	f, err = ApplyContentHash(f)
	if err != nil {
		return lastKeyValue, err
	}


Copilot AI Apr 29, 2026

Applying the content hash to the segment filename here means the .seg name changes when content changes, but the index-building code still writes .idx files using IdxFileName(...) (which does not incorporate FileInfo.Hash). If .idx files/torrents are distributed (they are treated as seedable extensions elsewhere), this can still allow stale .idx/.idx.torrent downloads and even mismatched index+segment pairs. Consider propagating the same hash particle into index filenames as well (or otherwise tying index identity to the segment hash).
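
A minimal sketch of the "propagate the hash into index filenames" option. `idxNameForSegment` is hypothetical and is not the repository's IdxFileName; it assumes the index name is derived from the segment name by swapping the extension, which may not hold for every index type.

```go
package main

import (
	"fmt"
	"strings"
)

// idxNameForSegment (hypothetical) derives an index filename from a segment
// filename by swapping ".seg" for ".idx". Any hash particle in the segment
// name is carried over, so the index name changes whenever the segment's does.
func idxNameForSegment(segName string) (string, bool) {
	base, ok := strings.CutSuffix(segName, ".seg")
	if !ok {
		return "", false // not a segment file
	}
	return base + ".idx", true
}

func main() {
	name, ok := idxNameForSegment("v1.0-000000-001000-headers.abc123def0.seg")
	fmt.Println(name, ok)
}
```

With this pairing, a stale `.idx` left over from older segment content has a different name and cannot be matched with the new `.seg` by accident.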
