fix(payloadstorage): retry failed epoch prunes instead of skipping#1336
Open
redstartechno wants to merge 1 commit into
Open
fix(payloadstorage): retry failed epoch prunes instead of skipping#1336redstartechno wants to merge 1 commit into
redstartechno wants to merge 1 commit into
Conversation
ManagedStorage.cleanup() advanced minPruned to threshold before the async prune goroutines completed, so any epoch whose PruneEpoch failed (DB outage, file I/O error) was never retried and its payload data accumulated on disk indefinitely. Prune sequentially outside the cache lock and advance minPruned only past epochs that pruned successfully; a failure stops the advance and the epoch is retried on the next cleanup tick. The maxPruneLookback guard still bounds catch-up, so persistently failing epochs are eventually abandoned as designed. Pruning also no longer overlaps with itself and never runs while holding the cache mutex. Fixes gonka-ai#850 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this solve?
Fixes #850. When an automatic epoch prune fails in
ManagedStorage(DB outage, file I/O error), that epoch is permanently skipped — its payload data is never pruned and accumulates on disk indefinitely on API nodes.How do you know this is a real problem?
In
decentralized-api/payloadstorage/managed_storage.go,cleanup()spawned one goroutine per epoch to callstorage.PruneEpoch, only logged failures, and advancedm.minPruned = thresholdimmediately — before the goroutines completed. The next tick starts from the newminPruned, so a failed epoch is never retried. The failure modes are realistic:HybridStorage.PruneEpochalready anticipates and logs PostgreSQL/file prune errors, and the issue reproduces them (intermittent DB connectivity). The package is unchanged since v0.2.8, so the bug is live onmain.How does this solve the problem?
cleanup()now computes the prune range under the cache mutex, releases it, and prunes epochs sequentially in a newpruneEpochsmethod, advancingminPrunedonly past epochs that pruned successfully. On the first failure it stops and returns, so that epoch is retried on the next 30-second tick. All threePruneEpochimplementations are idempotent (os.RemoveAll,DROP TABLE IF EXISTS), so retries are safe.Note one deviation from the fix sketched in the issue: pruning runs outside the cache mutex rather than while holding it. Holding
m.muduring storage I/O would block allRetrieve/Storecache operations for the duration of a slow prune (e.g. a PostgreSQL timeout). SincecleanupLoopis the only caller ofcleanup(), sequential pruning is still guaranteed not to overlap with itself — which also resolves the secondary concern in the issue (concurrentPruneEpochcalls for the same epoch from overlapping ticks).What risks does this introduce? How can we mitigate them?
maxPruneLookbackto at most 10 epochs per tick, and it runs on a background 30-second ticker, so throughput is not a concern.maxPruneLookbackguard jumpsminPrunedforward — exactly the documented design ("only last 10 epochs, older data requires manual prune"), so failures cannot stall pruning forever.Mitigation: regression tests below cover the stop-at-failure and retry-then-catch-up paths; existing tests cover the unchanged lookback and no-prune-below-retain behaviors.
How do you know this PR fixes the problem?
Two new regression tests using the package's existing mock seam (
pruneCbhook follows the existingstoreCbpattern):TestManagedStorage_AutoPruneStopsAtFailedEpoch— a failing epoch halts the advance;minPrunedstays at the failed epoch; later epochs are not pruned out of order.TestManagedStorage_AutoPruneRetriesFailedEpoch— a transient failure is retried on the next cleanup; all epochs below threshold end up pruned exactly once;minPrunedreaches threshold.Both tests were verified to fail against the previous implementation (run with the old
managed_storage.gorestored viagit stash) and pass with this change. The six pre-existingManagedStoragetests still pass; three of them containedtime.Sleepwaits for the old async goroutines, which are removed since pruning is now synchronous.Which components are affected?
decentralized-api/payloadstorage/managed_storage.godecentralized-api/payloadstorage/managed_storage_test.goNo API or on-chain changes.
Testing & evidence
GOTOOLCHAIN=go1.24.2 go test ./payloadstorage/ -run TestManagedStorage -count=1 -v(theTestPostgresStorage_*/testcontainers tests need Docker and are unrelated).Before
New regression tests run against the previous
managed_storage.go(restored viagit checkout main -- ...). Exact counts vary between runs because the old pruning was asynchronous — one representative run:After
🤖 Generated with Claude Code