
fix(payloadstorage): retry failed epoch prunes instead of skipping #1336

Open
redstartechno wants to merge 1 commit into gonka-ai:main from redstartechno:fix/850-managed-storage-prune-retry


What problem does this solve?

Fixes #850. When an automatic epoch prune fails in ManagedStorage (DB outage, file I/O error), that epoch is permanently skipped — its payload data is never pruned and accumulates on disk indefinitely on API nodes.

How do you know this is a real problem?

In decentralized-api/payloadstorage/managed_storage.go, cleanup() spawned one goroutine per epoch to call storage.PruneEpoch, only logged failures, and advanced m.minPruned = threshold immediately — before the goroutines completed. The next tick starts from the new minPruned, so a failed epoch is never retried. The failure modes are realistic: HybridStorage.PruneEpoch already anticipates and logs PostgreSQL/file prune errors, and the issue reproduces them (intermittent DB connectivity). The package is unchanged since v0.2.8, so the bug is live on main.

How does this solve the problem?

cleanup() now computes the prune range under the cache mutex, releases it, and prunes epochs sequentially in a new pruneEpochs method, advancing minPruned only past epochs that pruned successfully. On the first failure it stops and returns, so that epoch is retried on the next 30-second tick. All three PruneEpoch implementations are idempotent (os.RemoveAll, DROP TABLE IF EXISTS), so retries are safe.

Note one deviation from the fix sketched in the issue: pruning runs outside the cache mutex rather than while holding it. Holding m.mu during storage I/O would block all Retrieve/Store cache operations for the duration of a slow prune (e.g. a PostgreSQL timeout). Since cleanupLoop is the only caller of cleanup(), sequential pruning is still guaranteed not to overlap with itself — which also resolves the secondary concern in the issue (concurrent PruneEpoch calls for the same epoch from overlapping ticks).

What risks does this introduce? How can we mitigate them?

  • Pruning is now sequential instead of parallel. Bounded by maxPruneLookback to at most 10 epochs per tick, and it runs on a background 30-second ticker, so throughput is not a concern.
  • A cleanup tick now blocks until pruning finishes. Cache eviction completes before pruning starts, and a delayed next tick only delays further pruning — cache reads/writes are unaffected because the mutex is released first.
  • Persistently failing epoch. It is retried each tick until the existing maxPruneLookback guard jumps minPruned forward — exactly the documented design ("only last 10 epochs, older data requires manual prune"), so failures cannot stall pruning forever.
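How a lookback guard bounds a persistently failing epoch can be sketched as below. The exact condition in managed_storage.go is not shown in this PR, so `pruneStart` and its comparison are assumptions; only the name `maxPruneLookback` and the value 10 come from the description.

```go
package main

import "fmt"

const maxPruneLookback = 10 // value taken from the description above

// pruneStart returns the epoch at which a tick begins pruning. If minPruned
// has fallen more than maxPruneLookback behind threshold, it jumps forward
// and the skipped epochs are left for a manual prune.
func pruneStart(minPruned, threshold uint64) uint64 {
	if threshold > maxPruneLookback && minPruned < threshold-maxPruneLookback {
		return threshold - maxPruneLookback
	}
	return minPruned
}

func main() {
	// Epoch 3 keeps failing, so minPruned is stuck at 3 while threshold grows.
	for _, threshold := range []uint64{8, 13, 14} {
		fmt.Printf("threshold=%d start=%d\n", threshold, pruneStart(3, threshold))
	}
}
```

Under these assumed semantics the start stays at 3 for thresholds 8 and 13 and moves to 4 at threshold 14, at which point epoch 3 is abandoned and pruning continues.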

Mitigation: regression tests below cover the stop-at-failure and retry-then-catch-up paths; existing tests cover the unchanged lookback and no-prune-below-retain behaviors.

How do you know this PR fixes the problem?

Two new regression tests use the package's existing mock seam (the new pruneCb hook follows the existing storeCb pattern):

  • TestManagedStorage_AutoPruneStopsAtFailedEpoch — a failing epoch halts the advance; minPruned stays at the failed epoch; later epochs are not pruned out of order.
  • TestManagedStorage_AutoPruneRetriesFailedEpoch — a transient failure is retried on the next cleanup; all epochs below threshold end up pruned exactly once; minPruned reaches threshold.

Both tests were verified to fail against the previous implementation (run with the old managed_storage.go restored via git stash) and pass with this change. The six pre-existing ManagedStorage tests still pass; three of them contained time.Sleep waits for the old async goroutines, which are removed since pruning is now synchronous.

Which components are affected?

  • decentralized-api/payloadstorage/managed_storage.go
  • decentralized-api/payloadstorage/managed_storage_test.go

No API or on-chain changes.

Testing & evidence

GOTOOLCHAIN=go1.24.2 go test ./payloadstorage/ -run TestManagedStorage -count=1 -v (the TestPostgresStorage_*/testcontainers tests need Docker and are unrelated).

Before

New regression tests run against the previous managed_storage.go (restored via git checkout main -- ...). Exact counts vary between runs because the old pruning was asynchronous — one representative run:

=== RUN   TestManagedStorage_AutoPruneStopsAtFailedEpoch
    managed_storage_test.go:220: expected 3 epochs pruned before failure, got 4: [0 1 2 4]
    managed_storage_test.go:224: epoch 4 should not be pruned past the failed epoch 3
    managed_storage_test.go:232: minPruned should stop at failed epoch 3, got 8
--- FAIL: TestManagedStorage_AutoPruneStopsAtFailedEpoch (0.00s)
=== RUN   TestManagedStorage_AutoPruneRetriesFailedEpoch
    managed_storage_test.go:261: expected 8 epochs pruned after retry, got 1: [1]
    managed_storage_test.go:272: epoch 0 was never pruned
    managed_storage_test.go:272: epoch 2 was never pruned
    managed_storage_test.go:272: epoch 3 was never pruned
    managed_storage_test.go:272: epoch 4 was never pruned
    managed_storage_test.go:272: epoch 5 was never pruned
    managed_storage_test.go:272: epoch 6 was never pruned
    managed_storage_test.go:272: epoch 7 was never pruned
--- FAIL: TestManagedStorage_AutoPruneRetriesFailedEpoch (0.00s)
FAIL
FAIL    decentralized-api/payloadstorage        0.099s

After

--- PASS: TestManagedStorage_CacheHit (0.00s)
--- PASS: TestManagedStorage_CacheExpiration (0.02s)
--- PASS: TestManagedStorage_StoreTracksMaxEpoch (0.00s)
--- PASS: TestManagedStorage_AutoPruneTriggersInCleanup (0.03s)
--- PASS: TestManagedStorage_AutoPruneSkipsOldEpochs (0.00s)
--- PASS: TestManagedStorage_AutoPruneStopsAtFailedEpoch (0.00s)
--- PASS: TestManagedStorage_AutoPruneRetriesFailedEpoch (0.00s)
--- PASS: TestManagedStorage_NoPruneWhenBelowRetainCount (0.00s)
PASS
ok      decentralized-api/payloadstorage        0.138s

🤖 Generated with Claude Code

ManagedStorage.cleanup() advanced minPruned to threshold before the
async prune goroutines completed, so any epoch whose PruneEpoch failed
(DB outage, file I/O error) was never retried and its payload data
accumulated on disk indefinitely.

Prune sequentially outside the cache lock and advance minPruned only
past epochs that pruned successfully; a failure stops the advance and
the epoch is retried on the next cleanup tick. The maxPruneLookback
guard still bounds catch-up, so persistently failing epochs are
eventually abandoned as designed. Pruning also no longer overlaps
with itself and never runs while holding the cache mutex.

Fixes gonka-ai#850

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Bug: ManagedStorage silently skips failed epoch pruning — minPruned advanced before goroutines complete
