fix(gc): skip deleted miner actors in StorageGCMark #1002
magik6k
left a comment
Looks right, lmk if you're able to test this case
Test Results: Calibration Network
Environment
Before Fix (main branch)
After Fix
Live Miner Protection Verified
Filetype breakdown for
Removed Miner
There is a case where a user might run a build for the wrong network, or forget to upgrade at a network upgrade. We would see similar errors there as well. How do we safeguard against that?
Good point — pushed a safeguard for this. Before marking any sectors as orphaned, we now cross-check the miner address against all
So a wrong-network build or missed upgrade would hit the first case: the miner is still in config, so GC refuses to touch it. The orphaned path only fires when the miner is genuinely gone from both chain and config. Combined with the existing two-phase approve/sweep system, that's two layers of protection against accidental deletion. What do you think? Does this address the concern?
When a miner actor no longer exists on-chain (e.g. removed from config after deletion), StorageGCMark would fail with 'actor not found' and enter a permanent retry loop, blocking all storage GC.

Handle the actor-not-found case gracefully in both Stage 1 (sector liveness check) and Stage 3 (snap sector-key cleanup):

- Stage 1: Skip loading miner state for deleted actors. Their sectors remain in the toRemove set since there are no on-chain precommits, live, or unproven sectors to subtract.
- Stage 3: Skip finality-tipset actor lookups for deleted miners. Snap sector-key cleanup is irrelevant for non-existent miners.

Only the specific 'actor not found' error triggers this path. Transient RPC errors (timeouts, connection issues) still fail the task as before, preventing accidental GC of sectors for healthy miners during network disruptions.

Fixes a scenario where removing a calibration/test miner from config causes StorageGCMark to fail 100% of runs indefinitely.
The continue skipped toRemove.Set() for the first sector discovered for a dead miner. Subsequent sectors were fine since they bypassed the if-block. Now explicitly set the first sector before continuing.
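The control-flow bug described above can be reproduced in isolation. The following is a simplified sketch, not the actual Curio code: `sectorID`, `markSectors`, and the `isDead` callback are hypothetical names, and a plain map stands in for the real toRemove set.

```go
package main

import "fmt"

// sectorID is a hypothetical stand-in for a (miner, sector number) pair.
type sectorID struct {
	Miner  uint64
	Number uint64
}

// markSectors walks the sectors found in storage and records GC candidates.
// isDead is a hypothetical callback standing in for the "actor not found" check.
// With fixFirst=false it reproduces the bug: the first sector seen for a dead
// miner is lost, because `continue` runs before the sector is recorded.
func markSectors(sectors []sectorID, isDead func(uint64) bool, fixFirst bool) map[sectorID]bool {
	toRemove := map[sectorID]bool{}
	seen := map[uint64]bool{}

	for _, s := range sectors {
		if !seen[s.Miner] {
			// First time this miner is encountered: state would be loaded here.
			seen[s.Miner] = true
			if isDead(s.Miner) {
				if fixFirst {
					toRemove[s] = true // the fix: record the first sector explicitly
				}
				continue
			}
		}
		toRemove[s] = true
	}
	return toRemove
}

func main() {
	sectors := []sectorID{{1000, 1}, {1000, 2}, {1000, 3}}
	dead := func(uint64) bool { return true }

	fmt.Println("buggy:", len(markSectors(sectors, dead, false)))
	fmt.Println("fixed:", len(markSectors(sectors, dead, true)))
}
```

With three sectors belonging to one dead miner, the buggy variant records two of them and the fixed variant records all three.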
Lex raised a valid concern: 'actor not found' can also happen when a node is built for the wrong network or missed a network upgrade. In that case, the miner is healthy but the node can't see it.

Added a cross-check: before treating an 'actor not found' miner as deleted, verify it doesn't appear in any harmony_config layer. If it does, the error is likely a misconfiguration, so fail the task loudly instead of marking sectors for GC.

The orphaned-sector GC path now only triggers when:

1. StateGetActor returns 'actor not found', AND
2. The miner address is NOT in any config layer

This prevents accidental GC of sectors for healthy miners that appear missing due to wrong-network or upgrade issues.


Problem
When a miner actor is removed from the chain (e.g. a test/calibration miner that was killed) but its sector files still exist in storage paths, StorageGCMark calls StateGetActor and gets "actor not found". This is treated as a fatal error, causing the task to fail and retry indefinitely (100% failure rate, every 9 minutes).

Root Cause
StorageGCMark.Do() iterates all sectors in storage paths, loads miner actor state for each unique miner ID, then checks precommits/live/unproven sectors to decide what to GC. Two StateGetActor call sites had no handling for deleted actors: Stage 1 (sector liveness check) and Stage 3 (snap sector-key cleanup).

Fix
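Handle "actor not found" specifically at both sites:

- Stage 1: skip loading miner state for deleted actors. Their sectors remain in toRemove; they are orphaned since the miner has no on-chain precommits, live, or unproven sectors to subtract.
- Stage 3: skip finality-tipset actor lookups for deleted miners, since snap sector-key cleanup is irrelevant for non-existent miners.

Safety: Edge Cases Considered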
actor not found
Only the specific "actor not found" string triggers the skip. Any other StateGetActor error (transient RPC failures, timeouts) still fails the task, preventing accidental GC of sectors for healthy miners.

Uses string matching (strings.Contains) consistent with existing callers in cmd/sptool/toolbox_deal_client.go, since the typed api.ErrActorNotFound may not survive the JSON-RPC round-trip.