fix(gc): skip deleted miner actors in StorageGCMark #1002

Open
Reiers wants to merge 3 commits into main from fix/gc-mark-skip-deleted-miners

Conversation

Reiers (Contributor) commented on Feb 16, 2026

Problem

When a miner actor is removed from the chain (e.g. a test/calibration miner that was killed), but its sector files still exist in storage paths, StorageGCMark calls StateGetActor and gets "actor not found". This is treated as a fatal error, causing the task to fail and retry indefinitely (100% failure rate, every 9 minutes).

Root Cause

StorageGCMark.Do() iterates all sectors in storage paths, loads miner actor state for each unique miner ID, then checks precommits/live/unproven sectors to decide what to GC. Two StateGetActor call sites had no handling for deleted actors:

  1. Stage 1 (line ~112): Loading miner state at chain head for liveness checks
  2. Stage 3 (line ~250): Loading miner state at finality height for snap sector-key cleanup

Fix

Handle "actor not found" specifically at both sites:

  • Stage 1: Skip loading miner state. Sectors stay in toRemove — they are orphaned since the miner has no on-chain precommits, live, or unproven sectors to subtract.
  • Stage 3: Skip finality-tipset lookup. Snap sector-key cleanup is irrelevant for non-existent miners.
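
For illustration, here is a minimal, self-contained sketch of the Stage 1 shape described above; SectorRef, markOrphans, and getActor are placeholder names for this sketch, not the actual Curio StorageGCMark code.

```go
package gcsketch

import "strings"

// SectorRef is a simplified stand-in for a sector file found in storage paths.
type SectorRef struct {
	Miner  uint64 // actor ID of the owning miner
	Number uint64 // sector number
}

// actorNotFound treats only the specific "actor not found" wording as a
// deleted actor; everything else is propagated as a real failure.
func actorNotFound(err error) bool {
	return err != nil && strings.Contains(err.Error(), "actor not found")
}

// markOrphans seeds toRemove with every sector found on disk. Miners whose
// actor still exists get their on-chain precommit/live/unproven sectors
// subtracted later by the caller; deleted miners have nothing to subtract,
// so their files stay marked. Any other lookup error aborts the run so a
// transient RPC failure can never cause GC.
func markOrphans(sectors []SectorRef, getActor func(miner uint64) error) (toRemove map[SectorRef]bool, deleted map[uint64]bool, err error) {
	toRemove = map[SectorRef]bool{}
	deleted = map[uint64]bool{}
	checked := map[uint64]bool{}

	for _, sec := range sectors {
		if !checked[sec.Miner] {
			checked[sec.Miner] = true
			if aerr := getActor(sec.Miner); aerr != nil {
				if !actorNotFound(aerr) {
					return nil, nil, aerr // fail the task, retry later
				}
				deleted[sec.Miner] = true // skip state loading for this miner
			}
		}
		toRemove[sec] = true
	}
	return toRemove, deleted, nil
}
```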

Safety: Edge Cases Considered

Scenario | Actor on-chain? | StateGetActor result | GC behavior
--- | --- | --- | ---
Miner removed from config but alive on-chain | Yes | Success | Protected — normal live/unproven subtraction
Miner truly deleted / never existed | No | actor not found | GC orphaned files
RPC timeout / network glitch | Unknown | Other error | Task fails & retries — no accidental GC
Mid-migration, miner healthy elsewhere | Yes | Success | Protected — actor still exists
Miner faulted / missed deadlines | Yes | Success | Protected — actor still exists

Only the specific "actor not found" string triggers the skip. Any other StateGetActor error (transient RPC failures, timeouts) still fails the task, preventing accidental GC of sectors for healthy miners.

Uses string matching (strings.Contains) consistent with existing callers in cmd/sptool/toolbox_deal_client.go, since the typed api.ErrActorNotFound may not survive the JSON-RPC round-trip.
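
A small runnable example of the matching behaviour; the "actor not found" text mirrors the failure quoted in the test results below, and the snippet is illustrative rather than the real call site.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

func main() {
	// After a JSON-RPC round-trip the client usually sees a flat error string,
	// so substring matching is what distinguishes the two cases.
	gone := errors.New("get miner actor: actor not found") // deleted actor
	flaky := errors.New("RPC timeout")                     // transient failure

	for _, err := range []error{gone, flaky} {
		skip := strings.Contains(err.Error(), "actor not found")
		fmt.Printf("%-45s skip=%v\n", err.Error(), skip)
	}
}
```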

magik6k (Collaborator) left a comment

Looks right, lmk if you're able to test this case

Reiers (Contributor, Author) commented on Feb 16, 2026

Test Results — Calibration Network

Environment

  • 3 miners in storage paths: 2 live (t0143103 — 3,333 active sectors, t0144416 — 452 active), 1 removed (t0138097 — actor exists on-chain but removed from config)
  • StorageGCMark was stuck in a 100% failure loop (~4,985 consecutive failures)

Before Fix (main branch)

  • StorageGCMark crashes every run: Failed error: get miner actor: actor not found
  • No GC marks written — task crashes before the marking transaction

After Fix

  • StorageGCMark completes successfully, error is gone
  • GC marks written, sweep ran, sectors removed from DB and Web UI

Live Miner Protection Verified

Filetype breakdown for t0143103 (3,333 active + 621 terminated sectors):

Filetype | Type | Count | Source
--- | --- | --- | ---
1 | Unsealed | 598 | Terminated sectors (Stage 2)
2 | Sealed | 3,545 | ~598 terminated + ~2,947 snap-sector-keys (Stage 3)
4 | Cache | 607 | Terminated sectors (Stage 1)
8 | Update | 598 | Terminated sectors (Stage 1)
16 | Update-cache | 598 | Terminated sectors (Stage 1)
  • Filetypes 1, 4, 8, 16 all cluster at ~598 — terminated sectors only
  • Filetype 2 (Sealed): the ~2,947 excess marks are snap-sector-keys from Stage 3 (obsolete pre-snap sealed copies replaced by update files)
  • No active sector had update, update-cache, cache, or unsealed files marked
  • t0144416 (452 active sectors) — 0 GC marks

Removed Miner

  • t0138097 (actor exists on-chain, removed from config): 3,660 files marked and cleaned

Reiers force-pushed the fix/gc-mark-skip-deleted-miners branch from 0ec8fc0 to f0690de on February 16, 2026 at 18:57
Reiers marked this pull request as ready for review on February 16, 2026 at 18:57
Reiers requested a review from a team as a code owner on February 16, 2026 at 18:57
Reiers changed the title from "WIP: fix(gc): skip deleted miner actors in StorageGCMark" to "fix(gc): skip deleted miner actors in StorageGCMark" on Feb 16, 2026
Reiers (Contributor, Author) commented on Feb 16, 2026

Before:

[Screenshot 2026-02-16 at 18:29:53]

After:

[Screenshot 2026-02-16 at 19:48:22]

Reiers requested review from LexLuthr and magik6k and removed the request for magik6k on February 16, 2026 at 19:08
LexLuthr (Contributor) commented

There is a case where a user might be running an incorrect build for the wrong network, or might have forgotten to upgrade at a network upgrade. We will see similar errors there as well. How do we safeguard against that?

Reiers (Contributor, Author) commented on Feb 18, 2026

Good point — pushed a safeguard for this.

Before marking any sectors as orphaned, we now cross-check the miner address against all harmony_config layers. The logic is:

  • Miner not found on-chain + IS in config → Task fails loudly: "miner f0X is configured but not found on-chain — possible network mismatch, refusing to GC". Nothing gets marked. Operator sees the error and investigates.
  • Miner not found on-chain + NOT in config → Sectors treated as orphaned, marked for GC (still requires manual approval before sweep deletes).

So a wrong-network build or missed upgrade would hit the first case — the miner is still in config, so GC refuses to touch it. The orphaned path only fires when the miner is genuinely gone from both chain and config.

Combined with the existing two-phase approve/sweep system, that's two layers of protection against accidental deletion.
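
A minimal sketch of that safeguard, assuming the harmony_config table exposes title and config columns holding each layer's raw TOML; the actual Curio implementation, schema access, and error wording may differ.

```go
package gcsketch

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// minerInAnyConfigLayer reports whether minerAddr (e.g. "t0138097") appears in
// any harmony_config layer. A simple substring scan over the raw TOML is enough
// for a safety check: a false positive only makes GC more conservative.
func minerInAnyConfigLayer(ctx context.Context, db *sql.DB, minerAddr string) (bool, error) {
	rows, err := db.QueryContext(ctx, `SELECT title, config FROM harmony_config`)
	if err != nil {
		return false, fmt.Errorf("reading harmony_config: %w", err)
	}
	defer rows.Close()

	for rows.Next() {
		var title, cfg string
		if err := rows.Scan(&title, &cfg); err != nil {
			return false, err
		}
		if strings.Contains(cfg, minerAddr) {
			return true, nil
		}
	}
	return false, rows.Err()
}

// gcOrphanCheck encodes the decision described above: refuse to GC when the
// miner is still configured, otherwise let the orphan path proceed (which
// still requires manual approval before the sweep deletes anything).
func gcOrphanCheck(ctx context.Context, db *sql.DB, minerAddr string) error {
	inConfig, err := minerInAnyConfigLayer(ctx, db, minerAddr)
	if err != nil {
		return err
	}
	if inConfig {
		// Wording follows the message described in the comment above.
		return fmt.Errorf("miner %s is configured but not found on-chain; possible network mismatch, refusing to GC", minerAddr)
	}
	return nil // not on-chain and not configured: treat sectors as orphaned
}
```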

What do you think — does this address the concern?

Commits

When a miner actor no longer exists on-chain (e.g. removed from config
after deletion), StorageGCMark would fail with 'actor not found' and
enter a permanent retry loop, blocking all storage GC.

Handle the actor-not-found case gracefully in both Stage 1 (sector
liveness check) and Stage 3 (snap sector-key cleanup):

- Stage 1: Skip loading miner state for deleted actors. Their sectors
  remain in the toRemove set since there are no on-chain precommits,
  live, or unproven sectors to subtract.
- Stage 3: Skip finality-tipset actor lookups for deleted miners.
  Snap sector-key cleanup is irrelevant for non-existent miners.

Only the specific 'actor not found' error triggers this path. Transient
RPC errors (timeouts, connection issues) still fail the task as before,
preventing accidental GC of sectors for healthy miners during network
disruptions.

Fixes a scenario where removing a calibration/test miner from config
causes StorageGCMark to fail 100% of runs indefinitely.

The continue skipped toRemove.Set() for the first sector discovered
for a dead miner. Subsequent sectors were fine since they bypassed
the if-block. Now explicitly set the first sector before continuing.
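
A compact sketch of the loop shape this commit describes; sectorFile, markAll, and the callback parameters are placeholders rather than the actual code.

```go
package gcsketch

// sectorFile is a placeholder for a sector file discovered in storage paths.
type sectorFile struct {
	Miner  uint64
	Number uint64
}

// markAll visits sector files in storage-path order; the liveness lookup runs
// only the first time each miner ID is encountered.
func markAll(sectors []sectorFile, minerGone func(miner uint64) bool, mark func(sectorFile)) {
	seen := map[uint64]bool{}
	for _, sec := range sectors {
		if !seen[sec.Miner] {
			seen[sec.Miner] = true
			if minerGone(sec.Miner) {
				// Fix: mark this first sector explicitly before continuing.
				// Previously the bare continue skipped the mark for it, while
				// later sectors of the same miner bypassed this branch and
				// were marked normally below.
				mark(sec)
				continue
			}
		}
		// (in the real task, the loaded miner state is also used around here)
		mark(sec)
	}
}
```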

Lex raised a valid concern: 'actor not found' can also happen when a node
is built for the wrong network or missed a network upgrade. In that case,
the miner is healthy but the node can't see it.

Added a cross-check: before treating an 'actor not found' miner as deleted,
verify it doesn't appear in any harmony_config layer. If it does, the error
is likely a misconfiguration — fail the task loudly instead of marking
sectors for GC.

The orphaned-sector GC path now only triggers when:
1. StateGetActor returns 'actor not found', AND
2. The miner address is NOT in any config layer

This prevents accidental GC of sectors for healthy miners that appear
missing due to wrong-network or upgrade issues.
Reiers force-pushed the fix/gc-mark-skip-deleted-miners branch from ee52494 to 7dd195e on February 18, 2026 at 13:57