fix(gc): skip deleted miner actors in StorageGCMark #1002

Open
Reiers wants to merge 3 commits into main from fix/gc-mark-skip-deleted-miners

Conversation

Reiers (Contributor) commented on Feb 16, 2026

Problem

When a miner actor is removed from the chain (e.g. a test/calibration miner that was killed), but its sector files still exist in storage paths, StorageGCMark calls StateGetActor and gets "actor not found". This is treated as a fatal error, causing the task to fail and retry indefinitely (100% failure rate, every 9 minutes).

Root Cause

StorageGCMark.Do() iterates all sectors in storage paths, loads miner actor state for each unique miner ID, then checks precommits/live/unproven sectors to decide what to GC. Two StateGetActor call sites had no handling for deleted actors:

  1. Stage 1 (line ~112): Loading miner state at chain head for liveness checks
  2. Stage 3 (line ~250): Loading miner state at finality height for snap sector-key cleanup

Fix

Handle "actor not found" specifically at both sites:

  • Stage 1: Skip loading miner state. Sectors stay in toRemove — they are orphaned since the miner has no on-chain precommits, live, or unproven sectors to subtract.
  • Stage 3: Skip finality-tipset lookup. Snap sector-key cleanup is irrelevant for non-existent miners.
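
For illustration, here is a minimal, self-contained sketch of the Stage 1 shape described above; SectorRef, markOrphans, and getActor are placeholder names for this sketch, not the actual Curio StorageGCMark code.

```go
package gcsketch

import "strings"

// SectorRef is a simplified stand-in for a sector file found in storage paths.
type SectorRef struct {
	Miner  uint64 // actor ID of the owning miner
	Number uint64 // sector number
}

// actorNotFound treats only the specific "actor not found" wording as a
// deleted actor; everything else is propagated as a real failure.
func actorNotFound(err error) bool {
	return err != nil && strings.Contains(err.Error(), "actor not found")
}

// markOrphans seeds toRemove with every sector found on disk. Miners whose
// actor still exists get their on-chain precommit/live/unproven sectors
// subtracted later by the caller; deleted miners have nothing to subtract,
// so their files stay marked. Any other lookup error aborts the run so a
// transient RPC failure can never cause GC.
func markOrphans(sectors []SectorRef, getActor func(miner uint64) error) (toRemove map[SectorRef]bool, deleted map[uint64]bool, err error) {
	toRemove = map[SectorRef]bool{}
	deleted = map[uint64]bool{}
	checked := map[uint64]bool{}

	for _, sec := range sectors {
		if !checked[sec.Miner] {
			checked[sec.Miner] = true
			if aerr := getActor(sec.Miner); aerr != nil {
				if !actorNotFound(aerr) {
					return nil, nil, aerr // fail the task, retry later
				}
				deleted[sec.Miner] = true // skip state loading for this miner
			}
		}
		toRemove[sec] = true
	}
	return toRemove, deleted, nil
}
```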

Safety: Edge Cases Considered

Scenario | Actor on-chain? | StateGetActor result | GC behavior
--- | --- | --- | ---
Miner removed from config but alive on-chain | Yes | Success | Protected — normal live/unproven subtraction
Miner truly deleted / never existed | No | actor not found | GC orphaned files
RPC timeout / network glitch | Unknown | Other error | Task fails & retries — no accidental GC
Mid-migration, miner healthy elsewhere | Yes | Success | Protected — actor still exists
Miner faulted / missed deadlines | Yes | Success | Protected — actor still exists

Only the specific "actor not found" string triggers the skip. Any other StateGetActor error (transient RPC failures, timeouts) still fails the task, preventing accidental GC of sectors for healthy miners.

Uses string matching (strings.Contains) consistent with existing callers in cmd/sptool/toolbox_deal_client.go, since the typed api.ErrActorNotFound may not survive the JSON-RPC round-trip.
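
A small runnable example of the matching behaviour; the "actor not found" text mirrors the failure quoted in the test results below, and the snippet is illustrative rather than the real call site.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

func main() {
	// After a JSON-RPC round-trip the client usually sees a flat error string,
	// so substring matching is what distinguishes the two cases.
	gone := errors.New("get miner actor: actor not found") // deleted actor
	flaky := errors.New("RPC timeout")                     // transient failure

	for _, err := range []error{gone, flaky} {
		skip := strings.Contains(err.Error(), "actor not found")
		fmt.Printf("%-45s skip=%v\n", err.Error(), skip)
	}
}
```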

magik6k (Collaborator) left a comment

Looks right, lmk if you're able to test this case

Reiers (Contributor, Author) commented on Feb 16, 2026

Test Results — Calibration Network

Environment

  • 3 miners in storage paths: 2 live (t0143103 — 3,333 active sectors, t0144416 — 452 active), 1 removed (t0138097 — actor exists on-chain but removed from config)
  • StorageGCMark was stuck in a 100% failure loop (~4,985 consecutive failures)

Before Fix (main branch)

  • StorageGCMark crashes every run: Failed error: get miner actor: actor not found
  • No GC marks written — task crashes before the marking transaction

After Fix

  • StorageGCMark completes successfully, error is gone
  • GC marks written, sweep ran, sectors removed from DB and Web UI

Live Miner Protection Verified

Filetype breakdown for t0143103 (3,333 active + 621 terminated sectors):

Filetype | Type | Count | Source
--- | --- | --- | ---
1 | Unsealed | 598 | Terminated sectors (Stage 2)
2 | Sealed | 3,545 | ~598 terminated + ~2,947 snap-sector-keys (Stage 3)
4 | Cache | 607 | Terminated sectors (Stage 1)
8 | Update | 598 | Terminated sectors (Stage 1)
16 | Update-cache | 598 | Terminated sectors (Stage 1)
  • Filetypes 1, 4, 8, 16 all cluster at ~598 — terminated sectors only
  • Filetype 2 (Sealed): the ~2,947 excess marks are snap-sector-keys from Stage 3 (obsolete pre-snap sealed copies replaced by update files)
  • No active sector had update, update-cache, cache, or unsealed files marked
  • t0144416 (452 active sectors) — 0 GC marks

Removed Miner

  • t0138097 (actor exists on-chain, removed from config): 3,660 files marked and cleaned

Reiers force-pushed the fix/gc-mark-skip-deleted-miners branch from 0ec8fc0 to f0690de on February 16, 2026 at 18:57
Reiers marked this pull request as ready for review on February 16, 2026 at 18:57
Reiers requested a review from a team as a code owner on February 16, 2026 at 18:57
Reiers changed the title from "WIP: fix(gc): skip deleted miner actors in StorageGCMark" to "fix(gc): skip deleted miner actors in StorageGCMark" on Feb 16, 2026
Reiers (Contributor, Author) commented on Feb 16, 2026

Before:

[Screenshot 2026-02-16 at 18:29:53]

After:

[Screenshot 2026-02-16 at 19:48:22]

Reiers requested review from LexLuthr and magik6k and removed the request for magik6k on February 16, 2026 at 19:08
LexLuthr (Contributor) commented

There is a case where a user might be running an incorrect build for the wrong network, or might have forgotten to upgrade at a network upgrade. We will see similar errors there as well. How do we safeguard against that?

Reiers (Contributor, Author) commented on Feb 18, 2026

Good point — pushed a safeguard for this.

Before marking any sectors as orphaned, we now cross-check the miner address against all harmony_config layers. The logic is:

  • Miner not found on-chain + IS in config → Task fails loudly: "miner f0X is configured but not found on-chain — possible network mismatch, refusing to GC". Nothing gets marked. Operator sees the error and investigates.
  • Miner not found on-chain + NOT in config → Sectors treated as orphaned, marked for GC (still requires manual approval before sweep deletes).

So a wrong-network build or missed upgrade would hit the first case — the miner is still in config, so GC refuses to touch it. The orphaned path only fires when the miner is genuinely gone from both chain and config.

Combined with the existing two-phase approve/sweep system, that's two layers of protection against accidental deletion.
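
A minimal sketch of that safeguard, assuming the harmony_config table exposes title and config columns holding each layer's raw TOML; the actual Curio implementation, schema access, and error wording may differ.

```go
package gcsketch

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// minerInAnyConfigLayer reports whether minerAddr (e.g. "t0138097") appears in
// any harmony_config layer. A simple substring scan over the raw TOML is enough
// for a safety check: a false positive only makes GC more conservative.
func minerInAnyConfigLayer(ctx context.Context, db *sql.DB, minerAddr string) (bool, error) {
	rows, err := db.QueryContext(ctx, `SELECT title, config FROM harmony_config`)
	if err != nil {
		return false, fmt.Errorf("reading harmony_config: %w", err)
	}
	defer rows.Close()

	for rows.Next() {
		var title, cfg string
		if err := rows.Scan(&title, &cfg); err != nil {
			return false, err
		}
		if strings.Contains(cfg, minerAddr) {
			return true, nil
		}
	}
	return false, rows.Err()
}

// gcOrphanCheck encodes the decision described above: refuse to GC when the
// miner is still configured, otherwise let the orphan path proceed (which
// still requires manual approval before the sweep deletes anything).
func gcOrphanCheck(ctx context.Context, db *sql.DB, minerAddr string) error {
	inConfig, err := minerInAnyConfigLayer(ctx, db, minerAddr)
	if err != nil {
		return err
	}
	if inConfig {
		// Wording follows the message described in the comment above.
		return fmt.Errorf("miner %s is configured but not found on-chain; possible network mismatch, refusing to GC", minerAddr)
	}
	return nil // not on-chain and not configured: treat sectors as orphaned
}
```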

What do you think — does this address the concern?

Commits

When a miner actor no longer exists on-chain (e.g. removed from config
after deletion), StorageGCMark would fail with 'actor not found' and
enter a permanent retry loop, blocking all storage GC.

Handle the actor-not-found case gracefully in both Stage 1 (sector
liveness check) and Stage 3 (snap sector-key cleanup):

- Stage 1: Skip loading miner state for deleted actors. Their sectors
  remain in the toRemove set since there are no on-chain precommits,
  live, or unproven sectors to subtract.
- Stage 3: Skip finality-tipset actor lookups for deleted miners.
  Snap sector-key cleanup is irrelevant for non-existent miners.

Only the specific 'actor not found' error triggers this path. Transient
RPC errors (timeouts, connection issues) still fail the task as before,
preventing accidental GC of sectors for healthy miners during network
disruptions.

Fixes a scenario where removing a calibration/test miner from config
causes StorageGCMark to fail 100% of runs indefinitely.

The continue skipped toRemove.Set() for the first sector discovered
for a dead miner. Subsequent sectors were fine since they bypassed
the if-block. Now explicitly set the first sector before continuing.
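
A compact sketch of the loop shape this commit describes; sectorFile, markAll, and the callback parameters are placeholders rather than the actual code.

```go
package gcsketch

// sectorFile is a placeholder for a sector file discovered in storage paths.
type sectorFile struct {
	Miner  uint64
	Number uint64
}

// markAll visits sector files in storage-path order; the liveness lookup runs
// only the first time each miner ID is encountered.
func markAll(sectors []sectorFile, minerGone func(miner uint64) bool, mark func(sectorFile)) {
	seen := map[uint64]bool{}
	for _, sec := range sectors {
		if !seen[sec.Miner] {
			seen[sec.Miner] = true
			if minerGone(sec.Miner) {
				// Fix: mark this first sector explicitly before continuing.
				// Previously the bare continue skipped the mark for it, while
				// later sectors of the same miner bypassed this branch and
				// were marked normally below.
				mark(sec)
				continue
			}
		}
		// (in the real task, the loaded miner state is also used around here)
		mark(sec)
	}
}
```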

Lex raised a valid concern: 'actor not found' can also happen when a node
is built for the wrong network or missed a network upgrade. In that case,
the miner is healthy but the node can't see it.

Added a cross-check: before treating an 'actor not found' miner as deleted,
verify it doesn't appear in any harmony_config layer. If it does, the error
is likely a misconfiguration — fail the task loudly instead of marking
sectors for GC.

The orphaned-sector GC path now only triggers when:
1. StateGetActor returns 'actor not found', AND
2. The miner address is NOT in any config layer

This prevents accidental GC of sectors for healthy miners that appear
missing due to wrong-network or upgrade issues.
Reiers force-pushed the fix/gc-mark-skip-deleted-miners branch from ee52494 to 7dd195e on February 18, 2026 at 13:57