pdpv0: re-enable DELETE in processPendingCleanup to stop RPC flood #1033

Merged
ZenGround0 merged 1 commit into filecoin-project:pdpv0 from TippyFlitsUK:tippy/pdpv0-deletion-flood-fix
Feb 19, 2026

@TippyFlitsUK
Contributor

Problem

Commit #947 fixed a lo.Contains pointer comparison bug in processPendingPieceDeletes.
With that fix in place, pieces are correctly marked removed=TRUE for the first time. On
calibration nodes with ~2,900 such pieces, this had an unintended side effect:

processPendingCleanup runs on every Filecoin block (~30s) and calls verifier.PieceLive()
for every piece where removed=TRUE. The DELETE that would remove confirmed-dead pieces from
the table was commented out (// XXX(Kubuxu): commented out as this has lead to proving failures),
so the list never shrinks.

Result: ~2,900 eth_call RPC requests to Lotus per block, continuous and permanent.

Impact:

  • Lotus overwhelmed with FVM instantiations ("using FVM V1" at 20+ per second)
  • eth_estimateGas calls time out with i/o timeout
  • Proving tasks fail, triggering exponential backoff
  • After 5 consecutive failures, datasets marked unrecoverable_proving_failure_epoch
  • 50+ datasets per hour becoming unrecoverable on calibration nodes

Mainnet unaffected (only 67 removed=TRUE pieces vs ~2,900 on calibration).
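
To put the two piece counts side by side, here is a rough back-of-the-envelope sketch. It assumes exactly one eth_call per removed=TRUE piece per block and a flat 30-second block time, and it averages the calls over the whole block interval, whereas the real calls are issued in a burst. The helper name callsPerSecond is made up for this illustration and is not part of the codebase.

```go
package main

import "fmt"

// callsPerSecond is a hypothetical helper: it returns the average RPC rate
// when every tracked piece triggers one call per block.
func callsPerSecond(pieces int, blockSeconds float64) float64 {
	return float64(pieces) / blockSeconds
}

func main() {
	// Piece counts are the figures quoted in this PR; 30s is the approximate block time.
	fmt.Printf("calibration: %.1f calls/s\n", callsPerSecond(2900, 30))
	fmt.Printf("mainnet:     %.1f calls/s\n", callsPerSecond(67, 30))
}
```

Under these assumptions calibration averages roughly 97 calls per second against roughly 2 on mainnet, which is consistent with only calibration nodes being overwhelmed.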

Fix

Re-enable the DELETE with an AND removed=TRUE guard. processPendingCleanup already
calls PieceLive() immediately before the DELETE to confirm the piece is gone on-chain,
making this safe. Once the backlog is cleared (one-time cost on first run), the function
has zero rows to process and makes zero RPC calls per block.
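
The drain behaviour described above can be illustrated with a small in-memory model. This is only a sketch and not the actual Curio code: the piece type, the cleanup function, and the liveOnChain callback (standing in for verifier.PieceLive()) are hypothetical, and keeping a row when the piece is still live is an assumption about the real function.

```go
package main

import "fmt"

// piece is a hypothetical in-memory stand-in for a row in pdp_data_set_pieces.
type piece struct {
	id      int
	removed bool
}

// cleanup models one per-block pass. liveOnChain stands in for
// verifier.PieceLive(); deleteEnabled toggles the re-enabled DELETE.
// It returns the surviving rows and the number of liveness checks made.
func cleanup(rows []piece, liveOnChain func(id int) bool, deleteEnabled bool) ([]piece, int) {
	kept := make([]piece, 0, len(rows))
	calls := 0
	for _, p := range rows {
		if !p.removed {
			kept = append(kept, p) // not pending cleanup: no RPC, never deleted
			continue
		}
		calls++ // one liveness check (eth_call) per removed=TRUE row
		if liveOnChain(p.id) {
			kept = append(kept, p) // assumed: still live on-chain, so keep the row
			continue
		}
		if deleteEnabled {
			continue // models the DELETE guarded by AND removed=TRUE
		}
		kept = append(kept, p) // DELETE commented out: row is re-checked every block
	}
	return kept, calls
}

func main() {
	gone := func(int) bool { return false } // assume every removed piece is confirmed dead
	for _, enabled := range []bool{false, true} {
		rows := []piece{{1, true}, {2, true}, {3, false}, {4, true}}
		for block := 1; block <= 3; block++ {
			var calls int
			rows, calls = cleanup(rows, gone, enabled)
			fmt.Printf("delete=%v block=%d calls=%d rows=%d\n", enabled, block, calls, len(rows))
		}
	}
}
```

With the DELETE disabled, the model makes the same three checks on every block. With it enabled, the three checks happen once, the confirmed-dead rows are dropped, and later blocks make no checks; the row with removed=FALSE is never touched in either case.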

processPendingCleanup runs on every Filecoin block and calls PieceLive()
for every piece with removed=TRUE in pdp_data_set_pieces. The DELETE that
would remove confirmed-dead pieces from the table was commented out, causing
the list to grow without bound and flood Lotus with EthCalls.

After the b7a8796 fix (lo.Contains pointer comparison bug), pieces are now
correctly marked removed=TRUE for the first time. On calibration nodes with
~2,900 such pieces, this caused ~2,900 PieceLive() EthCalls per block
(~30s interval), overwhelming Lotus RPC and causing i/o timeouts on all
subsequent eth_estimateGas calls. This in turn caused proving tasks to fail,
and after 5 consecutive failures datasets were marked unrecoverable.

Re-enable the DELETE with an AND removed=TRUE guard for safety. PieceLive()
is called immediately before the DELETE and confirms the piece is gone
on-chain, making this safe. The list drains to zero over a few blocks and
the EthCall flood stops permanently.
@github-actions bot added the team/fs-wg label (Items being worked on or tracked by the "FS Working Group". See FilOzone/github-mgmt #10) on Feb 19, 2026
Collaborator

@ZenGround0 left a comment

Let's do it! Next up we'll root out the proving failures.

@ZenGround0 merged commit 1e6f069 into filecoin-project:pdpv0 on Feb 19, 2026
16 checks passed
@rjan90 added this to the M4.1: mainnet ready milestone on Feb 20, 2026