Description
Detect stalled data-retention counters (no NextProvingPeriod fired)
Problem
The data-retention check relies on subgraph-confirmed totals (totalProvingPeriods, totalFaultedPeriods) to track provider performance. These totals only update when the NextProvingPeriod event fires on-chain, which requires a transaction advancing the proving period.
If an SP stops interacting with the PDP contract entirely (goes offline, abandons a proof set, or simply never submits proofs), NextProvingPeriod never fires and the subgraph stays frozen. The provider's metrics in dealbot don't disappear; they just freeze. Whatever fault rate they had at the time they went silent stays as their reported rate indefinitely (e.g. a provider at 0% faults stays at 0%, one at 10% stays at 10%). Meanwhile, their actual fault rate is climbing toward 100%, since every missed period is effectively a fault.
When NextProvingPeriod eventually fires again, the subgraph handler catches up and records all skipped periods as faults. But if the SP is permanently gone, that catch-up never happens.
Context
PR #365 removed a previous attempt at handling this (overdue period estimation) because it systematically inflated fault rates. The estimation speculatively counted overdue periods as faults, but when the subgraph later confirmed them as successes, the corrections were discarded by the negative-delta guard, permanently baking phantom faults into Prometheus counters.
The fix was correct, but it leaves this blind spot unaddressed.
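To make the failure mode concrete, here is a toy model of how a counter fed by deltas can keep phantom faults once a speculative estimate is later walked back. This is only a sketch: GuardedCounter and simulatePhantomFaults are hypothetical names, and the guard logic is an assumption about the general pattern, not a copy of dealbot's code.

```typescript
// Toy model of a monotonic counter that is advanced by deltas between
// successive observed totals. Hypothetical; not dealbot's implementation.
class GuardedCounter {
  private value = 0;
  private lastSeen = 0;

  // Feed the latest observed total. Only positive deltas are applied,
  // because a Prometheus-style counter can never decrease.
  observe(total: number): void {
    const delta = total - this.lastSeen;
    if (delta > 0) {
      this.value += delta;
    }
    // A negative delta is silently discarded (the "negative-delta guard").
    this.lastSeen = total;
  }

  get(): number {
    return this.value;
  }
}

function simulatePhantomFaults(): number {
  const faults = new GuardedCounter();
  const confirmedFaults = 2;

  faults.observe(confirmedFaults); // subgraph-confirmed faults only
  faults.observe(confirmedFaults + 3); // add 3 estimated overdue periods
  faults.observe(confirmedFaults); // subgraph confirms them as successes

  // The correction (delta of -3) was dropped, so the estimate is baked in.
  return faults.get();
}
```

In this model the counter ends at 5 even though only 2 faults were ever confirmed, which is the kind of inflation described above.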
Proposed approaches
Either way, the key constraint is that we can't touch the confirmed fault/success counters (that's what broke things before). Both approaches below keep those counters clean.
Option A: Staleness alert
The simpler option: track when a provider's subgraph totals last changed and emit a gauge or alert when they've been stale for too long (e.g. no NextProvingPeriod in a day or a week).
- Emit a gauge metric (e.g. pdp_provider_periods_since_last_update) or alert when a provider hasn't had activity beyond some threshold
- Simple to implement, but the dashboard still shows the frozen (stale) fault rate
- Someone has to manually interpret that the staleness alert + frozen rate = provider is actually failing
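A minimal sketch of the staleness tracking Option A describes, assuming totals are polled periodically per provider. StalenessTracker, ProviderTotals, and the choice to report staleness in seconds are illustrative assumptions, not existing dealbot code; the returned value is what would be written to the gauge or compared against an alert threshold.

```typescript
// Subgraph-confirmed totals for one provider (field names from the issue).
interface ProviderTotals {
  totalProvingPeriods: number;
  totalFaultedPeriods: number;
}

interface StalenessEntry {
  totals: ProviderTotals;
  lastChangedAtMs: number;
}

function sameTotals(a: ProviderTotals, b: ProviderTotals): boolean {
  return (
    a.totalProvingPeriods === b.totalProvingPeriods &&
    a.totalFaultedPeriods === b.totalFaultedPeriods
  );
}

// Hypothetical helper: remembers when each provider's totals last changed.
class StalenessTracker {
  private entries = new Map<string, StalenessEntry>();

  // Record one poll result and return how many seconds the totals have
  // been unchanged (0 on first sight or whenever they moved).
  record(providerId: string, totals: ProviderTotals, nowMs: number): number {
    const prev = this.entries.get(providerId);
    if (prev !== undefined && sameTotals(prev.totals, totals)) {
      return (nowMs - prev.lastChangedAtMs) / 1000;
    }
    this.entries.set(providerId, { totals: { ...totals }, lastChangedAtMs: nowMs });
    return 0;
  }
}

// Threshold check, e.g. thresholdSeconds = 86400 for "stale for a day".
function isStale(staleSeconds: number, thresholdSeconds: number): boolean {
  return staleSeconds > thresholdSeconds;
}
```

Note that the tracker only observes the confirmed totals; it never writes to the fault/success counters, so it stays within the constraint stated above.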
Option B: Separate overdue gauge (preferred)
Re-add nextDeadline and maxProvingPeriod to the subgraph query (we had these before #365 removed them) and compute overdue periods:
overduePeriods = floor((currentBlock - nextDeadline) / maxProvingPeriod)
But instead of adding them to the fault counters (which is what broke things before), emit them as a separate gauge (e.g. pdp_provider_overdue_periods).
- The confirmed counters stay accurate (only subgraph-confirmed data)
- The overdue gauge shows estimated unrecorded faults in real time
- Dashboards can combine them for an effective fault rate:
  (confirmed_faults + overdue_periods) / (confirmed_total + overdue_periods)
- The gauge naturally resets to 0 when NextProvingPeriod finally fires and the subgraph catches up; no delta corruption because gauges can go up and down
This gives us the "correct" fault rate automatically without anyone needing to interpret an alert.
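The two formulas in Option B can be sketched as plain functions. The function names, the clamping to 0 before the deadline, and the zero-denominator handling are assumptions added for illustration; the arithmetic itself follows the overduePeriods and effective-fault-rate expressions above.

```typescript
// Estimated number of proving periods missed since nextDeadline.
// Clamped to 0 when the deadline has not passed yet (assumption: a gauge
// should not go negative), and when maxProvingPeriod is not positive.
function overduePeriods(
  currentBlock: number,
  nextDeadline: number,
  maxProvingPeriod: number
): number {
  if (maxProvingPeriod <= 0 || currentBlock <= nextDeadline) {
    return 0;
  }
  return Math.floor((currentBlock - nextDeadline) / maxProvingPeriod);
}

// Effective fault rate combining confirmed counters with the overdue gauge.
// Returns 0 when there is no data at all, to avoid dividing by zero.
function effectiveFaultRate(
  confirmedFaults: number,
  confirmedTotal: number,
  overdue: number
): number {
  const denominator = confirmedTotal + overdue;
  if (denominator === 0) {
    return 0;
  }
  return (confirmedFaults + overdue) / denominator;
}
```

For example, a provider frozen at 10 faults out of 100 confirmed periods that is then 100 periods overdue would show an effective rate of 110 / 200 = 55%, while the confirmed counters still report 10%. Once the subgraph catches up, the overdue value drops back to 0 and the confirmed counters carry those faults instead.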