## Detect stalled data-retention counters (no `NextProvingPeriod` fired)

### Problem

The data-retention check relies on subgraph-confirmed totals (`totalProvingPeriods`, `totalFaultedPeriods`) to track provider performance. These totals only update when the `NextProvingPeriod` event fires on-chain, which requires a transaction that advances the proving period.

If an SP stops interacting with the PDP contract entirely (goes offline, abandons a proof set, or simply never submits proofs), `NextProvingPeriod` never fires and the subgraph totals stop changing. The provider's metrics in dealbot don't disappear; they just freeze. Whatever fault rate the provider had when it went silent remains its reported rate indefinitely (e.g. a provider at 0% faults stays at 0%, one at 10% stays at 10%). Meanwhile, its *actual* fault rate climbs toward 100%, since every missed period is effectively a fault.

When `NextProvingPeriod` eventually fires again, the subgraph handler catches up and records all skipped periods as faults. But if the SP is permanently gone, that catch-up never happens.

### Context

PR #365 removed a previous attempt at handling this (overdue period estimation) because it systematically inflated fault rates. The estimation speculatively counted overdue periods as faults, but when the subgraph later confirmed them as successes, the corrections were discarded by the negative-delta guard, permanently baking phantom faults into Prometheus counters.

The fix was correct, but it leaves this blind spot unaddressed.
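
To make the failure mode concrete, here is a simplified model of how a monotonic counter with a negative-delta guard can retain speculative faults. This is a sketch only: `FaultCounterModel` and its fields are hypothetical names, and dealbot's real code may be structured differently.

```typescript
// Simplified model of a monotonic, Prometheus-style counter fed from totals.
// Hypothetical names; this is not dealbot's actual implementation.
class FaultCounterModel {
  private lastSeenTotal = 0; // last faulted-period total that was applied
  counter = 0; // exported counter value, which can only go up

  // Apply a new total. Negative deltas are dropped (the "negative-delta guard").
  observe(total: number): void {
    const delta = total - this.lastSeenTotal;
    if (delta > 0) {
      this.counter += delta;
      this.lastSeenTotal = total;
    }
  }
}

const model = new FaultCounterModel();
model.observe(2); // subgraph confirms 2 faults
model.observe(2 + 3); // estimation adds 3 overdue periods as speculative faults
model.observe(2); // subgraph later confirms those 3 periods were successes
console.log(model.counter); // 5: the 3 phantom faults are never removed
```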

### Proposed approaches

The key constraint is that we can't touch the confirmed fault/success counters (that's what broke things before). Both approaches below keep those counters clean.

#### Option A: Staleness alert

The simpler option: track when a provider's subgraph totals last changed and emit a gauge or alert when they've been stale for too long (e.g. no `NextProvingPeriod` in a day or a week).

- Emit a gauge metric (e.g. `pdp_provider_periods_since_last_update`) or alert when a provider has had no activity for longer than some threshold
- Simple to implement, but the dashboard still shows the frozen (stale) fault rate
- Someone has to manually interpret that the staleness alert + frozen rate = provider is actually failing
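
A minimal sketch of the staleness tracking, assuming the check polls each provider's totals periodically. `StalenessTracker`, `ProviderTotals`, and the choice of seconds as the unit are illustrative assumptions, not existing dealbot code; the totals are typed as `number` here for simplicity.

```typescript
// Hypothetical shape; field names mirror the subgraph totals named above.
interface ProviderTotals {
  totalProvingPeriods: number;
  totalFaultedPeriods: number;
}

interface StalenessState {
  totals: ProviderTotals;
  lastChangedAtMs: number; // wall-clock time when the totals last changed
}

class StalenessTracker {
  private readonly state = new Map<string, StalenessState>();

  // Record one poll result and return seconds since the totals last changed.
  observe(providerId: string, totals: ProviderTotals, nowMs: number): number {
    const prev = this.state.get(providerId);
    if (
      prev !== undefined &&
      prev.totals.totalProvingPeriods === totals.totalProvingPeriods &&
      prev.totals.totalFaultedPeriods === totals.totalFaultedPeriods
    ) {
      return (nowMs - prev.lastChangedAtMs) / 1000;
    }
    // First observation, or the totals moved: reset the staleness clock.
    this.state.set(providerId, { totals, lastChangedAtMs: nowMs });
    return 0;
  }
}

const tracker = new StalenessTracker();
const frozen = { totalProvingPeriods: 100, totalFaultedPeriods: 10 };
tracker.observe("sp-1", frozen, 0);
const staleSeconds = tracker.observe("sp-1", frozen, 86_400_000); // one day later
console.log(staleSeconds); // 86400
```

The returned value could be set on a gauge and compared against an alert threshold. Note that this state lives in memory, so a restart resets the clock unless the last-changed time is persisted.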

#### Option B: Separate overdue gauge (preferred)

Re-add `nextDeadline` and `maxProvingPeriod` to the subgraph query (we had these before #365 removed them) and compute overdue periods:

```
overduePeriods = floor((currentBlock - nextDeadline) / maxProvingPeriod)
```

Instead of adding them to the fault counters (which is what broke things before), emit them as a **separate gauge** (e.g. `pdp_provider_overdue_periods`).
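
The overdue computation could look like the sketch below. The function name and the use of plain `number` block heights are assumptions for illustration; the clamp to 0 covers providers whose deadline has not passed yet, which the bare formula would otherwise report as negative.

```typescript
// Estimate how many full proving periods have elapsed past the deadline.
// All arguments are block heights or block counts.
function overduePeriods(
  currentBlock: number,
  nextDeadline: number,
  maxProvingPeriod: number,
): number {
  if (maxProvingPeriod <= 0) {
    throw new RangeError("maxProvingPeriod must be positive");
  }
  if (currentBlock <= nextDeadline) {
    return 0; // deadline not reached yet, nothing is overdue
  }
  return Math.floor((currentBlock - nextDeadline) / maxProvingPeriod);
}

console.log(overduePeriods(1000, 1200, 240)); // 0 (deadline still ahead)
console.log(overduePeriods(1100, 1000, 240)); // 0 (less than one full period late)
console.log(overduePeriods(2000, 1000, 240)); // 4 (1000 blocks late, 240 per period)
```

As written, the formula reports 0 until a full `maxProvingPeriod` has elapsed past `nextDeadline`.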

- The confirmed counters stay accurate (only subgraph-confirmed data)
- The overdue gauge shows estimated unrecorded faults in real time
- Dashboards can combine them for an effective fault rate: `(confirmed_faults + overdue_periods) / (confirmed_total + overdue_periods)`
- The gauge naturally resets to 0 when `NextProvingPeriod` finally fires and the subgraph catches up. There is no delta corruption because gauges can go up and down

This gives us the "correct" fault rate automatically without anyone needing to interpret an alert.
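
As a worked example of the combined rate, the sketch below applies the formula from the list above to a provider that went silent. `effectiveFaultRate` is a hypothetical helper; in practice the same expression would more likely live in a dashboard query.

```typescript
// Effective fault rate: confirmed faults plus estimated overdue periods,
// over confirmed periods plus estimated overdue periods.
function effectiveFaultRate(
  confirmedFaults: number,
  confirmedTotal: number,
  overdue: number,
): number {
  const denominator = confirmedTotal + overdue;
  if (denominator === 0) {
    return 0; // no periods observed yet
  }
  return (confirmedFaults + overdue) / denominator;
}

// Provider with 10 confirmed faults out of 100 periods (frozen at 10%).
console.log(effectiveFaultRate(10, 100, 0)); // 0.1 while still active
console.log(effectiveFaultRate(10, 100, 50)); // 0.4 after 50 overdue periods
// Provider frozen at 0% that has been silent for as long as it was active.
console.log(effectiveFaultRate(0, 100, 100)); // 0.5
```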

### References

- #365 removed overdue estimation
- [Comment raising this concern](https://github.com/FilOzone/dealbot/pull/365#issuecomment-4080070232)
- [PDP subgraph `NextProvingPeriod` handler](https://github.com/FilOzone/pdp-explorer/blob/main/subgraph/src/pdp-verifier.ts)

