[feature:gsoc26] Add persistent failure-handler branch with exponential backoff retry scheduling

**Is your feature request related to a problem? Please describe.**

Today `_recoverable_failure_handler` on `AbstractUpgradeOperation` has exactly two branches: re-raise the exception (Celery auto-retries up to 4 times via `RETRY_OPTIONS` in `settings.py`, ~10 minutes of attempts total) or set `status="failed"` and walk away. There's no middle ground - "device is offline; preserve the operation and try again later" has nowhere to live. Ten minutes is plenty for a flaky link or transient SSH timeout, but useless for a warehouse device that's off overnight.

What I need to add is a third branch: when the handler is called and the operation has `is_is_persistent=True`, send it to `pending` instead of `failed`, schedule a retry with randomized exponential backoff (capped at 12 hours), and let sub-issue 04's Beat task pick it back up later.

**Describe the solution I would implement**

I would like to wire the new fields from sub-issue 01 into the failure handler so persistent operations land in `pending` instead of `failed`.

1. Insert a new branch in `_recoverable_failure_handler` (`base/models.py:896–904`), right before the existing terminal `status="failed"` line: when `recoverable=False`, `self.is_persistent=True`, AND `isinstance(error, RecoverableFailure)`, set `status="pending"`, bump `retry_count`, compute `next_retry_at` via `_calculate_next_retry()`, log the scheduled retry via `log_line(..., save=False)` (matching the existing terminal branch - the caller's `save()` flushes everything), and return.

2. The `isinstance(error, RecoverableFailure)` guard is defense-in-depth. Today the handler is only invoked from two places in `upgrade()` (`base/models.py:930` after exhausting all `DeviceConnection`s, and `:979` from the `except RecoverableFailure` arm), both already passing a `RecoverableFailure`. The other failure types in `upgrade()` (`ReconnectionFailed`, `UpgradeAborted`, `UpgradeCancelled`, and the bare `Exception` arm at `:982`) set their terminal status directly and never reach the handler. The isinstance check makes that intent explicit - if a future change introduces a third call site passing a different exception type, persistent retry still won't fire for it.

3. Add a `_calculate_next_retry()` method on `AbstractUpgradeOperation`. Since bullet 1 increments `retry_count` first, the math is:

   ```python
   delay = min(base * (multiplier ** (self.retry_count - 1)), max_delay)
   jittered = delay * random.uniform(1 - jitter, 1 + jitter)
   return timezone.now() + timedelta(seconds=jittered)
   ```

   With `retry_count=1` and `jitter=0`, the first delay is `base` (10 minutes by default).

4. Four configurable settings via the existing `getattr(settings, "OPENWISP_FIRMWARE_UPGRADER_*", default)` pattern:

   | Setting | Default | Purpose |
   |---|---|---|
   | `..._PERSISTENT_RETRY_BASE_DELAY` | 600 (10 min) | First retry delay |
   | `..._PERSISTENT_RETRY_MULTIPLIER` | 2 | Per-retry growth factor |
   | `..._PERSISTENT_RETRY_JITTER` | 0.25 | ±25% randomization to keep fleet retries from synchronizing |
   | `..._PERSISTENT_RETRY_MAX_DELAY` | 43200 (12h) | Per-retry cap |

5. Sequence without jitter: ~10m → ~20m → ~40m → ~80m → ~2.7h → ~5.3h → ~10.7h → 12h, then 12h thereafter until the admin cancels or the device gets deactivated (sub-issue 04's `retry_pending_upgrade` checks `is_deactivated()` and flips to `failed`).

6. The two `save()` call sites in `upgrade()` (`base/models.py:939` after the connection-failure handler, and the final `save()` at `:1001` after the `RecoverableFailure` arm) are bare `self.save()` with no `update_fields=` restriction, so the new persistence fields land in the same write as `status`.

7. Unit tests covering:
   - Existing recoverable path unchanged (regression).
   - Persistent + handler called with `RecoverableFailure` → `status="pending"`, `retry_count` bumped, `next_retry_at` in the future.
   - Persistent + handler called with a non-`RecoverableFailure` exception → `status="failed"` (validates the isinstance guard).
   - Non-persistent + handler called → `status="failed"` (regression).
   - Other failure paths in `upgrade()` (`ReconnectionFailed`, `UpgradeAborted`, `UpgradeCancelled`, bare `Exception`) still hit their terminal status even when the op has `is_is_persistent=True` - confirms call-site routing keeps them out of the handler.
   - Backoff math: delays land in `[base * 2^(n-1) * (1-jitter), base * 2^(n-1) * (1+jitter)]` for `retry_count` 1–7; cap kicks in at `retry_count=8`.
   - All four settings overrides honoured.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[feature:gsoc26] Add persistent failure-handler branch with exponential backoff retry scheduling #422

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Setting	Default	Purpose
`..._PERSISTENT_RETRY_BASE_DELAY`	600 (10 min)	First retry delay
`..._PERSISTENT_RETRY_MULTIPLIER`	2	Per-retry growth factor
`..._PERSISTENT_RETRY_JITTER`	0.25	±25% randomization to keep fleet retries from synchronizing
`..._PERSISTENT_RETRY_MAX_DELAY`	43200 (12h)	Per-retry cap

Uh oh!

[feature:gsoc26] Add persistent failure-handler branch with exponential backoff retry scheduling #422

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions