Skip to content

[feature:gsoc26] Add persistent failure-handler branch with exponential backoff retry scheduling #422

Description

@Eeshu-Yadav

Is your feature request related to a problem? Please describe.

Today _recoverable_failure_handler on AbstractUpgradeOperation has exactly two branches: re-raise the exception (Celery auto-retries up to 4 times via RETRY_OPTIONS in settings.py, ~10 minutes of attempts total) or set status="failed" and walk away. There's no middle ground - "device is offline; preserve the operation and try again later" has nowhere to live. Ten minutes is plenty for a flaky link or transient SSH timeout, but useless for a warehouse device that's off overnight.

What I need to add is a third branch: when the handler is called and the operation has is_is_persistent=True, send it to pending instead of failed, schedule a retry with randomized exponential backoff (capped at 12 hours), and let sub-issue 04's Beat task pick it back up later.

Describe the solution I would implement

I would like to wire the new fields from sub-issue 01 into the failure handler so persistent operations land in pending instead of failed.

  1. Insert a new branch in _recoverable_failure_handler (base/models.py:896–904), right before the existing terminal status="failed" line: when recoverable=False, self.is_persistent=True, AND isinstance(error, RecoverableFailure), set status="pending", bump retry_count, compute next_retry_at via _calculate_next_retry(), log the scheduled retry via log_line(..., save=False) (matching the existing terminal branch - the caller's save() flushes everything), and return.

  2. The isinstance(error, RecoverableFailure) guard is defense-in-depth. Today the handler is only invoked from two places in upgrade() (base/models.py:930 after exhausting all DeviceConnections, and :979 from the except RecoverableFailure arm), both already passing a RecoverableFailure. The other failure types in upgrade() (ReconnectionFailed, UpgradeAborted, UpgradeCancelled, and the bare Exception arm at :982) set their terminal status directly and never reach the handler. The isinstance check makes that intent explicit - if a future change introduces a third call site passing a different exception type, persistent retry still won't fire for it.

  3. Add a _calculate_next_retry() method on AbstractUpgradeOperation. Since bullet 1 increments retry_count first, the math is:

    delay = min(base * (multiplier ** (self.retry_count - 1)), max_delay)
    jittered = delay * random.uniform(1 - jitter, 1 + jitter)
    return timezone.now() + timedelta(seconds=jittered)

    With retry_count=1 and jitter=0, the first delay is base (10 minutes by default).

  4. Four configurable settings via the existing getattr(settings, "OPENWISP_FIRMWARE_UPGRADER_*", default) pattern:

    Setting Default Purpose
    ..._PERSISTENT_RETRY_BASE_DELAY 600 (10 min) First retry delay
    ..._PERSISTENT_RETRY_MULTIPLIER 2 Per-retry growth factor
    ..._PERSISTENT_RETRY_JITTER 0.25 ±25% randomization to keep fleet retries from synchronizing
    ..._PERSISTENT_RETRY_MAX_DELAY 43200 (12h) Per-retry cap
  5. Sequence without jitter: ~10m → ~20m → ~40m → ~80m → ~2.7h → ~5.3h → ~10.7h → 12h, then 12h thereafter until the admin cancels or the device gets deactivated (sub-issue 04's retry_pending_upgrade checks is_deactivated() and flips to failed).

  6. The two save() call sites in upgrade() (base/models.py:939 after the connection-failure handler, and the final save() at :1001 after the RecoverableFailure arm) are bare self.save() with no update_fields= restriction, so the new persistence fields land in the same write as status.

  7. Unit tests covering:

    • Existing recoverable path unchanged (regression).
    • Persistent + handler called with RecoverableFailurestatus="pending", retry_count bumped, next_retry_at in the future.
    • Persistent + handler called with a non-RecoverableFailure exception → status="failed" (validates the isinstance guard).
    • Non-persistent + handler called → status="failed" (regression).
    • Other failure paths in upgrade() (ReconnectionFailed, UpgradeAborted, UpgradeCancelled, bare Exception) still hit their terminal status even when the op has is_is_persistent=True - confirms call-site routing keeps them out of the handler.
    • Backoff math: delays land in [base * 2^(n-1) * (1-jitter), base * 2^(n-1) * (1+jitter)] for retry_count 1–7; cap kicks in at retry_count=8.
    • All four settings overrides honoured.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestgsoc-ideaIssues part of Google Summer of Code project

Type

No fields configured for Task.

Projects

Status
To do general
Status
In progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions