Is your feature request related to a problem? Please describe.
Today _recoverable_failure_handler on AbstractUpgradeOperation has exactly two branches: re-raise the exception (Celery auto-retries up to 4 times via RETRY_OPTIONS in settings.py, ~10 minutes of attempts total) or set status="failed" and walk away. There's no middle ground - "device is offline; preserve the operation and try again later" has nowhere to live. Ten minutes is plenty for a flaky link or transient SSH timeout, but useless for a warehouse device that's off overnight.
What I need to add is a third branch: when the handler is called and the operation has is_is_persistent=True, send it to pending instead of failed, schedule a retry with randomized exponential backoff (capped at 12 hours), and let sub-issue 04's Beat task pick it back up later.
Describe the solution I would implement
I would like to wire the new fields from sub-issue 01 into the failure handler so persistent operations land in pending instead of failed.
-
Insert a new branch in _recoverable_failure_handler (base/models.py:896–904), right before the existing terminal status="failed" line: when recoverable=False, self.is_persistent=True, AND isinstance(error, RecoverableFailure), set status="pending", bump retry_count, compute next_retry_at via _calculate_next_retry(), log the scheduled retry via log_line(..., save=False) (matching the existing terminal branch - the caller's save() flushes everything), and return.
-
The isinstance(error, RecoverableFailure) guard is defense-in-depth. Today the handler is only invoked from two places in upgrade() (base/models.py:930 after exhausting all DeviceConnections, and :979 from the except RecoverableFailure arm), both already passing a RecoverableFailure. The other failure types in upgrade() (ReconnectionFailed, UpgradeAborted, UpgradeCancelled, and the bare Exception arm at :982) set their terminal status directly and never reach the handler. The isinstance check makes that intent explicit - if a future change introduces a third call site passing a different exception type, persistent retry still won't fire for it.
-
Add a _calculate_next_retry() method on AbstractUpgradeOperation. Since bullet 1 increments retry_count first, the math is:
delay = min(base * (multiplier ** (self.retry_count - 1)), max_delay)
jittered = delay * random.uniform(1 - jitter, 1 + jitter)
return timezone.now() + timedelta(seconds=jittered)
With retry_count=1 and jitter=0, the first delay is base (10 minutes by default).
-
Four configurable settings via the existing getattr(settings, "OPENWISP_FIRMWARE_UPGRADER_*", default) pattern:
| Setting |
Default |
Purpose |
..._PERSISTENT_RETRY_BASE_DELAY |
600 (10 min) |
First retry delay |
..._PERSISTENT_RETRY_MULTIPLIER |
2 |
Per-retry growth factor |
..._PERSISTENT_RETRY_JITTER |
0.25 |
±25% randomization to keep fleet retries from synchronizing |
..._PERSISTENT_RETRY_MAX_DELAY |
43200 (12h) |
Per-retry cap |
-
Sequence without jitter: ~10m → ~20m → ~40m → ~80m → ~2.7h → ~5.3h → ~10.7h → 12h, then 12h thereafter until the admin cancels or the device gets deactivated (sub-issue 04's retry_pending_upgrade checks is_deactivated() and flips to failed).
-
The two save() call sites in upgrade() (base/models.py:939 after the connection-failure handler, and the final save() at :1001 after the RecoverableFailure arm) are bare self.save() with no update_fields= restriction, so the new persistence fields land in the same write as status.
-
Unit tests covering:
- Existing recoverable path unchanged (regression).
- Persistent + handler called with
RecoverableFailure → status="pending", retry_count bumped, next_retry_at in the future.
- Persistent + handler called with a non-
RecoverableFailure exception → status="failed" (validates the isinstance guard).
- Non-persistent + handler called →
status="failed" (regression).
- Other failure paths in
upgrade() (ReconnectionFailed, UpgradeAborted, UpgradeCancelled, bare Exception) still hit their terminal status even when the op has is_is_persistent=True - confirms call-site routing keeps them out of the handler.
- Backoff math: delays land in
[base * 2^(n-1) * (1-jitter), base * 2^(n-1) * (1+jitter)] for retry_count 1–7; cap kicks in at retry_count=8.
- All four settings overrides honoured.
Is your feature request related to a problem? Please describe.
Today
_recoverable_failure_handleronAbstractUpgradeOperationhas exactly two branches: re-raise the exception (Celery auto-retries up to 4 times viaRETRY_OPTIONSinsettings.py, ~10 minutes of attempts total) or setstatus="failed"and walk away. There's no middle ground - "device is offline; preserve the operation and try again later" has nowhere to live. Ten minutes is plenty for a flaky link or transient SSH timeout, but useless for a warehouse device that's off overnight.What I need to add is a third branch: when the handler is called and the operation has
is_is_persistent=True, send it topendinginstead offailed, schedule a retry with randomized exponential backoff (capped at 12 hours), and let sub-issue 04's Beat task pick it back up later.Describe the solution I would implement
I would like to wire the new fields from sub-issue 01 into the failure handler so persistent operations land in
pendinginstead offailed.Insert a new branch in
_recoverable_failure_handler(base/models.py:896–904), right before the existing terminalstatus="failed"line: whenrecoverable=False,self.is_persistent=True, ANDisinstance(error, RecoverableFailure), setstatus="pending", bumpretry_count, computenext_retry_atvia_calculate_next_retry(), log the scheduled retry vialog_line(..., save=False)(matching the existing terminal branch - the caller'ssave()flushes everything), and return.The
isinstance(error, RecoverableFailure)guard is defense-in-depth. Today the handler is only invoked from two places inupgrade()(base/models.py:930after exhausting allDeviceConnections, and:979from theexcept RecoverableFailurearm), both already passing aRecoverableFailure. The other failure types inupgrade()(ReconnectionFailed,UpgradeAborted,UpgradeCancelled, and the bareExceptionarm at:982) set their terminal status directly and never reach the handler. The isinstance check makes that intent explicit - if a future change introduces a third call site passing a different exception type, persistent retry still won't fire for it.Add a
_calculate_next_retry()method onAbstractUpgradeOperation. Since bullet 1 incrementsretry_countfirst, the math is:With
retry_count=1andjitter=0, the first delay isbase(10 minutes by default).Four configurable settings via the existing
getattr(settings, "OPENWISP_FIRMWARE_UPGRADER_*", default)pattern:..._PERSISTENT_RETRY_BASE_DELAY..._PERSISTENT_RETRY_MULTIPLIER..._PERSISTENT_RETRY_JITTER..._PERSISTENT_RETRY_MAX_DELAYSequence without jitter: ~10m → ~20m → ~40m → ~80m → ~2.7h → ~5.3h → ~10.7h → 12h, then 12h thereafter until the admin cancels or the device gets deactivated (sub-issue 04's
retry_pending_upgradechecksis_deactivated()and flips tofailed).The two
save()call sites inupgrade()(base/models.py:939after the connection-failure handler, and the finalsave()at:1001after theRecoverableFailurearm) are bareself.save()with noupdate_fields=restriction, so the new persistence fields land in the same write asstatus.Unit tests covering:
RecoverableFailure→status="pending",retry_countbumped,next_retry_atin the future.RecoverableFailureexception →status="failed"(validates the isinstance guard).status="failed"(regression).upgrade()(ReconnectionFailed,UpgradeAborted,UpgradeCancelled, bareException) still hit their terminal status even when the op hasis_is_persistent=True- confirms call-site routing keeps them out of the handler.[base * 2^(n-1) * (1-jitter), base * 2^(n-1) * (1+jitter)]forretry_count1–7; cap kicks in atretry_count=8.