8e64afe
[feature] Added persistence model fields and batch-to-child propagati…
Eeshu-Yadav May 22, 2026
da23c61
[feature] Added pending status and migration for persistence schema f…
Eeshu-Yadav May 22, 2026
22991e2
[feature] Locked is_persistent flag after launch via clean() guard #417
Eeshu-Yadav May 22, 2026
6228b83
[tests] Added tests for persistence model fields, propagation and imm…
Eeshu-Yadav May 23, 2026
59bb81d
[feature] Added persistent retry branch with exponential backoff to f…
Eeshu-Yadav May 26, 2026
ee09bd2
[tests] Added tests for persistent retry branch and backoff calculati…
Eeshu-Yadav May 26, 2026
972cf25
[change] Wired pending status into batch aggregation, cancellation an…
Eeshu-Yadav May 26, 2026
fadfaf0
[tests] Added mixed success+pending batch tests for pending state coh…
Eeshu-Yadav May 26, 2026
851c6be
[change] Added pending counter, pending_count property and X complete…
Eeshu-Yadav May 26, 2026
ff45bb0
[change] Fixed WebSocket consumer snapshots to handle pending status …
Eeshu-Yadav May 26, 2026
379a5f8
[change] Extended WebSocket batch payload with pending count and X co…
Eeshu-Yadav May 26, 2026
5c959e2
[tests] Added pending_count and WebSocket consumer snapshot regressio…
Eeshu-Yadav May 26, 2026
98dd4b8
[tests] Asserted ValidationError messages and verified next_retry_at …
Eeshu-Yadav May 27, 2026
2f1cf28
[tests] Added override_settings, retry_count=0 guard and non-Recovera…
Eeshu-Yadav May 27, 2026
0303cb2
[tests] Strengthened concurrent-guard test and added WebSocket push-p…
Eeshu-Yadav May 27, 2026
b5e6d6d
[feature] Added Beat scanner and retry worker for persistent pending …
Eeshu-Yadav May 28, 2026
21c2dc1
[tests] Added tests for persistent retry pipeline #424
Eeshu-Yadav May 28, 2026
30874c5
[change] Extracted is_persistent immutability check and consolidated …
Eeshu-Yadav Jun 1, 2026
f368599
[docs] Documented persistent mass upgrades, pending status and new se…
Eeshu-Yadav Jun 1, 2026
5ebcf4c
[tests] Added Selenium assertion for the pending-branch of the batch …
Eeshu-Yadav Jun 1, 2026
c5ca1d2
[feature] Added monitoring health_status_changed handler for fast per…
Eeshu-Yadav Jun 1, 2026
bef4c28
[fix] Handle deleted-op race in retry_pending_upgrade #424
Eeshu-Yadav Jun 1, 2026
9a15d93
[feature] Added pending-upgrade reminders and failure-needs-attention…
Eeshu-Yadav Jun 1, 2026
85fea73
[feature] Added is_persistent checkbox and pending-state admin surfac…
Eeshu-Yadav Jun 1, 2026
b753630
[fix] Registered dedicated notification types for persistent-upgrade …
Eeshu-Yadav Jun 4, 2026
ee189d6
[fix] Render pending and failed progress bars correctly #423
Eeshu-Yadav Jun 4, 2026
4a32524
[fix] Extend deletion guard in admin to also block pending operations…
Eeshu-Yadav Jun 4, 2026
a8f2845
[tests] Use mock.patch.object in is_persistent tests #417
Eeshu-Yadav Jun 4, 2026
1cd884c
[feature] Added is_persistent, retry_count and next_retry_at to REST …
Eeshu-Yadav Jun 4, 2026
bdd6a17
Merge branch 'gsoc26-persistent-scheduled-upgrades' into issues/417-p…
nemesifier Jun 5, 2026
baeef7c
[fix] Address review feedback on persistent upgrades #379
Eeshu-Yadav Jun 8, 2026
b404531
[tests] Add persistence retry-loop, time_travel helper and Selenium t…
Eeshu-Yadav Jun 9, 2026
191a1c8
[fix] Address review feedback on persistent upgrades #379
Eeshu-Yadav Jun 23, 2026
8c60dc3
[docs] Add Persistent Mass Upgrades page #379
Eeshu-Yadav Jun 23, 2026
1 change: 1 addition & 0 deletions docs/index.rst
@@ -39,6 +39,7 @@ within the OpenWISP architecture.
./user/intro.rst
./user/quickstart.rst
./user/upgrade-status.rst
./user/persistent-mass-upgrades.rst
./user/automatic-device-firmware-detection.rst
./user/custom-firmware-upgrader.rst
./user/rest-api.rst
3 changes: 3 additions & 0 deletions docs/user/intro.rst
@@ -11,6 +11,9 @@ Firmware Upgrader: Features
- Single device upgrade
- Mass upgrades with possibility of filtering by device group and/or
  geographic location
- Persistent mass upgrades that keep retrying offline devices in the
  background until they come online (see :doc:`upgrade-status` and
  :doc:`settings`)
- Possibility to divide firmware images in categories
- :doc:`REST API <rest-api>`
- :doc:`Possibility of writing custom upgraders
119 changes: 119 additions & 0 deletions docs/user/persistent-mass-upgrades.rst
@@ -0,0 +1,119 @@
Persistent Mass Upgrades
========================

When a mass upgrade runs against a large fleet, some devices are usually
offline at that moment. Without persistence, each unreachable device ends
as ``failed`` once the immediate retries are exhausted, leaving the
operator to track down and re-launch every failed device by hand.

A *persistent* mass upgrade does not give up on offline devices. Instead
of marking them ``failed``, it parks them in the ``pending`` state with a
scheduled retry time and keeps retrying in the background until the device
comes back online or the operation is cancelled.

.. contents:: **Table of contents**:
    :depth: 2
    :local:

How it works
------------

An operation whose device is unreachable transitions to ``pending``
instead of ``failed``, with an incremented ``retry_count`` and an
exponential-backoff ``next_retry_at`` (10 minutes, doubling on each retry
up to a 12-hour cap, with ±25% jitter). A periodic Celery Beat task
re-dispatches pending operations once their retry time has elapsed, and
the batch stays ``in-progress`` until every device has either upgraded or
been cancelled.
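
The periodic scan described above can be pictured with a minimal sketch. The ``Operation`` dataclass and the ``due_for_retry`` helper are hypothetical names used only for illustration; the real scanner presumably works on database rows rather than in-memory objects.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Operation:
    """Hypothetical stand-in for an upgrade operation row."""

    device: str
    status: str
    next_retry_at: Optional[datetime] = None


def due_for_retry(operations, now):
    """Return pending operations whose scheduled retry time has elapsed."""
    return [
        op
        for op in operations
        if op.status == "pending"
        and op.next_retry_at is not None
        and op.next_retry_at <= now
    ]


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
operations = [
    Operation("ap-1", "success"),
    Operation("ap-2", "pending", now - timedelta(minutes=5)),  # due
    Operation("ap-3", "pending", now + timedelta(minutes=30)),  # not yet due
]
print([op.device for op in due_for_retry(operations, now)])  # ['ap-2']
```

Only ``ap-2`` is selected: ``ap-1`` is already terminal and ``ap-3`` has a retry time that is still in the future.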

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png

The mass-upgrade page above stays ``in progress`` while one device is
still ``pending``, reporting ``2 complete, 1 pending`` and keeping the
batch open until the offline device is retried successfully or cancelled.

See :doc:`upgrade-status` for the full operation state machine and the
meaning of the ``pending`` state.

Enabling from the admin
-----------------------

On the mass-upgrade confirmation page (reached from a build's *Upgrade*
action) the **persistent** checkbox is shown pre-checked. Leave it checked
to keep retrying offline devices, or uncheck it to fall back to the
behaviour where unreachable devices end as ``failed``.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png

The flag is locked in once the mass upgrade leaves the ``idle`` state, so
it cannot be changed midway through a running batch.

Enabling via the REST API
~~~~~~~~~~~~~~~~~~~~~~~~~

The mass-upgrade endpoint accepts an ``is_persistent`` field that defaults
to ``true``; the single-device upgrade endpoint accepts the same field but
defaults to ``false``. See :doc:`rest-api` for the full request and
response reference.

Finding pending operations
--------------------------

Pending operations are listed in the upgrade-operation admin and can be
isolated with the ``status`` filter set to ``pending``. The list shows the
``persistent`` flag and the ``retry_count`` column, the latter being how
many times an operation has been retried so far.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png

An operation's detail page adds ``next_retry_at`` (when the next attempt
is scheduled) and a log that records each attempt, ending with the
backoff-scheduled ``persistent retry`` line for the next run.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png

Cancelling a pending operation
------------------------------

A pending operation is still active, so it can be cancelled the same way
as an in-progress one — from the admin cancel button or the REST cancel
endpoint. Cancelling stops the retry loop and moves the operation to
``cancelled``. A pending operation cannot be *deleted* until it reaches a
terminal state (see :ref:`deleting_upgrade_operations`).

Notifications
-------------

Two notifications keep operators informed about long-running persistent
upgrades:

- a **reminder** fires when a persistent batch still has pending children
  after the configured cadence has elapsed, and
- a **failure** notification fires when a persistent operation finally
  ends as ``failed`` (for example, the device was deactivated while
  pending).

Both are delivered to the organization's administrators (and superusers).

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png

The cadence and related settings are documented in :doc:`settings`.

Behaviour with and without openwisp-monitoring
----------------------------------------------

Persistent upgrades work with Celery Beat alone: the periodic scan retries
due pending operations on a fixed cadence. Installing
``openwisp-monitoring`` adds a faster wake-up path — a device returning to
a healthy state triggers its pending retries immediately, without waiting
for the next scan. When ``openwisp-monitoring`` is not installed, the Beat
scan remains the only retry trigger.

The periodic tasks (``check_pending_upgrades`` and
``send_pending_upgrade_reminders``) must be present in the deployment's
``CELERY_BEAT_SCHEDULE``; see :doc:`settings`.
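
As a rough illustration, the two periodic tasks might be registered as shown below. The dotted task paths and the intervals are assumptions made for this sketch and are not taken from the project; the deployment recipes remain the authoritative source for the real values.

```python
from datetime import timedelta

# Assumption: the module path "openwisp_firmware_upgrader.tasks" and both
# intervals are illustrative placeholders, not the project's actual values.
CELERY_BEAT_SCHEDULE = {
    "check_pending_upgrades": {
        "task": "openwisp_firmware_upgrader.tasks.check_pending_upgrades",
        "schedule": timedelta(minutes=5),
    },
    "send_pending_upgrade_reminders": {
        "task": "openwisp_firmware_upgrader.tasks.send_pending_upgrade_reminders",
        "schedule": timedelta(hours=1),
    },
}
```

Celery Beat accepts a ``timedelta`` as the ``schedule`` value, so no extra imports from Celery are needed for fixed-interval entries like these.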
11 changes: 9 additions & 2 deletions docs/user/rest-api.rst
@@ -219,6 +219,8 @@ the request body:
  specific group
- ``location`` (Location ID): limit the upgrade to devices at a specific
  geographic location
- ``is_persistent`` (boolean, default ``true``): keep retrying offline
  devices until they come back online or the operation is cancelled

Example with filters:

@@ -318,7 +320,8 @@ The list of upgrade operations provides the following filters:
- ``device__organization_slug`` (Organization slug of the device)
- ``device`` (Device ID)
- ``image`` (Firmware image ID)
-- ``status`` (One of: in-progress, success, failed, aborted, cancelled)
+- ``status`` (One of: in-progress, pending, success, failed, aborted,
+  cancelled)

Here's a few examples:

@@ -359,7 +362,7 @@ List Device Upgrade Operations
**Available filters**

The list of device upgrade operations can be filtered by ``status`` (one
-of: in-progress, success, failed, aborted, cancelled).
+of: in-progress, pending, success, failed, aborted, cancelled).

.. code-block:: text

@@ -375,6 +378,10 @@ firmware if it does not already exist.

PUT /api/v1/firmware-upgrader/device/{device_id}/firmware/

The request body accepts an optional ``is_persistent`` (boolean, default
``false``) flag; when enabled, the resulting upgrade keeps retrying the
device until it comes back online or the operation is cancelled.
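
A hedged sketch of a request against this endpoint follows. It only builds the request object and does not send it. The ``image`` body field, the bearer-token header, and the host name are assumptions made for illustration; ``build_device_upgrade_request`` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_device_upgrade_request(base_url, device_id, image_id, token,
                                 is_persistent=True):
    """Build (but do not send) a PUT request that opts in to persistence."""
    url = f"{base_url}/api/v1/firmware-upgrader/device/{device_id}/firmware/"
    # Assumption: the body carries the firmware image ID in an "image" field.
    body = json.dumps({"image": image_id, "is_persistent": is_persistent})
    return Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication is enabled.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_device_upgrade_request(
    "https://openwisp.example.com", "device-id", "image-id", "secret-token"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Because the single-device endpoint defaults ``is_persistent`` to ``false``, the flag has to be sent explicitly when persistence is wanted.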

Get Device Firmware Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~

73 changes: 69 additions & 4 deletions docs/user/settings.rst
@@ -36,14 +36,14 @@ documentation regarding automatic retries for known errors
``OPENWISP_FIRMWARE_UPGRADER_TASK_TIMEOUT``
-------------------------------------------

-============ =======
+============ ========
**type**:    ``int``
-**default**: ``600``
-============ =======
+**default**: ``1500``
+============ ========

Timeout for the background tasks which perform firmware upgrades.

-If for some unexpected reason an upgrade remains stuck for more than 10
+If for some unexpected reason an upgrade remains stuck for more than 25
minutes, the upgrade operation will be flagged as failed and the task will
be killed.

@@ -54,6 +54,71 @@ the available slots in a background queue and prevent other tasks from
being executed, which will end up affecting negatively the rest of the
application.

``OPENWISP_FIRMWARE_UPGRADER_PERSISTENT_RETRY_OPTIONS``
-------------------------------------------------------

============ =========
**type**: ``dict``
**default**: see below
============ =========

.. code-block:: python

    # default value of OPENWISP_FIRMWARE_UPGRADER_PERSISTENT_RETRY_OPTIONS:

    dict(
        base_delay=600,
        multiplier=2,
        jitter=0.25,
        max_delay=43200,
        dispatch_jitter=300,
        signal_jitter=120,
    )

Backoff settings for persistent retries.

When an upgrade operation has its ``is_persistent`` flag set and the
device is unreachable, the operation transitions to ``pending`` rather
than ``failed``. ``next_retry_at`` is then scheduled using the values in
this dict:

- ``base_delay`` (seconds): delay before the first persistent retry.
- ``multiplier``: exponential factor applied per retry. With the defaults
  the delays grow 10m → 20m → 40m → ...
- ``jitter`` (0–1): random fraction added to or subtracted from each
  delay, so retries for many devices don't all fire at the same instant.
- ``max_delay`` (seconds): upper bound for any single retry delay.
- ``dispatch_jitter`` (seconds): when the Beat scanner fans out a batch of
  due retries, each one is delayed by a random ``[0, dispatch_jitter]``
  interval so the worker isn't slammed all at once.
- ``signal_jitter`` (seconds): same idea as ``dispatch_jitter`` but for
  the openwisp-monitoring ``health_status_changed`` wake-up path: when a
  network outage recovers and many devices come back online together, each
  pending op's retry is delayed by a random ``[0, signal_jitter]``
  interval. Smaller than ``dispatch_jitter`` because the signal wake-up is
  meant to feel fast. Has no effect when ``openwisp-monitoring`` is not
  installed.
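
The backoff arithmetic can be sketched as follows. This is an illustration built from the documented defaults, not the project's implementation: ``retry_delay`` is a hypothetical helper, and applying the jitter after capping at ``max_delay`` is an assumption.

```python
import random

DEFAULT_OPTIONS = dict(
    base_delay=600,
    multiplier=2,
    jitter=0.25,
    max_delay=43200,
)


def retry_delay(retry_count, options=DEFAULT_OPTIONS, rng=random):
    """Seconds to wait before persistent retry number ``retry_count`` (1-based)."""
    delay = options["base_delay"] * options["multiplier"] ** (retry_count - 1)
    delay = min(delay, options["max_delay"])
    # Assumption: jitter is a symmetric fraction applied after the cap.
    spread = options["jitter"] * delay
    return delay + rng.uniform(-spread, spread)


# With jitter disabled the schedule is deterministic.
no_jitter = dict(DEFAULT_OPTIONS, jitter=0)
print([int(retry_delay(n, no_jitter)) for n in range(1, 9)])
# [600, 1200, 2400, 4800, 9600, 19200, 38400, 43200]
```

With the default 25% jitter, the first delay falls somewhere between 450 and 750 seconds, and the 12-hour cap is reached at the eighth retry.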

``OPENWISP_FIRMWARE_UPGRADER_PERSISTENT_REMINDER_PERIOD``
---------------------------------------------------------

============ =====================
**type**: ``int``
**default**: ``5184000`` (60 days)
============ =====================

Seconds between consecutive reminders for a single persistent batch that
still has pending children. The first reminder fires when the batch is
older than this period; subsequent reminders fire when the same period has
elapsed since the previous send. The reminder itself goes out as a
``pending_upgrade_reminder`` notification to the batch's organization
admins and all superusers.
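
The timing rule above can be expressed as a small predicate. ``reminder_due`` is a hypothetical helper written for this sketch; treating a batch as due exactly when the period has fully elapsed (``>=``) is an assumption about the boundary case.

```python
from datetime import datetime, timedelta, timezone

REMINDER_PERIOD = 5184000  # seconds, the documented default (60 days)


def reminder_due(created, last_reminder, now, period=REMINDER_PERIOD):
    """Return True when a batch with pending children should get a reminder."""
    # The first reminder is measured from batch creation, later ones from
    # the previous send.
    reference = last_reminder or created
    return now - reference >= timedelta(seconds=period)


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
now = datetime(2026, 3, 15, tzinfo=timezone.utc)  # 73 days after creation

print(reminder_due(created, None, now))  # True
print(reminder_due(created, datetime(2026, 3, 2, tzinfo=timezone.utc), now))  # False
```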

The Beat task that drives these reminders
(``send_pending_upgrade_reminders``) must be registered in the
deployment's own ``CELERY_BEAT_SCHEDULE``; see the docker-openwisp and
ansible-openwisp2 recipes for the snippet.

.. _openwisp_custom_openwrt_images:

``OPENWISP_CUSTOM_OPENWRT_IMAGES``
39 changes: 36 additions & 3 deletions docs/user/upgrade-status.rst
@@ -33,6 +33,33 @@ file upload, and firmware flashing.
progress, but only before the firmware flashing phase begins (typically
when progress is below 65%).

Pending
~~~~~~~

**Status**: ``pending``

**Description**: The device was unreachable when the upgrade was last
attempted. The operation keeps a future ``next_retry_at`` and a Celery
Beat task picks it up later. This is the status that persistent mass
upgrades use while a device is offline.

**What happens during this status:**

- ``retry_count`` is incremented and ``next_retry_at`` is scheduled with
  an exponential backoff (10m → 20m → 40m → ..., capped at 12 hours, with
  ±25% jitter)
- A periodic Beat task scans for pending operations whose
  ``next_retry_at`` has elapsed and re-dispatches them
- A device deactivated while pending is set to ``failed`` and not retried

**User Actions**: Pending operations can be cancelled the same way as

Review comment (Member):

This new section says pending operations can be cancelled, but the cancellation section later still says cancellation is possible only while the status is in-progress. I would update that bullet to mention both in-progress and pending.

Reply (Contributor Author):

updated the cancellation bullet and the cancel-button line to mention both in-progress and pending.

in-progress ones, both from the admin and the REST API. Starting another
upgrade on the same device is blocked while one is pending, so the device
cannot be flashed twice. ``pending`` is treated as an active, non-terminal
state by the deletion guard: a pending operation cannot be deleted
directly and must be cancelled or left to reach a terminal state first
(see :ref:`Deleting Upgrade Operations <deleting_upgrade_operations>`).
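
The guards described in this section can be summarised in a short sketch. The helper names are hypothetical, and the sketch deliberately ignores the additional progress-based limit on cancelling in-progress operations.

```python
# Status values as listed by the REST API ``status`` filter.
ACTIVE_STATUSES = {"in-progress", "pending"}
TERMINAL_STATUSES = {"success", "failed", "aborted", "cancelled"}


def can_cancel(status):
    """Active operations (running or awaiting a retry) can be cancelled."""
    return status in ACTIVE_STATUSES


def can_delete(status):
    """Deletion is only allowed once an operation is no longer active."""
    return status not in ACTIVE_STATUSES


def can_start_upgrade(existing_statuses):
    """A device with an active operation cannot be upgraded again."""
    return not any(status in ACTIVE_STATUSES for status in existing_statuses)


print(can_cancel("pending"), can_delete("pending"))  # True False
print(can_start_upgrade(["success", "pending"]))  # False
print(can_start_upgrade(["failed", "cancelled"]))  # True
```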

Success
~~~~~~~

@@ -109,14 +136,15 @@ before completion. This is a deliberate action taken through the admin
interface or REST API.

Users can cancel upgrades through the admin interface using the "Cancel"
-button that appears next to in-progress operations.
+button that appears next to in-progress and pending operations.

**When cancellation is possible:**

- During the early stages of upgrade (typically before 65% progress)
- Before the new firmware image is written to the flash memory of the
network device
-- While the operation status is still "in-progress"
+- While the operation status is still ``in-progress`` or ``pending`` (a
+  pending operation can be cancelled to stop its persistent retries)

**What happens when the upgrade operation is cancelled:**

@@ -184,11 +212,16 @@ about what occurred during the upgrade process.
**Batch Operations**: When performing mass upgrades, you can monitor the
status of individual device upgrades within the batch operation.

.. _deleting_upgrade_operations:

Deleting Upgrade Operations
---------------------------

Upgrade operations and batch upgrade operations can be deleted from the
-admin interface only after they leave the ``in-progress`` state.
+admin interface only after they leave the ``in-progress`` state. The
+``pending`` state is guarded the same way: a pending operation is still
+active (it is waiting to be retried), so it cannot be deleted until it is
+cancelled or reaches a terminal state.

Deleting an operation while it is still running is intentionally blocked
because the upgrade may be uploading or flashing a firmware image,
2 changes: 2 additions & 0 deletions docs/user/websocket-api.rst
@@ -136,6 +136,7 @@ exactly one message:
"batch_status": {
"status": "<string>", // Overall batch status
"completed": <integer>, // Number of completed operations
"pending": <integer>, // Number of operations awaiting a persistent retry
"total": <integer> // Total operations in the batch
},
"operations": [
@@ -180,6 +181,7 @@ The endpoint may push:
"type": "batch_status", // Message type identifier
"status": "<string>", // Overall batch status
"completed": <integer>, // Number of completed operations
"pending": <integer>, // Number of operations awaiting a persistent retry
"total": <integer> // Total operations in the batch
}
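
A client consuming this message might turn it into the "X complete, Y pending" summary shown in the admin. The sketch below assumes the message arrives as a JSON string with the fields listed above; ``summarize_batch`` is a hypothetical helper, not part of the project.

```python
import json


def summarize_batch(message):
    """Render a batch_status message as a short human-readable summary."""
    payload = json.loads(message)
    if payload.get("type") != "batch_status":
        return None
    # Older payloads may lack the "pending" key, so default to zero.
    pending = payload.get("pending", 0)
    summary = f"{payload['completed']} complete"
    if pending:
        summary += f", {pending} pending"
    return f"{payload['status']}: {summary} (of {payload['total']})"


message = json.dumps(
    {
        "type": "batch_status",
        "status": "in-progress",
        "completed": 2,
        "pending": 1,
        "total": 3,
    }
)
print(summarize_batch(message))  # in-progress: 2 complete, 1 pending (of 3)
```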
