Persistent Mass Upgrades
========================

When a mass upgrade runs against a large fleet, some devices are usually
offline at that moment. Without persistence, each unreachable device ends
up as ``failed`` once the immediate retries are exhausted, leaving the
operator to track down and re-launch the upgrade for every failed device
by hand.

A *persistent* mass upgrade does not give up on offline devices. Instead
of marking them ``failed``, it parks them in the ``pending`` state with a
scheduled retry time and keeps retrying in the background until each
device comes back online or its operation is cancelled.

.. contents:: **Table of contents**:
    :depth: 2
    :local:

How it works
------------

When a persistent operation cannot reach its device, it transitions to
``pending`` instead of ``failed``: its ``retry_count`` is incremented and
its ``next_retry_at`` is pushed forward using exponential backoff. A
periodic Celery Beat task re-dispatches pending operations once their
retry time has elapsed, and the batch stays ``in-progress`` until every
device has either been upgraded or had its operation cancelled.

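The backoff step can be pictured with the short sketch below. It is an illustration under stated assumptions, not the project's implementation: the ``schedule_retry`` helper, the 5-minute base delay, the doubling factor and the 12-hour cap are all hypothetical, and the values used in practice may differ.

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    # Hypothetical values chosen for illustration only.
    BASE_DELAY = timedelta(minutes=5)
    MAX_DELAY = timedelta(hours=12)
    # Capping the exponent keeps the multiplication far below timedelta's limit.
    MAX_EXPONENT = 20


    def schedule_retry(retry_count, now):
        """Return (new_retry_count, next_retry_at) after an unreachable attempt.

        The delay doubles on every retry (5, 10, 20, 40 minutes, ...) and
        never exceeds MAX_DELAY, so long-offline devices keep being retried
        at a bounded interval.
        """
        new_count = retry_count + 1
        exponent = min(new_count - 1, MAX_EXPONENT)
        delay = min(BASE_DELAY * 2**exponent, MAX_DELAY)
        return new_count, now + delay


    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for count in range(4):
        # Print the retry number and how far in the future it is scheduled.
        print(count + 1, schedule_retry(count, now)[1] - now)

Capping the delay is a design choice: without a cap, a device that stays offline for days would be retried less and less often.
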
.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png

The mass-upgrade page above stays ``in progress`` while one device is
still ``pending``, reporting ``2 complete, 1 pending`` and keeping the
batch open until the offline device is retried successfully or cancelled.

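The rule that keeps a batch open reduces to a check over the statuses of its operations. The minimal sketch below assumes that ``in-progress`` and ``pending`` are the only active states; ``batch_status`` and its ``finished`` return value are hypothetical and say nothing about how a completed batch is actually reported.

.. code-block:: python

    # States in which an operation is still being worked on (assumed).
    ACTIVE_STATES = {"in-progress", "pending"}


    def batch_status(operation_statuses):
        """Hypothetical helper: the batch stays open while any operation is active."""
        if any(status in ACTIVE_STATES for status in operation_statuses):
            return "in-progress"
        return "finished"


    # Two devices upgraded, one still offline: the batch remains open.
    print(batch_status(["success", "success", "pending"]))
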
See :doc:`upgrade-status` for the full operation state machine and the
meaning of the ``pending`` state.

Enabling from the admin
-----------------------

On the mass-upgrade confirmation page (reached from a build's *Upgrade*
action), the **persistent** checkbox is pre-checked. Leave it checked to
keep retrying offline devices, or uncheck it to fall back to the
non-persistent behaviour, in which unreachable devices end up as
``failed``.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png

The flag is locked in once the mass upgrade leaves the ``idle`` state, so
it cannot be changed midway through a running batch.

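A minimal sketch of such a lock follows, assuming a plain dictionary in place of the real model; ``PersistentFlagLocked`` and ``set_persistent`` are hypothetical names, not part of the project's API.

.. code-block:: python

    class PersistentFlagLocked(Exception):
        """Raised when is_persistent is changed after the batch has started."""


    def set_persistent(batch, value):
        # `batch` is a plain dict standing in for the real model (assumption).
        # Re-saving the same value is harmless; only an actual change is refused.
        if batch["status"] != "idle" and batch["is_persistent"] != value:
            raise PersistentFlagLocked(
                f"is_persistent cannot change while the batch is {batch['status']!r}"
            )
        batch["is_persistent"] = value
        return batch
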
Enabling via the REST API
-------------------------

The mass-upgrade endpoint accepts an ``is_persistent`` field that defaults
to ``true``; the single-device upgrade endpoint accepts the same field but
defaults to ``false``. See :doc:`rest-api` for the full request and
response reference.

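For scripted rollouts it is safest to send ``is_persistent`` explicitly rather than rely on the per-endpoint defaults. The sketch below only builds the request with Python's standard library. The endpoint URL and token are placeholders, the ``POST`` method and the ``Authorization: Bearer`` header are assumptions to check against :doc:`rest-api`, and any other fields the endpoint requires are omitted.

.. code-block:: python

    import json
    import urllib.request


    def build_upgrade_request(endpoint_url, token, is_persistent):
        """Prepare, without sending, a POST that sets is_persistent explicitly.

        The POST method and the Bearer-token header are assumptions; check
        the REST API reference for the real endpoint, authentication scheme
        and any additional required fields.
        """
        payload = {"is_persistent": is_persistent}
        return urllib.request.Request(
            endpoint_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )


    # Placeholder URL and token: substitute your deployment's values.
    req = build_upgrade_request(
        "https://openwisp.example.com/api/v1/<mass-upgrade-endpoint>/",
        token="<api-token>",
        is_persistent=False,  # opt this batch out of persistence
    )
    # Sending it would be: urllib.request.urlopen(req)

Sending the flag explicitly, even when it matches the default, keeps the script's behaviour obvious regardless of which endpoint it targets.
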
Finding pending operations
--------------------------

Pending operations are listed in the upgrade-operation admin and can be
isolated by setting the ``status`` filter to ``pending``. The list shows
the ``persistent`` flag and a ``retry_count`` column, which counts how
many times each operation has been retried so far.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png

An operation's detail page additionally shows ``next_retry_at`` (when the
next attempt is scheduled) and a log recording each attempt, whose last
entry is the ``persistent retry`` line for the next, backoff-delayed run.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png

Cancelling a pending operation
------------------------------

A pending operation is still active, so it can be cancelled the same way
as an in-progress one: with the cancel button in the admin or through the
REST API cancel endpoint. Cancelling stops the retry loop and moves the
operation to ``cancelled``. A pending operation cannot be *deleted* until
it reaches a terminal state (see :ref:`deleting_upgrade_operations`).

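The cancel and delete rules above can be summarised as two checks on the operation status. This is a sketch, not the project's code: the helper names are hypothetical and the state sets (including ``aborted``) are assumptions; :doc:`upgrade-status` remains the authoritative reference.

.. code-block:: python

    # Assumed state sets; see the upgrade-status page for the real machine.
    ACTIVE_STATES = {"in-progress", "pending"}
    TERMINAL_STATES = {"success", "failed", "aborted", "cancelled"}


    def can_cancel(status):
        return status in ACTIVE_STATES


    def can_delete(status):
        return status in TERMINAL_STATES


    def cancel(operation):
        """Move an active operation to 'cancelled'; retries only target 'pending' ones."""
        if not can_cancel(operation["status"]):
            raise ValueError(f"cannot cancel an operation in state {operation['status']!r}")
        operation["status"] = "cancelled"
        return operation


    op = {"status": "pending"}
    print(can_delete(op["status"]))  # a pending operation cannot be deleted yet
    cancel(op)
    print(can_delete(op["status"]))  # once cancelled, it can be deleted
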
Notifications
-------------

Two notifications keep operators informed about long-running persistent
upgrades:

- a **reminder** fires when a persistent batch still has pending
  operations after the configured cadence has elapsed, and
- a **failure** notification fires when a persistent operation finally
  ends as ``failed`` (for example, because the device was deactivated
  while the operation was pending).

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png

The cadence and related settings are documented in :doc:`settings`.

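Whether a reminder is due can be sketched as a comparison against the cadence. In this illustration the batch is a plain dictionary, ``reminder_due`` and its keys are hypothetical, the one-day cadence is an arbitrary example value, and the elapsed time is assumed to be measured from the batch start or the previous reminder.

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    # Hypothetical cadence; the real value comes from the settings page.
    REMINDER_CADENCE = timedelta(days=1)


    def reminder_due(batch, now, cadence=REMINDER_CADENCE):
        """Return True if a persistent batch with pending operations is overdue.

        `batch` is a hypothetical dict; "reference_time" is assumed to be the
        later of the batch start and the last reminder sent.
        """
        has_pending = "pending" in batch["operation_statuses"]
        elapsed = now - batch["reference_time"]
        return batch["is_persistent"] and has_pending and elapsed >= cadence


    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    batch = {
        "is_persistent": True,
        "operation_statuses": ["success", "pending"],
        "reference_time": start,
    }
    print(reminder_due(batch, start + timedelta(hours=25)))
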
Behaviour with and without openwisp-monitoring
----------------------------------------------

Persistent upgrades work with Celery Beat alone: the periodic scan retries
due pending operations on a fixed cadence. Installing
``openwisp-monitoring`` adds a faster wake-up path: when a device returns
to a healthy state, its pending retries are triggered immediately, without
waiting for the next scan. When ``openwisp-monitoring`` is not installed,
the Beat scan remains the only retry trigger.

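The two retry triggers differ mainly in how they select operations. In the sketch below, plain dictionaries stand in for operations, the function names are hypothetical, and the wake-up path is assumed to bypass ``next_retry_at``.

.. code-block:: python

    from datetime import datetime, timedelta, timezone


    def due_for_scan(operations, now):
        """Periodic Beat path: pending operations whose retry time has passed."""
        return [
            op for op in operations
            if op["status"] == "pending" and op["next_retry_at"] <= now
        ]


    def due_for_device(operations, device_id):
        """Monitoring wake-up path: pending operations of a device that just
        became healthy (assumed here to ignore next_retry_at)."""
        return [
            op for op in operations
            if op["status"] == "pending" and op["device_id"] == device_id
        ]


    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    operations = [
        {"id": 1, "device_id": "a", "status": "pending", "next_retry_at": now - timedelta(minutes=1)},
        {"id": 2, "device_id": "b", "status": "pending", "next_retry_at": now + timedelta(hours=2)},
        {"id": 3, "device_id": "c", "status": "success", "next_retry_at": None},
    ]
    print([op["id"] for op in due_for_scan(operations, now)])
    print([op["id"] for op in due_for_device(operations, "b")])
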
The periodic tasks (``check_pending_upgrades`` and
``send_pending_upgrade_reminders``) must be present in the deployment's
``CELERY_BEAT_SCHEDULE``; see :doc:`settings`.

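A ``CELERY_BEAT_SCHEDULE`` fragment for ``settings.py`` might look like the following. The dotted task paths and both intervals are assumptions made for this sketch; take the exact values from :doc:`settings`. If the deployment already defines ``CELERY_BEAT_SCHEDULE`` for other modules, add these entries to it instead of replacing it.

.. code-block:: python

    from datetime import timedelta

    CELERY_BEAT_SCHEDULE = {
        # Periodic scan that re-dispatches pending operations whose retry time is due.
        "check_pending_upgrades": {
            "task": "openwisp_firmware_upgrader.tasks.check_pending_upgrades",
            "schedule": timedelta(minutes=5),  # assumed interval
        },
        # Reminders for persistent batches that still have pending operations.
        "send_pending_upgrade_reminders": {
            "task": "openwisp_firmware_upgrader.tasks.send_pending_upgrade_reminders",
            "schedule": timedelta(hours=1),  # assumed interval
        },
    }
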