
[feature] Persistent Mass Upgrades #379

@nemesifier

Description

Currently, firmware upgrades in OpenWISP happen immediately via Celery tasks.
If a device is offline when the task runs, the task fails, and someone has to retry it manually.

This becomes a nightmare in large deployments: admins can't just schedule upgrades and walk away.

What our users want: mass upgrades that patiently wait for devices to come back online and then do their job automatically.

Describe the solution you'd like


Introduce support for persistent upgrade tasks that stick around until offline devices finally wake up and get upgraded.

This change is mainly aimed at mass upgrade operations for now, but it should also be possible to trigger persistent single-device upgrades. For example, we could support doing this only via the REST API and Python scripting for now, and implement the UI for persistent single-device upgrades in a future iteration.

The system should ask users whether they want the mass upgrade to retry indefinitely, with the default option suggested as enabled (checked by default in both the admin interface and REST API).
This will likely require a new boolean database column, for example persistent (yes/no), which should be visible in the admin and REST API lists for mass upgrade operations.
Once a mass upgrade operation has been launched, this field must not be changeable.
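
One possible way to enforce that immutability, sketched in plain Python (the real implementation would live in the Django model's validation/save logic; the class and attribute names here are hypothetical):

```python
class BatchUpgradeOperation:
    """Hypothetical sketch: `persistent` is frozen once the batch launches."""

    def __init__(self, persistent=True):  # default enabled, as proposed above
        self._persistent = persistent
        self._launched = False

    def launch(self):
        # Once launched, the persistence policy becomes read-only.
        self._launched = True

    @property
    def persistent(self):
        return self._persistent

    @persistent.setter
    def persistent(self, value):
        if self._launched:
            raise ValueError("'persistent' cannot be changed after launch")
        self._persistent = value
```

In the admin this would translate to marking the field read-only once the operation exists; in the REST API, to rejecting writes to the field on update requests.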

Suggested implementation detail:

  • The persistence decision should live on the per-device UpgradeOperation, so standalone upgrades can opt in through REST API or Python scripting.
  • BatchUpgradeOperation should also expose persistent as the mass-upgrade policy, defaulting to enabled, and propagate its value to child UpgradeOperation records when they are created.
  • Standalone single-device upgrades should keep the current fail-fast behavior by default unless persistent=True is explicitly requested.
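
A minimal sketch of the proposed field layout and propagation, in plain Python (dataclasses stand in for the actual Django models; field names like next_retry_at follow the suggestions in this issue but are not final):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UpgradeOperation:
    """Per-device operation; `persistent` opts into indefinite retries."""

    device: str
    status: str = "in-progress"   # a new "pending" value is proposed below
    persistent: bool = False      # standalone upgrades stay fail-fast by default
    retry_count: int = 0
    next_retry_at: Optional[datetime] = None


@dataclass
class BatchUpgradeOperation:
    """Mass-upgrade policy holder; `persistent` defaults to enabled."""

    persistent: bool = True

    def create_child(self, device: str) -> UpgradeOperation:
        # Propagate the batch-level policy to the per-device record.
        return UpgradeOperation(device=device, persistent=self.persistent)
```

The key point is that the batch only sets the policy at creation time; everything else (retry counting, next-retry scheduling) lives on the per-device record.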

Some ideas and guidelines for contributors (but feel free to do your own research and come up with better ideas where appropriate):

  • Persistent task records

    • Use the existing UpgradeOperation model for most of the data.
    • Add retry_count and optionally scheduled_time to track when the next retry is due.
    • Consider naming the retry datetime field next_retry_at rather than scheduled_time, to avoid confusion with scheduled mass upgrades.
    • Add a pending status to represent an upgrade operation waiting for the device to come back online.
    • These tasks must survive system restarts and database migrations.
  • Device online detection

    • Prefer using the health_status_changed signal from OpenWISP Monitoring (mock it in CI/testing).
    • Fallback: periodic retries with randomized exponential backoff (configurable, max once every 12 hours).
  • Retry strategy

    • Randomized exponential backoff with indefinite retries.
    • Persistence should apply only to failures where the upgrade could not really start because the device was unreachable/offline. Other failures, such as checksum errors or failures after flashing started, should continue to require manual review.
    • We need a periodic reminder; the default period can be 2 months. When the end of the period is reached, we shall notify the admins via generic_notification about devices still waiting for upgrade. The notification must link to the mass upgrade operation, and the link should automatically filter the devices still pending upgrade (this should already be implemented and should just be a matter of using the right URL, but let's double check).
    • Reminders repeat periodically until the admin cancels the operation or all devices are upgraded.
  • Integration with Celery

    • Existing upgrade operations already run as Celery tasks.
    • Introduce a new task (or tasks) to “wake up” pending upgrades for mass upgrades.
    • Randomized retry delays prevent all upgrades from running simultaneously and overloading the system.
  • Failure handling & notifications

    • Failures that need human attention: devices offline too long, upgrade errors, checksum issues.
    • Use generic_notification to inform the admins.
    • Failed upgrades require manual review; automatic retries after fixes are optional.
  • Edge cases

    • If the health_status_changed signal triggers the wake-up of an upgrade operation that has seemingly already been woken up → ignore and do nothing.
    • Retrying a pending operation must be idempotent. Use an atomic transition from pending to in-progress so concurrent signal/periodic wake-ups cannot dispatch duplicate retries.
    • If a device is deactivated while an operation is pending, the pending operation should stop retrying and be marked as failed.
    • Existing queue/conflict logic should be verified and updated where needed so pending operations are treated as active conflicts, not as completed/failed operations.
    • Only one upgrade per device is allowed; queue order and conflicts are already handled, so no changes should be needed.
    • No rollback support post-flashing, as it's not technically possible; firmware conflicts are already managed by existing logic.
  • Scalability

    • No hard-coded limit on pending tasks; system scales with available workers.
    • Randomized backoff prevents broker/database overload.
    • No batching required; retries remain random for now.
  • Metrics & observability

    • Track retry counts for upgrade operations (visible in admin/REST API).
    • Expose pending status in admin filters and REST API filters.
    • Ensure batch progress and WebSocket/admin UI handle pending correctly: pending operations must not be counted as completed, and the UI should show that they are waiting for retry.
    • Failures already handled by existing logic.
    • Minimal admin UI and REST API exposure for filtering pending upgrade operations that are going to be retried.
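
The idempotent wake-up described above could hinge on a single atomic pending → in-progress transition; with Django this would likely be a compare-and-set in the style of queryset.filter(status="pending").update(status="in-progress"). Here is a plain-Python sketch of the idea, with a lock standing in for the database (all names are illustrative):

```python
import threading


class OperationStore:
    """Toy stand-in for the database; claim_pending is the atomic transition."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, op_id, status="pending"):
        self._status[op_id] = status

    def claim_pending(self, op_id):
        # Atomically move pending -> in-progress; return False if the operation
        # was already claimed, so concurrent signal/periodic wake-ups cannot
        # dispatch duplicate retries.
        with self._lock:
            if self._status.get(op_id) != "pending":
                return False
            self._status[op_id] = "in-progress"
            return True


def wake_up(store, op_id):
    if store.claim_pending(op_id):
        return "retry dispatched"  # the real code would enqueue the Celery task here
    return "ignored"
```

Whichever wake-up path fires first (monitoring signal or periodic fallback) wins the claim; the other path sees the operation is no longer pending and does nothing.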

The nitty-gritty of how to schedule retries, detect online devices, and handle notifications is left for contributors to explore.
Safety and reliability are key.
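
As one starting point, the randomized exponential backoff could look like the following sketch (function name, base delay, and jitter factor are illustrative; the 12-hour ceiling matches the configurable maximum suggested above):

```python
import random

MAX_DELAY = 12 * 60 * 60  # cap: at most one retry every 12 hours


def retry_delay(retry_count, base=300, jitter=0.5):
    """Seconds to wait before the next retry, with randomized jitter.

    base * 2**retry_count grows exponentially and is capped at MAX_DELAY;
    the jitter spreads retries out so devices don't all wake up at once.
    """
    delay = min(base * (2 ** retry_count), MAX_DELAY)
    # Randomize within [delay * (1 - jitter), delay] to avoid thundering herds.
    return delay * (1 - jitter * random.random())
```

The computed value would feed the next_retry_at field (or the Celery countdown) of the pending operation.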

Describe alternatives you've considered

  • Retry failed tasks immediately → nope, offline devices would never get upgraded.
  • Manual re-triggering → scales about as well as a greased pig on ice.

Additional context

  • This feature is all about making mass upgrades less of a headache and more like magic.
  • Contributors are encouraged to explore different strategies for persistent queuing, backoff logic, and Celery integration.
  • Consider edge cases like very long offline periods, and safe notification handling.

Constraints

  • Test coverage must not decrease.
  • Basic browser tests for UI-related features are required.
  • Documentation needs to be updated to include this new feature, including updating any existing screenshots that may change after implementation.
  • We also need a short example usage video for YouTube that we can showcase on the website/documentation.

Metadata

Labels: enhancement (New feature or request), gsoc-idea (Issues part of Google Summer of Code project)

Project status: To do (Device management)
