
[feature] Persistent Mass Upgrades #379

@nemesifier

Description

Currently, firmware upgrades in OpenWISP happen immediately via Celery tasks.
If a device is offline when the task runs, the task fails, and someone has to retry it manually.

This becomes a nightmare in large deployments: admins can't just schedule upgrades and walk away.

What our users want: mass upgrades that patiently wait for devices to come back online and then do their job automatically.

Describe the solution you'd like


Introduce support for persistent upgrade tasks that stick around until offline devices finally wake up and get upgraded.

This change is mainly aimed at mass upgrade operations for now, but it should also be possible to trigger persistent single-device upgrades. For example, we could support doing this only via the REST API and Python scripting for now, and implement the UI for persistent single-device upgrades in a future iteration.

The system should ask users whether they want the mass upgrade to retry indefinitely, with the default option suggested as enabled (checked by default in both the admin interface and REST API).
This will likely require a new boolean database column, for example persistent (yes/no), which should be visible in the admin and REST API lists for mass upgrade operations.
Once a mass upgrade operation has been launched, this field must not be changeable.
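
One possible way to enforce that immutability, sketched in plain Python (the real implementation would live in the Django model's validation/save logic; the class and attribute names here are hypothetical):

```python
class BatchUpgradeOperation:
    """Hypothetical sketch: `persistent` is frozen once the batch launches."""

    def __init__(self, persistent=True):  # default enabled, as proposed above
        self._persistent = persistent
        self._launched = False

    def launch(self):
        # Once launched, the persistence policy becomes read-only.
        self._launched = True

    @property
    def persistent(self):
        return self._persistent

    @persistent.setter
    def persistent(self, value):
        if self._launched:
            raise ValueError("'persistent' cannot be changed after launch")
        self._persistent = value
```

In the admin this would translate to marking the field read-only once the operation exists; in the REST API, to rejecting writes to the field on update requests.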

Suggested implementation detail:

  • The persistence decision should live on the per-device UpgradeOperation, so standalone upgrades can opt in through REST API or Python scripting.
  • BatchUpgradeOperation should also expose persistent as the mass-upgrade policy, defaulting to enabled, and propagate its value to child UpgradeOperation records when they are created.
  • Standalone single-device upgrades should keep the current fail-fast behavior by default unless persistent=True is explicitly requested.
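
A minimal sketch of the proposed field layout and propagation, in plain Python (dataclasses stand in for the actual Django models; field names like next_retry_at follow the suggestions in this issue but are not final):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UpgradeOperation:
    """Per-device operation; `persistent` opts into indefinite retries."""

    device: str
    status: str = "in-progress"   # a new "pending" value is proposed below
    persistent: bool = False      # standalone upgrades stay fail-fast by default
    retry_count: int = 0
    next_retry_at: Optional[datetime] = None


@dataclass
class BatchUpgradeOperation:
    """Mass-upgrade policy holder; `persistent` defaults to enabled."""

    persistent: bool = True

    def create_child(self, device: str) -> UpgradeOperation:
        # Propagate the batch-level policy to the per-device record.
        return UpgradeOperation(device=device, persistent=self.persistent)
```

The key point is that the batch only sets the policy at creation time; everything else (retry counting, next-retry scheduling) lives on the per-device record.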

Some ideas and guidelines for contributors (but feel free to do your own research and come up with better ideas where appropriate):

  • Persistent task records

    • Use the existing UpgradeOperation model for most of the data.
    • Add retry_count and optionally scheduled_time to track when the next retry is due.
    • Consider naming the retry datetime field next_retry_at rather than scheduled_time, to avoid confusion with scheduled mass upgrades.
    • Add a pending status to represent an upgrade operation waiting for the device to come back online.
    • These tasks must survive system restarts and database migrations.
  • Device online detection

    • Prefer using the health_status_changed signal from OpenWISP Monitoring (mock it in CI/testing).
    • Fallback: periodic retries with randomized exponential backoff (configurable, max once every 12 hours).
  • Retry strategy

    • Randomized exponential backoff with indefinite retries.
    • Persistence should apply only to failures where the upgrade could not really start because the device was unreachable/offline. Other failures, such as checksum errors or failures after flashing started, should continue to require manual review.
    • We need a periodic reminder; the default period can be 2 months. When the end of the period is reached, we shall notify the admins via generic_notification about devices still waiting for upgrade. The notification must link to the mass upgrade operation, and the link should automatically filter the devices still pending upgrade (this should already be implemented and should just be a matter of using the right URL, but let's double check).
    • Reminders repeat periodically until the admin cancels the operation or all devices are upgraded.
  • Integration with Celery

    • Existing upgrade operations already run as Celery tasks.
    • Introduce a new task (or tasks) to “wake up” pending upgrades for mass upgrades.
    • Randomized retry delays prevent all upgrades from running simultaneously and overloading the system.
  • Failure handling & notifications

    • Failures that need human attention: devices offline too long, upgrade errors, checksum issues.
    • Use generic_notification to inform the admins.
    • Failed upgrades require manual review; automatic retries after fixes are optional.
  • Edge cases

    • If the health_status_changed signal triggers the wake-up of an upgrade operation that has seemingly already been woken up → ignore and do nothing.
    • Retrying a pending operation must be idempotent. Use an atomic transition from pending to in-progress so concurrent signal/periodic wake-ups cannot dispatch duplicate retries.
    • If a device is deactivated while an operation is pending, the pending operation should stop retrying and be marked as failed.
    • Existing queue/conflict logic should be verified and updated where needed so pending operations are treated as active conflicts, not as completed/failed operations.
    • Only one upgrade per device is allowed; queue order and conflicts are already handled, so no changes should be needed.
    • No rollback support post-flashing, as it's not technically possible; firmware conflicts are already managed by existing logic.
  • Scalability

    • No hard-coded limit on pending tasks; system scales with available workers.
    • Randomized backoff prevents broker/database overload.
    • No batching required; retries remain random for now.
  • Metrics & observability

    • Track retry counts for upgrade operations (visible in admin/REST API).
    • Expose pending status in admin filters and REST API filters.
    • Ensure batch progress and WebSocket/admin UI handle pending correctly: pending operations must not be counted as completed, and the UI should show that they are waiting for retry.
    • Failures already handled by existing logic.
    • Minimal admin UI and REST API exposure for filtering pending upgrade operations that are going to be retried.
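
The idempotent wake-up described above could hinge on a single atomic pending → in-progress transition; with Django this would likely be a compare-and-set in the style of queryset.filter(status="pending").update(status="in-progress"). Here is a plain-Python sketch of the idea, with a lock standing in for the database (all names are illustrative):

```python
import threading


class OperationStore:
    """Toy stand-in for the database; claim_pending is the atomic transition."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, op_id, status="pending"):
        self._status[op_id] = status

    def claim_pending(self, op_id):
        # Atomically move pending -> in-progress; return False if the operation
        # was already claimed, so concurrent signal/periodic wake-ups cannot
        # dispatch duplicate retries.
        with self._lock:
            if self._status.get(op_id) != "pending":
                return False
            self._status[op_id] = "in-progress"
            return True


def wake_up(store, op_id):
    if store.claim_pending(op_id):
        return "retry dispatched"  # the real code would enqueue the Celery task here
    return "ignored"
```

Whichever wake-up path fires first (monitoring signal or periodic fallback) wins the claim; the other path sees the operation is no longer pending and does nothing.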

The nitty-gritty of how to schedule retries, detect online devices, and handle notifications is left for contributors to explore.
Safety and reliability are key.
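
As one starting point, the randomized exponential backoff could look like the following sketch (function name, base delay, and jitter factor are illustrative; the 12-hour ceiling matches the configurable maximum suggested above):

```python
import random

MAX_DELAY = 12 * 60 * 60  # cap: at most one retry every 12 hours


def retry_delay(retry_count, base=300, jitter=0.5):
    """Seconds to wait before the next retry, with randomized jitter.

    base * 2**retry_count grows exponentially and is capped at MAX_DELAY;
    the jitter spreads retries out so devices don't all wake up at once.
    """
    delay = min(base * (2 ** retry_count), MAX_DELAY)
    # Randomize within [delay * (1 - jitter), delay] to avoid thundering herds.
    return delay * (1 - jitter * random.random())
```

The computed value would feed the next_retry_at field (or the Celery countdown) of the pending operation.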

Describe alternatives you've considered

  • Retry failed tasks immediately → nope, offline devices would never get upgraded.
  • Manual re-triggering → scales about as well as a greased pig on ice.

Additional context

  • This feature is all about making mass upgrades less of a headache and more like magic.
  • Contributors are encouraged to explore different strategies for persistent queuing, backoff logic, and Celery integration.
  • Consider edge cases like very long offline periods, and safe notification handling.

Constraints

  • Test coverage must not decrease.
  • Basic browser tests for UI-related features are required.
  • Documentation needs to be updated to include this new feature, including updating any existing screenshots that may change after implementation.
  • We also need a short example usage video for YouTube that we can showcase on the website/documentation.

Metadata

Labels: enhancement (New feature or request), gsoc-idea (Issues part of Google Summer of Code project)

Project status: To do (Device management)
