Currently, firmware upgrades in OpenWISP happen immediately via Celery tasks.
If a device is offline at the moment of glory, the task fails, and someone has to manually retry it.
This becomes a nightmare in large deployments: admins can't just schedule upgrades and walk away.
What our users want: mass upgrades that patiently wait for devices to come back online and then do their job automatically.
Describe the solution you'd like
Introduce support for persistent upgrade tasks that stick around until offline devices finally wake up and get upgraded.
This change is mainly aimed at mass upgrade operations for now, but it should also be possible to trigger persistent single-device upgrades. For example, we could initially support this only via the REST API and Python scripting, and implement the UI for persistent single upgrades in a future iteration.
The system should ask users whether they want the mass upgrade to retry indefinitely. This option should be enabled by default (checked by default in both the admin interface and the REST API).
This will likely require a new boolean database column, for example persistent (yes/no), which should be visible in the admin and REST API lists for mass upgrade operations.
Once a mass upgrade operation has been launched, this field must not be changeable.
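As a rough illustration of the "not changeable after launch" rule, here is a minimal plain-Python sketch. It is not Django code and not the project's implementation: the class name, the `launched` flag and the local `ValidationError` are all hypothetical stand-ins (a real model would more likely inspect its own status and raise Django's validation error).

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Local stand-in for a framework validation error (assumption)."""


@dataclass
class BatchUpgradeSketch:
    """Hypothetical stand-in for a mass upgrade operation record."""

    persistent: bool = True  # suggested default for mass upgrades
    launched: bool = False   # hypothetical flag marking that the batch has started

    def set_persistent(self, value: bool) -> None:
        # The flag can be toggled freely before launch, never afterwards.
        if self.launched and value != self.persistent:
            raise ValidationError("persistent cannot be changed after launch")
        self.persistent = value


batch = BatchUpgradeSketch()
batch.set_persistent(False)  # allowed: the batch has not been launched yet
batch.launched = True
try:
    batch.set_persistent(True)  # rejected: the batch is already running
except ValidationError as exc:
    print(exc)
```

The same check would need to be enforced in both the admin form and the REST API serializer, so that neither path can change the flag on a running batch.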
Suggested implementation detail:
- The persistence decision should live on the per-device UpgradeOperation, so standalone upgrades can opt in through the REST API or Python scripting. BatchUpgradeOperation should also expose persistent as the mass-upgrade policy, defaulting to enabled, and propagate its value to child UpgradeOperation records when they are created.
- Standalone single-device upgrades should keep the current fail-fast behavior by default unless persistent=True is explicitly requested.
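The two defaults and the propagation step could look roughly like the sketch below. It uses plain dataclasses instead of Django models so that it runs on its own. The `...Sketch` class names, the `devices` list and the `create_operations` method are hypothetical and only illustrate the policy described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UpgradeOperationSketch:
    """Hypothetical per-device operation."""

    device: str
    persistent: bool = False  # standalone upgrades stay fail-fast unless opted in


@dataclass
class BatchUpgradeOperationSketch:
    """Hypothetical mass upgrade operation."""

    devices: List[str]
    persistent: bool = True  # mass upgrades default to persistent
    operations: List[UpgradeOperationSketch] = field(default_factory=list)

    def create_operations(self) -> List[UpgradeOperationSketch]:
        # Copy the batch-level policy onto each child when it is created.
        self.operations = [
            UpgradeOperationSketch(device=name, persistent=self.persistent)
            for name in self.devices
        ]
        return self.operations


batch = BatchUpgradeOperationSketch(devices=["ap-1", "ap-2"])
children = batch.create_operations()
print([op.persistent for op in children])             # children inherit the batch policy
print(UpgradeOperationSketch("ap-3").persistent)      # standalone default
```

Copying the value at creation time, rather than reading it from the batch on every retry, keeps each per-device operation self-describing. That matches the idea that the decision lives on the per-device record.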
Some ideas and guidelines for contributors (but feel free to do your own research and come up with better ideas if deemed proper):
The nitty-gritty of how to schedule retries, detect online devices, and handle notifications is left for contributors to explore.
Safety and reliability are key.
Describe alternatives you've considered
- Retry failed tasks immediately → nope, offline devices would never get upgraded.
- Manual re-triggering → scales about as well as a greased pig on ice.
Additional context
- This feature is all about making mass upgrades less of a headache and more like magic.
- Contributors are encouraged to explore different strategies for persistent queuing, backoff logic, and Celery integration.
- Consider edge cases like very long offline periods, and safe notification handling.
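One possible backoff strategy is a capped exponential delay, sketched below. The cap keeps devices that stay offline for a very long time on a steady, infrequent retry schedule instead of pushing the delay out indefinitely. The constants and the function name are assumptions made for this example, not values proposed by the issue.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BASE_DELAY = timedelta(minutes=5)  # hypothetical first retry delay
MAX_DELAY = timedelta(hours=12)    # hypothetical cap for long offline periods
MAX_EXPONENT = 20                  # keeps the multiplication small for huge retry counts


def compute_next_retry(retry_count: int, now: Optional[datetime] = None) -> datetime:
    """Return when the next retry is due, doubling the delay up to MAX_DELAY."""
    if now is None:
        now = datetime.now(timezone.utc)
    exponent = min(retry_count, MAX_EXPONENT)
    delay = min(BASE_DELAY * (2 ** exponent), MAX_DELAY)
    return now + delay


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
for attempt in (0, 1, 3, 50):
    print(attempt, compute_next_retry(attempt, start) - start)
```

A time-based schedule like this would work as a fallback alongside an event-driven wake-up when the device is reported online again. How the two are combined is left open here.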
Constraints
- Test coverage must not decrease.
- Basic browser tests for UI related features are required.
- Documentation needs to be updated to include this new feature, including updating any existing screenshots that may change after implementation.
- We also need a short example usage video for YouTube that we can showcase on the website/documentation.
Ideas and guidelines for contributors, by area:
Persistent task records
- Reuse the UpgradeOperation model for most of the data.
- Add retry_count and optionally scheduled_time to track when the next retry is due.
- Prefer next_retry_at rather than scheduled_time, to avoid confusion with scheduled mass upgrades.
- Add a pending status to represent an upgrade operation waiting for the device to come back online.
Device online detection
- Use the health_status_changed signal from OpenWISP Monitoring (mock it in CI/testing).
Retry strategy
- When sending a generic_notification about devices still waiting for upgrade, we need to link the notification to the mass upgrade operation. Let's make sure the link automatically filters devices still pending upgrade (this should already be implemented and should be a matter of using the right URL, but let's double check).
Integration with Celery
Failure handling & notifications
- Use generic_notification to inform the admins.
Edge cases
- The health_status_changed signal triggers the wake-up of an upgrade operation while the upgrade operation is seemingly already woken up → ignore and do nothing.
- The transition from pending to in-progress must be guarded so concurrent signal/periodic wake-ups cannot dispatch duplicate retries.
- Make sure pending operations are treated as active conflicts, not as completed/failed operations.
Scalability
Metrics & observability
- Expose the pending status in admin filters and REST API filters.
- pending must be handled correctly: pending operations must not be counted as completed, and the UI should show that they are waiting for retry.
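The edge case about concurrent wake-ups (a signal and a periodic task both trying to move the same operation from pending to in-progress) comes down to an atomic "claim" step. The sketch below models it in memory with a lock. The `OperationStore` class and its methods are hypothetical. In a Django model the same effect is usually obtained with a conditional queryset update that filters on the pending status and checks the number of rows it changed.

```python
import threading


class OperationStore:
    """Hypothetical in-memory stand-in for the upgrade operations table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, op_id, status="pending"):
        with self._lock:
            self._status[op_id] = status

    def claim(self, op_id):
        """Move pending -> in-progress; only the first caller gets True."""
        with self._lock:
            if self._status.get(op_id) != "pending":
                return False  # already claimed (or unknown): do nothing
            self._status[op_id] = "in-progress"
            return True


store = OperationStore()
store.add(1)
results = []

# Simulate several wake-ups racing for the same operation.
threads = [
    threading.Thread(target=lambda: results.append(store.claim(1)))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results.count(True))  # exactly one wake-up wins the claim
```

Only the caller that receives `True` should dispatch the retry; every other wake-up returns early, which also covers the "already woken up, ignore and do nothing" case.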