[feature] Persistent Mass Upgrades - schema, retry backoff, batch state and Beat scanner #436
Open: Eeshu-Yadav wants to merge 34 commits into gsoc26-persistent-scheduled-upgrades from issues/417-persistence-schema-fields
34 commits
- 8e64afe [feature] Added persistence model fields and batch-to-child propagati… (Eeshu-Yadav)
- da23c61 [feature] Added pending status and migration for persistence schema f… (Eeshu-Yadav)
- 22991e2 [feature] Locked is_persistent flag after launch via clean() guard #417 (Eeshu-Yadav)
- 6228b83 [tests] Added tests for persistence model fields, propagation and imm… (Eeshu-Yadav)
- 59bb81d [feature] Added persistent retry branch with exponential backoff to f… (Eeshu-Yadav)
- ee09bd2 [tests] Added tests for persistent retry branch and backoff calculati… (Eeshu-Yadav)
- 972cf25 [change] Wired pending status into batch aggregation, cancellation an… (Eeshu-Yadav)
- fadfaf0 [tests] Added mixed success+pending batch tests for pending state coh… (Eeshu-Yadav)
- 851c6be [change] Added pending counter, pending_count property and X complete… (Eeshu-Yadav)
- ff45bb0 [change] Fixed WebSocket consumer snapshots to handle pending status … (Eeshu-Yadav)
- 379a5f8 [change] Extended WebSocket batch payload with pending count and X co… (Eeshu-Yadav)
- 5c959e2 [tests] Added pending_count and WebSocket consumer snapshot regressio… (Eeshu-Yadav)
- 98dd4b8 [tests] Asserted ValidationError messages and verified next_retry_at … (Eeshu-Yadav)
- 2f1cf28 [tests] Added override_settings, retry_count=0 guard and non-Recovera… (Eeshu-Yadav)
- 0303cb2 [tests] Strengthened concurrent-guard test and added WebSocket push-p… (Eeshu-Yadav)
- b5e6d6d [feature] Added Beat scanner and retry worker for persistent pending … (Eeshu-Yadav)
- 21c2dc1 [tests] Added tests for persistent retry pipeline #424 (Eeshu-Yadav)
- 30874c5 [change] Extracted is_persistent immutability check and consolidated … (Eeshu-Yadav)
- f368599 [docs] Documented persistent mass upgrades, pending status and new se… (Eeshu-Yadav)
- 5ebcf4c [tests] Added Selenium assertion for the pending-branch of the batch … (Eeshu-Yadav)
- c5ca1d2 [feature] Added monitoring health_status_changed handler for fast per… (Eeshu-Yadav)
- bef4c28 [fix] Handle deleted-op race in retry_pending_upgrade #424 (Eeshu-Yadav)
- 9a15d93 [feature] Added pending-upgrade reminders and failure-needs-attention… (Eeshu-Yadav)
- 85fea73 [feature] Added is_persistent checkbox and pending-state admin surfac… (Eeshu-Yadav)
- b753630 [fix] Registered dedicated notification types for persistent-upgrade … (Eeshu-Yadav)
- ee189d6 [fix] Render pending and failed progress bars correctly #423 (Eeshu-Yadav)
- 4a32524 [fix] Extend deletion guard in admin to also block pending operations… (Eeshu-Yadav)
- a8f2845 [tests] Use mock.patch.object in is_persistent tests #417 (Eeshu-Yadav)
- 1cd884c [feature] Added is_persistent, retry_count and next_retry_at to REST … (Eeshu-Yadav)
- bdd6a17 Merge branch 'gsoc26-persistent-scheduled-upgrades' into issues/417-p… (nemesifier)
- baeef7c [fix] Address review feedback on persistent upgrades #379 (Eeshu-Yadav)
- b404531 [tests] Add persistence retry-loop, time_travel helper and Selenium t… (Eeshu-Yadav)
- 191a1c8 [fix] Address review feedback on persistent upgrades #379 (Eeshu-Yadav)
- 8c60dc3 [docs] Add Persistent Mass Upgrades page #379 (Eeshu-Yadav)
@@ -0,0 +1,119 @@
Persistent Mass Upgrades
========================

When a mass upgrade runs against a large fleet, some devices are usually
offline at that moment. Without persistence, each unreachable device ends
as ``failed`` once the immediate retries are exhausted, leaving the
operator to track down and re-launch every failed device by hand.

A *persistent* mass upgrade does not give up on offline devices. Instead
of marking them ``failed``, it parks them in the ``pending`` state with a
scheduled retry time and keeps retrying in the background until the device
comes back online or the operation is cancelled.

.. contents:: **Table of contents**:
    :depth: 2
    :local:

How it works
------------

An operation whose device is unreachable transitions to ``pending``
instead of ``failed``, with an incremented ``retry_count`` and an
exponential-backoff ``next_retry_at`` (10 minutes, doubling on each retry
up to a 12-hour cap, with ±25% jitter). A periodic Celery Beat task
re-dispatches pending operations once their retry time has elapsed, and
the batch stays ``in-progress`` until every device has either upgraded or
been cancelled.
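
The backoff schedule described above can be sketched in a few lines of
Python. This is an illustration only, not the project's implementation:
the names ``backoff_delay`` and ``schedule_next_retry``, the constants,
the assumption that the first retry (``retry_count == 1``) waits the base
10 minutes, and the choice to apply jitter after the cap are all
hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone

BASE_DELAY = timedelta(minutes=10)  # first retry delay, from the text above
MAX_DELAY = timedelta(hours=12)     # cap, from the text above
JITTER = 0.25                       # +/-25% jitter, from the text above


def backoff_delay(retry_count, rng=random):
    """Return the wait before the next attempt (hypothetical helper).

    Assumes retry_count was already incremented, so 1 means "first
    retry" and maps to the base delay.
    """
    # Clamp the exponent so 2**n stays small; the cap applies long before 16.
    exponent = min(max(retry_count - 1, 0), 16)
    capped = min(BASE_DELAY * (2 ** exponent), MAX_DELAY)
    # Assumption: jitter is applied after the cap, so the result can
    # land up to 25% above or below the capped value.
    return capped * (1 + rng.uniform(-JITTER, JITTER))


def schedule_next_retry(retry_count, now=None, rng=random):
    """Return the timestamp that would be stored in next_retry_at."""
    now = now or datetime.now(timezone.utc)
    return now + backoff_delay(retry_count, rng)
```

Under these assumptions the un-jittered delays run 10, 20, 40, ... minutes
and reach the 12-hour cap at the eighth retry.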

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-batch.png

The mass-upgrade page above stays ``in-progress`` while one device is
still ``pending``, reporting ``2 complete, 1 pending`` and keeping the
batch open until the offline device is retried successfully or cancelled.
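
As a rough illustration of how a batch could derive that summary from its
child operations, the sketch below counts statuses and decides whether the
batch is still open. The helper names, and the assumption that "complete"
means any operation that is neither ``in-progress`` nor ``pending``, are
hypothetical and may differ from the actual aggregation code.

```python
from collections import Counter

# Assumption: these two statuses keep a batch open; all others are terminal.
ACTIVE_STATUSES = {"in-progress", "pending"}


def batch_is_open(statuses):
    """True while at least one child operation is still active."""
    return any(status in ACTIVE_STATUSES for status in statuses)


def progress_label(statuses):
    """Build a summary such as '2 complete, 1 pending'."""
    counts = Counter(statuses)
    complete = sum(
        n for status, n in counts.items() if status not in ACTIVE_STATUSES
    )
    label = f"{complete} complete"
    if counts["pending"]:
        label += f", {counts['pending']} pending"
    return label
```

For the batch in the screenshot, ``progress_label(["success", "success",
"pending"])`` would give ``2 complete, 1 pending`` and ``batch_is_open``
would stay true.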

See :doc:`upgrade-status` for the full operation state machine and the
meaning of the ``pending`` state.

Enabling from the admin
-----------------------

On the mass-upgrade confirmation page (reached from a build's *Upgrade*
action) the **persistent** checkbox is shown pre-checked. Leave it checked
to keep retrying offline devices, or uncheck it to fall back to the
behaviour where unreachable devices end as ``failed``.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-mass-upgrade-confirm.png

The flag is locked in once the mass upgrade leaves the ``idle`` state, so
it cannot be changed midway through a running batch.
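
The commit list for this pull request mentions that the lock is enforced
through a ``clean()`` guard on the model. The sketch below shows the same
rule with a plain dataclass and ``ValueError`` so that it runs without
Django; the class, method name and error message are stand-ins, not the
project's code.

```python
from dataclasses import dataclass


@dataclass
class MassUpgrade:
    """Stand-in for the batch model; not the project's actual class."""

    status: str = "idle"
    is_persistent: bool = True

    def set_persistent(self, value):
        # The flag may only change while the batch has not been launched.
        if self.status != "idle" and value != self.is_persistent:
            raise ValueError("is_persistent cannot be changed after launch")
        self.is_persistent = value
```

Changing the flag on an ``idle`` batch succeeds, while the same call on an
``in-progress`` batch raises, which mirrors the behaviour described above.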

Enabling via the REST API
~~~~~~~~~~~~~~~~~~~~~~~~~

The mass-upgrade endpoint accepts an ``is_persistent`` field that defaults
to ``true``; the single-device upgrade endpoint accepts the same field but
defaults to ``false``. See :doc:`rest-api` for the full request and
response reference.

Finding pending operations
--------------------------

Pending operations are listed in the upgrade-operation admin and can be
isolated with the ``status`` filter set to ``pending``. The list shows the
``persistent`` flag and the ``retry_count`` column, the latter being how
many times an operation has been retried so far.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-pending-changelist.png

An operation's detail page adds ``next_retry_at`` (when the next attempt
is scheduled) and a log that records each attempt, ending with the
backoff-scheduled ``persistent retry`` line for the next run.

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-operation-pending.png

Cancelling a pending operation
------------------------------

A pending operation is still active, so it can be cancelled the same way
as an in-progress one: from the admin cancel button or the REST cancel
endpoint. Cancelling stops the retry loop and moves the operation to
``cancelled``. A pending operation cannot be *deleted* until it reaches a
terminal state (see :ref:`deleting_upgrade_operations`).

Notifications
-------------

Two notifications keep operators informed about long-running persistent
upgrades:

- a **reminder** fires when a persistent batch still has pending children
  after the configured cadence has elapsed, and
- a **failure** notification fires when a persistent operation finally
  ends as ``failed`` (for example, the device was deactivated while
  pending).

Both are delivered to the organization's administrators (and superusers).

.. image:: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png
    :target: https://raw.githubusercontent.com/openwisp/openwisp-firmware-upgrader/docs/docs/images/1.4/persistent-upgrade-notifications.png

The cadence and related settings are documented in :doc:`settings`.

Behaviour with and without openwisp-monitoring
----------------------------------------------

Persistent upgrades work with Celery Beat alone: the periodic scan retries
due pending operations on a fixed cadence. Installing
``openwisp-monitoring`` adds a faster wake-up path: a device returning to
a healthy state triggers its pending retries immediately, without waiting
for the next scan. When ``openwisp-monitoring`` is not installed, the Beat
scan remains the only retry trigger.
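
The difference between the two triggers can be illustrated with a small
selection sketch. It is not the project's task code: the ``Operation``
dataclass and both helper names are hypothetical, and only the selection
rules (due time for the scan, device match for the wake-up path) come from
the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Operation:
    """Minimal stand-in for an upgrade operation record."""

    id: int
    device: str
    status: str
    next_retry_at: Optional[datetime] = None


def due_operations(operations, now):
    """Beat scan path: pending operations whose retry time has elapsed."""
    return [
        op
        for op in operations
        if op.status == "pending"
        and op.next_retry_at is not None
        and op.next_retry_at <= now
    ]


def pending_for_device(operations, device):
    """Monitoring path: a device's pending operations, regardless of
    next_retry_at, because the device has just become healthy again."""
    return [
        op for op in operations if op.status == "pending" and op.device == device
    ]
```

The scan only picks up operations that are already due, whereas the
wake-up path ignores the scheduled time for the device that has recovered.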

The periodic tasks (``check_pending_upgrades`` and
``send_pending_upgrade_reminders``) must be present in the deployment's
``CELERY_BEAT_SCHEDULE``; see :doc:`settings`.
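
A schedule entry for these two tasks might look like the fragment below.
Only the task names come from this page; the dotted module path
(``openwisp_firmware_upgrader.tasks``) and both intervals are assumptions
made for illustration, so the values in the settings documentation take
precedence.

```python
from datetime import timedelta

CELERY_BEAT_SCHEDULE = {
    "check_pending_upgrades": {
        # Assumed dotted path; confirm it against the settings page.
        "task": "openwisp_firmware_upgrader.tasks.check_pending_upgrades",
        "schedule": timedelta(minutes=5),  # illustrative interval
    },
    "send_pending_upgrade_reminders": {
        # Assumed dotted path; confirm it against the settings page.
        "task": "openwisp_firmware_upgrader.tasks.send_pending_upgrade_reminders",
        "schedule": timedelta(hours=24),  # illustrative interval
    },
}
```

In an existing deployment these entries would be merged into the
``CELERY_BEAT_SCHEDULE`` dictionary already defined in the Django settings
rather than replacing it.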
This new section says pending operations can be cancelled, but the cancellation section later still says cancellation is possible only while the status is `in-progress`. I would update that bullet to mention both `in-progress` and `pending`.
Updated the cancellation bullet and the cancel-button line to mention both `in-progress` and `pending`.