[WIP] zedmanager: fix purge stuck when app was never activated#5742
Draft
eriknordmark wants to merge 1 commit into lf-edge:master from
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5742      +/-   ##
==========================================
+ Coverage   19.52%   26.56%   +7.03%
==========================================
  Files          19       24       +5
  Lines        3021     4213    +1192
==========================================
+ Hits          590     1119     +529
- Misses       2310     2872     +562
- Partials      121      222     +101
When a purge is triggered for an app that never started (e.g. its image failed to download), doUpdate transitions PurgeInprogress from DownloadAndVerify to BringDown and calls doRemove. Because no domain was ever created, doRemove returns done=true immediately. Without this fix the code fell through with PurgeInprogress still set to BringDown, skipped the RecreateVolumes check, and entered doPrepare/doActivate without ever calling purgeCmdDone. This left the app stuck at state LOADED with VerifyOnly=true on the VolumeRefConfig.

During DownloadAndVerify, doInstall publishes VolumeRefConfig with VerifyOnly=true so that volumemgr downloads and verifies the image without yet creating the volume. The transition to VerifyOnly=false is only made by doInstall when PurgeInprogress is NotInprogress or RecreateVolumes. Because the RecreateVolumes→BringUp transition fired in the same doUpdate call before doInstall ran with RecreateVolumes, VerifyOnly was never cleared, volumemgr never created the volume, and the app remained stuck at LOADED indefinitely.

Fix by using BringDown only as an internal sentinel for doRemove/doInactivate (so it treats the operation as a purge rather than an uninstall). If doRemove completes immediately there is nothing to halt, so skip persisting BringDown and transition directly to RecreateVolumes, calling purgeCmdDone to save the purge counter and clean up stale volume refs. Then re-run doInstall with RecreateVolumes so that VerifyOnly is cleared and volumemgr is told to create the volume before the RecreateVolumes→BringUp transition fires. HALTING is now only recorded when a domain is genuinely being waited on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: eriknordmark <erik@zededa.com>
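The VerifyOnly ordering issue described in the commit message can be illustrated with a toy model. This is only a sketch and not the zedmanager code: the `Inprogress` type, its constants, and the helpers `verifyOnlyFor` and `purgeNeverActivated` are hypothetical stand-ins named after the states mentioned above, and it assumes VerifyOnly is derived purely from the purge phase at the time doInstall runs.

```go
package main

import "fmt"

// Inprogress is a hypothetical stand-in for the purge phase kept in the
// app instance status; the real type in zedmanager may look different.
type Inprogress int

const (
	NotInprogress Inprogress = iota
	DownloadAndVerify
	BringDown
	RecreateVolumes
	BringUp
)

// verifyOnlyFor models the rule from the commit message: VerifyOnly is
// cleared only when the phase is NotInprogress or RecreateVolumes.
func verifyOnlyFor(p Inprogress) bool {
	return p != NotInprogress && p != RecreateVolumes
}

// purgeNeverActivated walks the phases for an app that has no domain and
// returns the final VerifyOnly value. installDuringRecreate=false models
// the old ordering, where the install step never ran with RecreateVolumes.
func purgeNeverActivated(installDuringRecreate bool) bool {
	phase := DownloadAndVerify
	verifyOnly := verifyOnlyFor(phase) // install step publishes VerifyOnly=true

	phase = RecreateVolumes // nothing to halt, so BringDown is not persisted
	if installDuringRecreate {
		verifyOnly = verifyOnlyFor(phase) // re-run install: VerifyOnly cleared
	}

	phase = BringUp // RecreateVolumes -> BringUp fires in the same update
	_ = phase
	return verifyOnly
}

func main() {
	fmt.Println("old ordering, VerifyOnly:", purgeNeverActivated(false))
	fmt.Println("fixed ordering, VerifyOnly:", purgeNeverActivated(true))
}
```

In this model the old ordering ends with VerifyOnly still true, which corresponds to volumemgr never being told to create the volume, while the fixed ordering ends with VerifyOnly false.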
192fffb to 423f47e
DO NOT REVIEW YET - TESTS IN PROGRESS
Description
#5618 introduced a regression in 16.9.0: it did not account for an application that has never started and is then purged. The old ztests collection has a test case for this (test_app_purge_bad), but the Eden tests do not appear to have an equivalent.
This PR addresses that.
When a purge is triggered for an app that never started (e.g. its image failed to download), doUpdate transitions PurgeInprogress from DownloadAndVerify to BringDown and calls doRemove. Because no domain was ever created, doRemove returns done=true immediately. Without this fix the code fell through with PurgeInprogress still set to BringDown, skipped the RecreateVolumes check, and entered doPrepare/doActivate without ever calling purgeCmdDone. The BringDown state left in the published status then caused every subsequent updateAIStatusUUID call to route to removeAIStatus, which would tear down the prematurely created domain and leave the app oscillating indefinitely.
Fix by using BringDown only as an internal sentinel for doRemove/doInactivate (so it treats the operation as a purge rather than an uninstall). If doRemove completes immediately there is nothing to halt, so skip persisting BringDown and transition directly to RecreateVolumes, calling purgeCmdDone to save the purge counter and clean up stale volume refs. HALTING is now only recorded when a domain is genuinely being waited on.
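The intended transition can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the code in this PR: `appStatus`, its fields, and `startBringDown` are hypothetical, and `doRemove` and `purgeCmdDone` here are toy stand-ins that only mimic the behaviour described above.

```go
package main

import "fmt"

// Inprogress is a hypothetical stand-in for the purge phase.
type Inprogress int

const (
	NotInprogress Inprogress = iota
	DownloadAndVerify
	BringDown
	RecreateVolumes
	BringUp
)

// appStatus is a hypothetical, heavily reduced app instance status.
type appStatus struct {
	PurgeInprogress Inprogress
	DomainExists    bool // whether a domain was ever created
	Halting         bool // stands in for recording the HALTING state
	PurgeCmdDone    bool // stands in for the effects of purgeCmdDone
}

// doRemove (toy version) reports done=true when there is no domain to halt.
func doRemove(s *appStatus) (done bool) {
	return !s.DomainExists
}

// purgeCmdDone (toy version) marks that the purge bookkeeping has run.
func purgeCmdDone(s *appStatus) {
	s.PurgeCmdDone = true
}

// startBringDown models the fixed transition out of DownloadAndVerify:
// BringDown is persisted only when a domain actually has to be halted.
func startBringDown(s *appStatus) {
	if s.PurgeInprogress != DownloadAndVerify {
		return
	}
	if doRemove(s) {
		// Nothing to halt: finish the purge bookkeeping and move on
		// directly to RecreateVolumes.
		purgeCmdDone(s)
		s.PurgeInprogress = RecreateVolumes
		return
	}
	// A domain exists, so wait for it to halt.
	s.PurgeInprogress = BringDown
	s.Halting = true
}

func main() {
	neverStarted := &appStatus{PurgeInprogress: DownloadAndVerify}
	startBringDown(neverStarted)
	fmt.Printf("never started: %+v\n", *neverStarted)

	running := &appStatus{PurgeInprogress: DownloadAndVerify, DomainExists: true}
	startBringDown(running)
	fmt.Printf("running:       %+v\n", *running)
}
```

In this sketch the never-started app goes straight to RecreateVolumes with the purge bookkeeping done and no HALTING recorded, while the running app persists BringDown and waits for its domain.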
How to test and validate this PR
Create an edge app whose image has a bad URL (or a bad sha256).
Deploy an app instance using that edge app; it should fail due to a download or verify failure.
Then update the edge app to use an image with a working URL (and a matching sha256) and purge the instance so that it picks up the new version. The app instance should then reach the running/online state.
Changelog notes
None (fixes a regression introduced in 16.9.0, hence invisible to LTS users)
PR Backports
Checklist
I've provided a proper description
I've added the proper documentation
I've tested my PR on amd64 device
I've tested my PR on arm64 device
I've written the test verification instructions
I've set the proper labels to this PR
I've checked the boxes above, or I've provided a good reason why I didn't check them.
Please, check the boxes above after submitting the PR in interactive mode.