[WIP] zedmanager: fix purge stuck when app was never activated #5742

Draft
eriknordmark wants to merge 1 commit into lf-edge:master from eriknordmark:persist2memory

Conversation

@eriknordmark
Contributor

DO NOT REVIEW YET - TESTS IN PROGRESS

Description

#5618 introduced a regression in 16.9.0: it did not account for an application that has never started and is then purged. The old ztests collection has a test case for this (test_app_purge_bad), but the Eden tests do not seem to have an equivalent.

This PR addresses that.

When a purge is triggered for an app that never started (e.g. its image failed to download), doUpdate transitions PurgeInprogress from DownloadAndVerify to BringDown and calls doRemove. Because no domain was ever created, doRemove returns done=true immediately. Without this fix the code fell through with PurgeInprogress still set to BringDown, skipped the RecreateVolumes check, and entered doPrepare/doActivate without ever calling purgeCmdDone. The BringDown state left in the published status then caused every subsequent updateAIStatusUUID call to route to removeAIStatus, which would tear down the prematurely created domain and leave the app oscillating indefinitely.

Fix by using BringDown only as an internal sentinel for doRemove/doInactivate (so it treats the operation as a purge rather than an uninstall). If doRemove completes immediately there is nothing to halt, so skip persisting BringDown and transition directly to RecreateVolumes, calling purgeCmdDone to save the purge counter and clean up stale volume refs. HALTING is now only recorded when a domain is genuinely being waited on.
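
The fixed transition can be sketched roughly as below. This is a simplified model, not the zedmanager code: the `appStatus` struct, its `DomainExists` and `PurgeDone` fields, and the `advancePurge` helper are hypothetical, and `doRemove` and `purgeCmdDone` are reduced stand-ins that only mimic the behaviour described above.

```go
package main

import "fmt"

// Inprogress is a simplified stand-in for the purge progress states
// named in the description.
type Inprogress int

const (
	NotInprogress Inprogress = iota
	DownloadAndVerify
	BringDown
	RecreateVolumes
	BringUp
)

// appStatus is a hypothetical, reduced view of the published app status.
type appStatus struct {
	PurgeInprogress Inprogress
	DomainExists    bool // false when the app was never activated
	PurgeDone       bool // set by the purgeCmdDone stand-in
}

// doRemove is a stand-in: it reports done=true at once when no domain
// exists, because there is nothing to halt.
func doRemove(s *appStatus) (done bool) {
	return !s.DomainExists
}

// purgeCmdDone is a stand-in for saving the purge counter and cleaning
// up stale volume refs.
func purgeCmdDone(s *appStatus) {
	s.PurgeDone = true
}

// advancePurge models the step taken once download and verify finished.
func advancePurge(s *appStatus) {
	if s.PurgeInprogress != DownloadAndVerify {
		return
	}
	// BringDown tells doRemove that this is a purge, not an uninstall.
	s.PurgeInprogress = BringDown
	if doRemove(s) {
		// Nothing to halt: do not leave BringDown behind, finish the
		// purge bookkeeping and move on to RecreateVolumes.
		purgeCmdDone(s)
		s.PurgeInprogress = RecreateVolumes
		return
	}
	// A domain is still being halted, so BringDown stays recorded.
}

func main() {
	never := &appStatus{PurgeInprogress: DownloadAndVerify}
	advancePurge(never)
	fmt.Println(never.PurgeInprogress == RecreateVolumes, never.PurgeDone)

	running := &appStatus{PurgeInprogress: DownloadAndVerify, DomainExists: true}
	advancePurge(running)
	fmt.Println(running.PurgeInprogress == BringDown, running.PurgeDone)
}
```

In this model the never-activated app leaves the step in RecreateVolumes with the purge bookkeeping done, while an app with a live domain keeps BringDown until the halt completes.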

How to test and validate this PR

Create an edge app with an image with a bad URL (or bad sha256).
Deploy an app instance using that edge app - it should fail due to a download or verify failure.
Then update the edge app to use an image with a working URL (and a matching sha256) and purge the instance to make it pick up that new version. That should make the app instance get to running/online state.

Changelog notes

None (fixes a regression introduced in 16.9.0, hence invisible to LTS users)

PR Backports

  • 16.0-stable No
  • 14.5-stable No
  • 13.4-stable No

Checklist

  • I've provided a proper description

  • I've added the proper documentation

  • I've tested my PR on amd64 device

  • I've tested my PR on arm64 device

  • I've written the test verification instructions

  • I've set the proper labels to this PR

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@eriknordmark eriknordmark requested review from rene and rucoder April 3, 2026 16:24
@eriknordmark eriknordmark requested a review from rouming as a code owner April 3, 2026 16:24
@eriknordmark eriknordmark marked this pull request as draft April 3, 2026 16:24
@github-actions github-actions bot requested a review from OhmSpectator April 3, 2026 16:25
@codecov

codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 26.56%. Comparing base (2281599) to head (423f47e).
⚠️ Report is 439 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5742      +/-   ##
==========================================
+ Coverage   19.52%   26.56%   +7.03%     
==========================================
  Files          19       24       +5     
  Lines        3021     4213    +1192     
==========================================
+ Hits          590     1119     +529     
- Misses       2310     2872     +562     
- Partials      121      222     +101     

When a purge is triggered for an app that never started (e.g. its image
failed to download), doUpdate transitions PurgeInprogress from
DownloadAndVerify to BringDown and calls doRemove.  Because no domain
was ever created, doRemove returns done=true immediately.  Without this
fix the code fell through with PurgeInprogress still set to BringDown,
skipped the RecreateVolumes check, and entered doPrepare/doActivate
without ever calling purgeCmdDone.

This left the app stuck at state LOADED with VerifyOnly=true on the
VolumeRefConfig.  During DownloadAndVerify, doInstall publishes
VolumeRefConfig with VerifyOnly=true so that volumemgr downloads and
verifies the image without yet creating the volume.  The transition to
VerifyOnly=false is only made by doInstall when PurgeInprogress is
NotInprogress or RecreateVolumes.  Because the RecreateVolumes→BringUp
transition fired in the same doUpdate call before doInstall ran with
RecreateVolumes, VerifyOnly was never cleared, volumemgr never created
the volume, and the app remained stuck at LOADED indefinitely.

Fix by using BringDown only as an internal sentinel for
doRemove/doInactivate (so it treats the operation as a purge rather than
an uninstall).  If doRemove completes immediately there is nothing to
halt, so skip persisting BringDown and transition directly to
RecreateVolumes, calling purgeCmdDone to save the purge counter and
clean up stale volume refs.  Then re-run doInstall with RecreateVolumes
so that VerifyOnly is cleared and volumemgr is told to create the
volume before the RecreateVolumes→BringUp transition fires.  HALTING is
now only recorded when a domain is genuinely being waited on.
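
The VerifyOnly ordering problem and the re-run of doInstall can be illustrated with a small model. This is a sketch under assumptions, not the actual code: `volumeRef` is a reduced stand-in for VolumeRefConfig, `installPass` models a single doInstall pass using only the rules stated above, and `purgeNeverStarted` is a hypothetical helper that replays the purge of an app that never started.

```go
package main

import "fmt"

// Inprogress is a simplified stand-in for the purge progress states.
type Inprogress int

const (
	NotInprogress Inprogress = iota
	DownloadAndVerify
	BringDown
	RecreateVolumes
	BringUp
)

// volumeRef is a hypothetical, reduced stand-in for VolumeRefConfig.
type volumeRef struct {
	VerifyOnly bool
}

// installPass models one doInstall pass: VerifyOnly is set during
// DownloadAndVerify and cleared only in NotInprogress or RecreateVolumes.
// Any other state leaves the flag unchanged.
func installPass(p Inprogress, ref *volumeRef) {
	switch p {
	case DownloadAndVerify:
		ref.VerifyOnly = true
	case NotInprogress, RecreateVolumes:
		ref.VerifyOnly = false
	}
}

// purgeNeverStarted replays the purge of an app that never started.
// rerunInstall selects the fixed ordering, where doInstall runs again
// while the state is still RecreateVolumes.
func purgeNeverStarted(rerunInstall bool) volumeRef {
	ref := volumeRef{}
	p := DownloadAndVerify
	installPass(p, &ref) // download and verify only

	p = RecreateVolumes // doRemove completed immediately
	if rerunInstall {
		installPass(p, &ref) // clears VerifyOnly before BringUp
	}

	p = BringUp
	installPass(p, &ref) // later passes no longer touch the flag
	return ref
}

func main() {
	fmt.Println("without re-run, VerifyOnly:", purgeNeverStarted(false).VerifyOnly)
	fmt.Println("with re-run, VerifyOnly:", purgeNeverStarted(true).VerifyOnly)
}
```

Without the extra pass the flag stays true in this model, which corresponds to volumemgr never creating the volume; with the extra pass it is cleared before the state moves to BringUp.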

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: eriknordmark <erik@zededa.com>