Don't fail Supervisor setup when an app image is missing#6816
Conversation
Stack trace of the original issue:
With this change, the Supervisor handles this error (and other similar ones) more gracefully: |
sairon
left a comment
There was a problem hiding this comment.
A missing builder image (e.g. docker:29.4.2-cli, when the host's exact Docker patch version has no matching -cli tag published on Docker Hub)
Note that this is a situation that shouldn't normally happen. But because Docker messed up something in their packaging, their CI is failing and 29.4.2 images are missing on Docker Hub. Normally they're published within a day since publishing. Also, OS build will fail too if those images are not published, so this only affects early adopters on Supervised and dev environments.
|
For everyone who found this issue and needs a quick workaround: docker pull docker:29.4.1-cli
docker tag docker:29.4.1-cli docker:29.4.2-cliOf course, this is a ugly fix and shouldn't be used long term. To remove the tag, just run |
|
29.4.3 was released yesterday and it's making its way to the Docker Hub: docker-library/docker@85f8094 So hopefully it should be resolved within a day or two. |
mdegat01
left a comment
There was a problem hiding this comment.
I guess comment is best here. I don't know if changes are needed but I do need to confirm some stuff, namely about whether we actually want to delay image builds until after setup or just want to keep the exceptions from breaking setup. Because currently we're only doing the latter with this change, image builds will still occur during setup just at a different step.
| # Docker error other than a clean "image not found" - we can't | ||
| # tell whether the image is actually missing. Log and leave the | ||
| # addon detached; a future load will reattempt and surface a | ||
| # MISSING_IMAGE repair if appropriate. |
There was a problem hiding this comment.
This references a "future load" that will fix this. But there is no future load. There's only 3 times we call load right now:
- Startup of Supervisor we call it for each installed addon
- On install of a new app
- On restore of an app (but only if it was newly installed so really this is still 2 )
Could be just this comment is incorrect which is nbd but wanted to make sure there wasn't a misunderstanding. Currently if this load/attach process fails there is no fallback/retry mechanism in place. If we need that now, we have to add that.
There was a problem hiding this comment.
Right, this needs a manual interaction. We don't raise a repair for this issue currently, I don't think its worth the effort this is a corner case. Users affected can just go to the app page and trigger a rebuild 🤷 .
| # Dockerfile or unavailable base/builder image; disk full; bad | ||
| # credentials; registry rate limit). Surface as a fixup error so | ||
| # FixupBase swallows it without a Sentry event. The repair stays | ||
| # available for manual retry once the underlying cause is fixed. |
There was a problem hiding this comment.
Something I realized looking at this - there's only one instance of this class created total. Its created at Supervisor startup and we use the same instance during the entire time Supervisor is running. So self.attempts is never reset, once you get 5 failures then this is just a manual fixup until Supervisor is restarted.
If we're trying to improve this fixup, maybe we want to reset attempts on success? Or make give each addon its own attempts count using a dictionary? Existing issue so doesn't have to be tackled here, just noting it.
There was a problem hiding this comment.
Yeah since this is a rather corner case issue I'd rather prefer to not add more complexity.
| # Don't run a local build during setup. Surface a repair so | ||
| # the resolution autofix loop can handle it off the critical | ||
| # path. | ||
| self._create_missing_image_issue() |
There was a problem hiding this comment.
This won't actually move this logic out of setup btw. It will move it out of setup of AddonManager but ResolutionManager is loaded afterwards here:
Line 184 in 78d3bb9
And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup. But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:
supervisor/supervisor/resolution/fixups/addon_execute_repair.py
Lines 64 to 67 in 78d3bb9
To something like this:
@property
def auto(self) -> bool:
"""Return if a fixup can be apply as auto fix."""
return self.sys_core.state not in CoreState.SETUP and self.attempts < MAX_AUTO_ATTEMPTSOr provide a fixed list of states you want CoreState to be in.
Bear in mind though, there is currently no other healthcheck between the end of SETUP and when apps are started during STARTUP. So currently any addons which exit SETUP without their image available will effectively have boot disabled. Since they will fail to start during boot and then will have to manually started after. Unless we add another healthcheck in at the top of Core.start.
Which on that note, we should probably temporarily disable boot on any addons which we have decided we cannot download or build an image for right now. Else we'll just try again during STARTUP and fail again.
There was a problem hiding this comment.
And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup.
It is certainly the main aim of this PR.
But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:
So that came as an afterthought: How often do we even need to build on setup? I encountered this on my development system, where I had an app which no longer builds. I've cleaned Docker images at one point, that is probably why I've started running into it. Once you have such a non-building app, you'll encounter it on every startup, and it will slowdown the start. So I felt like let's punt this.
I have no idea how often users run into this, probably almost never. If build fails on install, we rollback the installation of an app, so normally users should not encounter this at all.
From what I can tell this is really a corner case scenario, so I can life with either approach.
There was a problem hiding this comment.
This won't actually move this logic out of setup btw. It will move it out of setup of
AddonManagerbutResolutionManageris loaded afterwards here:
Actually, it does: run_autofix has JobCondition.RUNNING. So the code as is already defers to running.
There was a problem hiding this comment.
I missed that, that covers it then.
A missing builder image (docker:<version>-cli) during a build-required app load aborted Supervisor setup entirely, leaving the system stuck in setup state where every subsequent operation was blocked by the not-healthy guard. Triggered in practice when the host's Docker patch version had no matching `-cli` tag published on Docker Hub. Two issues compounded the failure: `images.pull` in `run_command` leaked a raw `aiodocker.DockerError` past the `@Job` decorator, which rewrapped it as `JobException` and bypassed the `suppress(DockerError, ...)` guard in `addon.load()`; and the load path treated all Docker errors the same whether the image was simply missing or the daemon itself was misbehaving. Wrap the pull error in `run_command` so it propagates as Supervisor's `DockerError` (a `HassioError`) and is preserved by the decorator. Distinguish 404s in `attach()` and `check_image()` by raising `DockerNotFound`/`DockerAPIError` instead of generic `DockerError`. In `addon.load()`, only the `DockerNotFound` path is treated as "image missing": for build-required apps we skip the inline build and surface a `MISSING_IMAGE` repair so the resolution autofix loop handles it off the critical path; for pull-based apps we still attempt install during load and create the repair on failure. Other `DockerError`s (daemon trouble or a failed internal install in `check_image`) are logged at CRITICAL — which the Sentry logging integration captures — and the addon is left detached rather than masked as a misleading missing-image repair. In the autofix path, swallow `DockerBuildError`, `DockerNoSpaceOnDevice`, `DockerRegistryAuthError`, and `DockerRegistryRateLimitExceeded` as `ResolutionFixupError` so they don't generate Sentry events on every retry. The repair stays available for manual retry once the underlying cause (registry tag published, disk freed, credentials fixed, rate limit expired) is resolved.
afc1165 to
183e66f
Compare
The comment claimed "a future load will reattempt and surface a MISSING_IMAGE repair if appropriate", but App.load() is only called at Supervisor startup, on fresh install, and on backup restore — there is no automatic retry mechanism. Reword to match reality: the CRITICAL log captures the issue for diagnostics (Sentry), and the user can trigger a manual repair once the daemon is healthy.
9aa6665 to
cbf75ba
Compare
Proposed change
A missing builder image (e.g.
docker:29.4.2-cli, when the host's exact Docker patch version has no matching-clitag published on Docker Hub) during a build-required app load aborted Supervisor setup entirely. The system was left in setup state where every subsequent operation was blocked by the not-healthy guard. Recovery required either Docker Hub publishing the tag or a manual workaround.Two issues compounded the failure:
images.pullinDockerAPI.run_commandleaked a rawaiodocker.DockerErrorpast the@Jobdecorator. Sinceaiodocker.DockerErroris not aHassioError, the decorator rewrapped it asJobException, which then bypassed thesuppress(DockerError, ...)guard inApp.load()that was designed to keep one bad app from killing setup.App.load()treated all Docker errors the same — a 404 "image not in cache" was indistinguishable from a "daemon is sick" 5xx, so a real install attempt could fall through into thewith suppress(...)and silently succeed-or-fail without surfacing anything to the user.This PR addresses both:
run_commandso it propagates as Supervisor'sDockerError(aHassioError) and is preserved unchanged by the@Jobdecorator.DockerInterface.attach()andDockerInterface.check_image()by raisingDockerNotFound/DockerAPIErrorinstead of genericDockerError.App.load(), only theDockerNotFoundpath is treated as "image missing":MISSING_IMAGErepair (withEXECUTE_REPAIRsuggestion) is created so the resolution autofix loop handles it off the setup critical path.DockerErrors (daemon trouble, or a failed internal install insidecheck_image's arch-mismatch path) are logged atCRITICAL— which the Sentry logging integration captures as an event — and the app is left detached. We deliberately do not raise aMISSING_IMAGErepair in that case because it would promise a fix the autofix can't deliver (those errors are not resolved by retryinginstall()).FixupAppExecuteRepair, swallowDockerBuildError,DockerNoSpaceOnDevice,DockerRegistryAuthError, andDockerRegistryRateLimitExceededasResolutionFixupErrorso they don't generate a Sentry event on every retry. The repair stays available for manual retry once the underlying cause (registry tag published, disk freed, credentials fixed, rate limit expired) is resolved.Type of change
Additional information
Checklist
ruff format supervisor tests)If API endpoints or add-on configuration are added/changed: