Don't fail Supervisor setup when an app image is missing by agners · Pull Request #6816 · home-assistant/supervisor

agners · 2026-05-06T14:57:29Z

Proposed change

A missing builder image (e.g. docker:29.4.2-cli, when the host's exact Docker patch version has no matching -cli tag published on Docker Hub) during a build-required app load aborted Supervisor setup entirely. The system was left in setup state where every subsequent operation was blocked by the not-healthy guard. Recovery required either Docker Hub publishing the tag or a manual workaround.

Two issues compounded the failure:

images.pull in DockerAPI.run_command leaked a raw aiodocker.DockerError past the @Job decorator. Since aiodocker.DockerError is not a HassioError, the decorator rewrapped it as JobException, which then bypassed the suppress(DockerError, ...) guard in App.load() that was designed to keep one bad app from killing setup.
App.load() treated all Docker errors the same — a 404 "image not in cache" was indistinguishable from a "daemon is sick" 5xx, so a real install attempt could fall through into the with suppress(...) and silently succeed-or-fail without surfacing anything to the user.
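
A minimal sketch of the first issue, using hypothetical stand-in classes rather than the real Supervisor or aiodocker types: a simplified job wrapper passes `HassioError` subclasses through and rewraps anything else, so a raw client error escapes a `suppress(DockerError)` guard while a wrapped one does not. The names `job`, `RawClientError`, `install_leaky`, `install_wrapped`, and `load` are assumptions made for illustration.

```python
import asyncio
from contextlib import suppress
from functools import wraps


class HassioError(Exception):
    """Stand-in for Supervisor's base exception."""


class DockerError(HassioError):
    """Stand-in for Supervisor's DockerError."""


class JobException(HassioError):
    """Stand-in for what the job wrapper raises for unknown errors."""


class RawClientError(Exception):
    """Stand-in for aiodocker.DockerError, which is not a HassioError."""


def job(method):
    """Simplified wrapper: keep HassioError as-is, rewrap everything else."""

    @wraps(method)
    async def wrapper(*args, **kwargs):
        try:
            return await method(*args, **kwargs)
        except HassioError:
            raise
        except Exception as err:
            raise JobException() from err

    return wrapper


@job
async def install_leaky():
    # The raw client error leaks out of the method body.
    raise RawClientError("[404] manifest unknown")


@job
async def install_wrapped():
    # The raw client error is converted before it reaches the wrapper.
    try:
        raise RawClientError("[404] manifest unknown")
    except RawClientError as err:
        raise DockerError(f"Can't pull image: {err}") from err


async def load(install) -> str:
    """Mimic a load step that tolerates DockerError only."""
    try:
        with suppress(DockerError):
            await install()
    except JobException:
        return "setup aborted"
    return "setup continues"
```

Running `asyncio.run(load(install_leaky))` takes the `JobException` path, while `asyncio.run(load(install_wrapped))` is absorbed by the guard.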

This PR addresses both:

Wrap the pull error in run_command so it propagates as Supervisor's DockerError (a HassioError) and is preserved unchanged by the @Job decorator.
Distinguish 404s in DockerInterface.attach() and DockerInterface.check_image() by raising DockerNotFound/DockerAPIError instead of generic DockerError.
In App.load(), only the DockerNotFound path is treated as "image missing":
- For build-required apps the inline build is skipped and a MISSING_IMAGE repair (with EXECUTE_REPAIR suggestion) is created so the resolution autofix loop handles it off the setup critical path.
- For pull-based apps the install is still attempted during load and the repair is created on failure, preserving the existing recovery behavior.
Other DockerErrors (daemon trouble, or a failed internal install inside check_image's arch-mismatch path) are logged at CRITICAL — which the Sentry logging integration captures as an event — and the app is left detached. We deliberately do not raise a MISSING_IMAGE repair in that case because it would promise a fix the autofix can't deliver (those errors are not resolved by retrying install()).
In FixupAppExecuteRepair, swallow DockerBuildError, DockerNoSpaceOnDevice, DockerRegistryAuthError, and DockerRegistryRateLimitExceeded as ResolutionFixupError so they don't generate a Sentry event on every retry. The repair stays available for manual retry once the underlying cause (registry tag published, disk freed, credentials fixed, rate limit expired) is resolved.
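
The load-time decision described in the list above could be sketched roughly as follows. This is a synchronous, simplified illustration with hypothetical helpers (`error_for_status`, `load_app`, `raiser`) and string outcomes; the real `App.load()` is asynchronous and its exact structure is not shown in this PR description. Mapping every non-404 status to `DockerAPIError` is an assumption.

```python
class DockerError(Exception):
    """Stand-in for Supervisor's generic Docker error."""


class DockerNotFound(DockerError):
    """The daemon answered cleanly: the image or container does not exist."""


class DockerAPIError(DockerError):
    """Any other Docker API failure, e.g. a 5xx from an unhealthy daemon."""


def error_for_status(status: int, message: str) -> DockerError:
    """Assumed mapping: 404 means 'not found', everything else is an API error."""
    if status == 404:
        return DockerNotFound(message)
    return DockerAPIError(message)


def load_app(attach, install, need_build: bool) -> str:
    """Return a label describing what the load step decided to do."""
    try:
        attach()
        return "attached"
    except DockerNotFound:
        if need_build:
            # Skip the inline build; leave the work to the repair.
            return "repair_created"
        try:
            install()
        except DockerError:
            # Pull failed: fall back to the repair as before.
            return "repair_created"
        return "installed"
    except DockerError:
        # Cannot tell whether the image is missing: log and leave detached.
        return "detached"


def raiser(exc: Exception):
    """Build a callable that raises the given exception (test helper)."""

    def _call():
        raise exc

    return _call
```

For example, `load_app(raiser(error_for_status(500, "daemon error")), lambda: None, True)` yields `"detached"` and creates no repair, which mirrors the reasoning that a retry of `install()` would not fix that class of error.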

Type of change

Dependency upgrade
Bugfix (non-breaking change which fixes an issue)
New feature (which adds functionality to the supervisor)
Breaking change (fix/feature causing existing functionality to break)
Code quality improvements to existing code or addition of tests

Additional information

This PR fixes or closes issue: fixes #
This PR is related to issue:
Link to documentation pull request:
Link to cli pull request:
Link to client library pull request:

Checklist

The code change is tested and works locally.
Local tests pass. Your PR cannot be merged unless tests pass
There is no commented out code in this PR.
I have followed the development checklist
The code has been formatted using Ruff (ruff format supervisor tests)
Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Documentation added/updated for developers.home-assistant.io
CLI updated (if necessary)
Client library updated (if necessary)

agners · 2026-05-06T14:59:38Z

Stack trace of the original issue:

2026-05-06 14:11:01.499 ERROR (MainThread) [supervisor.jobs] Unhandled exception: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/addons/addon.py", line 257, in load
    await self.instance.attach(version=self.version)
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 307, in wrapper
    raise err
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 299, in wrapper
    return await method(obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/docker/interface.py", line 457, in attach
    raise DockerError(
        f"Could not get metadata on container or image for {self.name}"
    )
supervisor.exceptions.DockerError: Could not get metadata on container or image for addon_f4f71350_ewelink_smart_home_slug

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/docker/manager.py", line 641, in run_command
    await self.images.inspect(f"{image}:{tag}")
  File "/usr/local/lib/python3.14/site-packages/aiodocker/images.py", line 48, in inspect
    response = await self.docker._query_json(f"images/{name}/json")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 541, in _query_json
    async with self._query(
               ~~~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        versioned_api=versioned_api,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ) as response:
    ^
  File "/usr/local/lib/python3.14/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 433, in _query
    yield await self._do_query(
          ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
    )
    ^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 514, in _do_query
    raise DockerError(response.status, data["message"])
aiodocker.exceptions.DockerError: [404] No such image: docker:29.4.2-cli

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 299, in wrapper
    return await method(obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 673, in install
    await self._build(version, image)
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 739, in _build
    result = await self.sys_docker.run_command(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
    )
    ^
  File "/usr/src/supervisor/supervisor/docker/manager.py", line 645, in run_command
    await self.images.pull(image, tag=tag)
  File "/usr/local/lib/python3.14/site-packages/aiodocker/images.py", line 154, in _handle_list
    async with cm as response:
               ^^
  File "/usr/local/lib/python3.14/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 433, in _query
    yield await self._do_query(
          ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
    )
    ^
  File "/usr/local/lib/python3.14/site-packages/aiodocker/docker.py", line 514, in _do_query
    raise DockerError(response.status, data["message"])
aiodocker.exceptions.DockerError: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown

With this change, the Supervisor handles this error (and other similar ones) more gracefully:

2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.addons.addon] No f4f71350_ewelink_smart_home_slug app Docker image f4f71350/amd64-addon-ewelink_smart_home_slug found
2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.resolution.module] Create new suggestion execute_repair - addon / f4f71350_ewelink_smart_home_slug
2026-05-06 14:43:08.842 INFO (MainThread) [supervisor.resolution.module] Create new issue missing_image - addon / f4f71350_ewelink_smart_home_slug
...
2026-05-06 15:43:13.948 ERROR (MainThread) [supervisor.docker.manager] Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
2026-05-06 15:43:13.948 ERROR (MainThread) [supervisor.docker.addon] Can't build f4f71350/amd64-addon-ewelink_smart_home_slug:1.4.6: Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
...
2026-05-06 15:43:13.949 WARNING (MainThread) [supervisor.resolution.fixup] Error during processing execute_repair: Can't build f4f71350/amd64-addon-ewelink_smart_home_slug:1.4.6: Can't pull image docker:29.4.2-cli: [404] manifest for docker:29.4.2-cli not found: manifest unknown: manifest unknown
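
The final WARNING line is consistent with the fixup converting expected build and registry failures into a fixup error that the caller logs instead of reporting. A minimal sketch under that assumption, with stand-in exception classes and hypothetical functions `execute_repair` and `run_fixup` (the real `FixupAppExecuteRepair` and `FixupBase` are not reproduced here):

```python
import logging

_LOGGER = logging.getLogger(__name__)


class DockerError(Exception):
    """Stand-in for Supervisor's generic Docker error."""


class DockerBuildError(DockerError):
    """Stand-in: the local image build failed."""


class DockerNoSpaceOnDevice(DockerError):
    """Stand-in: the disk is full."""


class DockerRegistryAuthError(DockerError):
    """Stand-in: registry credentials were rejected."""


class DockerRegistryRateLimitExceeded(DockerError):
    """Stand-in: the registry rate limit was hit."""


class ResolutionFixupError(Exception):
    """Stand-in for the error type a fixup runner handles quietly."""


# Failures that a later retry may resolve once the cause is fixed.
EXPECTED_FAILURES = (
    DockerBuildError,
    DockerNoSpaceOnDevice,
    DockerRegistryAuthError,
    DockerRegistryRateLimitExceeded,
)


def execute_repair(install) -> None:
    """Run the install and translate expected failures into a fixup error."""
    try:
        install()
    except EXPECTED_FAILURES as err:
        raise ResolutionFixupError(str(err)) from err


def run_fixup(install) -> str:
    """Mimic a runner that logs fixup errors and keeps the repair around."""
    try:
        execute_repair(install)
    except ResolutionFixupError as err:
        _LOGGER.warning("Error during processing execute_repair: %s", err)
        return "kept_for_retry"
    return "resolved"


def failing_build():
    """Hypothetical install callable that fails like the log above."""
    raise DockerBuildError("Can't pull image docker:29.4.2-cli")
```

Any exception outside `EXPECTED_FAILURES` still propagates from `run_fixup`, so unexpected errors stay visible.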

sairon

A missing builder image (e.g. docker:29.4.2-cli, when the host's exact Docker patch version has no matching -cli tag published on Docker Hub)

Note that this is a situation that shouldn't normally happen. But because Docker messed something up in their packaging, their CI is failing and the 29.4.2 images are missing from Docker Hub. Normally the images are published within a day of a release. Also, the OS build will fail too if those images are not published, so this only affects early adopters on Supervised and dev environments.

RubenNL · 2026-05-06T18:18:10Z

For everyone who found this issue and needs a quick workaround:

docker pull docker:29.4.1-cli
docker tag docker:29.4.1-cli docker:29.4.2-cli

Of course, this is an ugly fix and shouldn't be used long term. To remove the tag, just run docker image rm docker:29.4.2-cli

sairon · 2026-05-07T08:26:02Z

29.4.3 was released yesterday and it's making its way to the Docker Hub: docker-library/docker@85f8094

So hopefully it should be resolved within a day or two.

mdegat01

I guess a comment is best here. I don't know if changes are needed, but I do need to confirm a few things, namely whether we actually want to delay image builds until after setup or just want to keep the exceptions from breaking setup. Currently we're only doing the latter with this change: image builds will still occur during setup, just at a different step.

mdegat01 · 2026-05-08T16:59:44Z

+            # Docker error other than a clean "image not found" - we can't
+            # tell whether the image is actually missing. Log and leave the
+            # addon detached; a future load will reattempt and surface a
+            # MISSING_IMAGE repair if appropriate.


This references a "future load" that will fix this. But there is no future load. There are only three places where we call load right now:

Startup of Supervisor, where we call it for each installed addon

On install of a new app

On restore of an app (but only if it was newly installed, so really this is still two)

It could be that just this comment is incorrect, which is no big deal, but I wanted to make sure there wasn't a misunderstanding. Currently, if this load/attach process fails, there is no fallback/retry mechanism in place. If we need that now, we have to add it.

Right, this needs manual interaction. We don't currently raise a repair for this issue, and I don't think it's worth the effort since this is a corner case. Affected users can just go to the app page and trigger a rebuild 🤷.

mdegat01 · 2026-05-08T17:08:39Z

+            # Dockerfile or unavailable base/builder image; disk full; bad
+            # credentials; registry rate limit). Surface as a fixup error so
+            # FixupBase swallows it without a Sentry event. The repair stays
+            # available for manual retry once the underlying cause is fixed.


Something I realized looking at this: only one instance of this class is ever created. It's created at Supervisor startup and we use the same instance the entire time Supervisor is running. So self.attempts is never reset; once you get 5 failures, this is just a manual fixup until Supervisor is restarted.

If we're trying to improve this fixup, maybe we want to reset attempts on success? Or give each addon its own attempts count using a dictionary? This is an existing issue, so it doesn't have to be tackled here; just noting it.

Yeah, since this is rather a corner case, I'd prefer not to add more complexity.
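
For reference, the per-app counter idea floated in this thread (which was not adopted in this PR) could look roughly like the sketch below. The class `RepairAttempts` and its methods are hypothetical; the limit of 5 is taken from the comment above, and the slug strings are placeholders.

```python
MAX_AUTO_ATTEMPTS = 5  # Limit mentioned in the review comment.


class RepairAttempts:
    """Track failed automatic repair attempts separately per app slug."""

    def __init__(self, max_attempts: int = MAX_AUTO_ATTEMPTS) -> None:
        self._max_attempts = max_attempts
        self._attempts: dict[str, int] = {}

    def can_auto(self, slug: str) -> bool:
        """Return True while this app still has automatic attempts left."""
        return self._attempts.get(slug, 0) < self._max_attempts

    def record_failure(self, slug: str) -> None:
        """Count one more failed attempt for this app."""
        self._attempts[slug] = self._attempts.get(slug, 0) + 1

    def record_success(self, slug: str) -> None:
        """Reset the counter once a repair for this app succeeds."""
        self._attempts.pop(slug, None)
```

With this shape, one app exhausting its attempts does not turn the repair into a manual-only fixup for other apps, and a success resets the count without a Supervisor restart.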

mdegat01 · 2026-05-08T17:27:07Z

+                # Don't run a local build during setup. Surface a repair so
+                # the resolution autofix loop can handle it off the critical
+                # path.
+                self._create_missing_image_issue()


This won't actually move this logic out of setup btw. It will move it out of setup of AddonManager but ResolutionManager is loaded afterwards here:

supervisor/supervisor/core.py

Line 184 in 78d3bb9

self.sys_resolution.load(),

And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup. But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:

supervisor/supervisor/resolution/fixups/addon_execute_repair.py

Lines 64 to 67 in 78d3bb9

@property
def auto(self) -> bool:
    """Return if a fixup can be apply as auto fix."""
    return self.attempts < MAX_AUTO_ATTEMPTS

To something like this:

@property
def auto(self) -> bool:
    """Return if a fixup can be apply as auto fix."""
    return self.sys_core.state not in CoreState.SETUP and self.attempts < MAX_AUTO_ATTEMPTS

Or provide a fixed list of states you want CoreState to be in.
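
The "fixed list of states" variant might be sketched like this. The enum below is a reduced stand-in for Supervisor's `CoreState`, the class `ExecuteRepairFixup` and its `get_state` callable are hypothetical, and choosing `RUNNING` as the only allowed state is an assumption made for illustration.

```python
from enum import Enum


class CoreState(str, Enum):
    """Reduced stand-in for Supervisor's CoreState enum."""

    SETUP = "setup"
    STARTUP = "startup"
    RUNNING = "running"


MAX_AUTO_ATTEMPTS = 5

# Assumption: automatic repair is only allowed once the core is running.
AUTO_FIX_STATES = frozenset({CoreState.RUNNING})


class ExecuteRepairFixup:
    """Hypothetical fixup whose auto flag depends on the core state."""

    def __init__(self, get_state) -> None:
        self._get_state = get_state  # Callable returning the current CoreState.
        self.attempts = 0

    @property
    def auto(self) -> bool:
        """Return True if the fixup may run automatically right now."""
        return (
            self._get_state() in AUTO_FIX_STATES
            and self.attempts < MAX_AUTO_ATTEMPTS
        )
```

An explicit allow-list keeps the intent readable and avoids relying on how membership tests behave on a single enum member.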

Bear in mind though, there is currently no other healthcheck between the end of SETUP and when apps are started during STARTUP. So currently any addons which exit SETUP without their image available will effectively have boot disabled, since they will fail to start during boot and will then have to be started manually afterwards, unless we add another healthcheck at the top of Core.start.

On that note, we should probably temporarily disable boot on any addons for which we have decided we cannot download or build an image right now. Otherwise we'll just try again during STARTUP and fail again.

And as part of its load it runs a healthcheck which then runs autofixes. If your goal is simply to prevent exceptions raised from building from breaking setup then that still accomplishes that, since exceptions raised by autofix fixups won't break setup.

It is certainly the main aim of this PR.

But if your goal is to stop setup from waiting for images to be built then you should adjust this logic:

So that came as an afterthought: how often do we even need to build on setup? I encountered this on my development system, where I had an app which no longer builds. I cleaned up Docker images at one point, which is probably why I started running into it. Once you have such a non-building app, you'll encounter it on every startup, and it will slow down the start. So I felt like we should punt this.

I have no idea how often users run into this, probably almost never. If the build fails on install, we roll back the installation of the app, so normally users should not encounter this at all.

From what I can tell this is really a corner case scenario, so I can live with either approach.

This won't actually move this logic out of setup btw. It will move it out of setup of AddonManager but ResolutionManager is loaded afterwards here:

Actually, it does: run_autofix has JobCondition.RUNNING. So the code as is already defers to running.

I missed that, that covers it then.
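
The gating effect discussed here can be illustrated with a simplified condition-checking decorator. Everything below is a stand-in: the `job` decorator, `Core`, and `Resolution` classes are hypothetical, and whether the real decorator returns silently, logs, or raises when a condition is not met is not shown in this thread, so returning `None` is an assumption.

```python
import asyncio
from enum import Enum
from functools import wraps


class CoreState(str, Enum):
    """Reduced stand-in for Supervisor's CoreState enum."""

    SETUP = "setup"
    RUNNING = "running"


class JobCondition(str, Enum):
    """Reduced stand-in with only the condition discussed above."""

    RUNNING = "running"


class Core:
    """Holds the current core state."""

    def __init__(self, state: CoreState) -> None:
        self.state = state


def job(conditions):
    """Skip the wrapped coroutine unless all listed conditions hold."""

    def decorator(method):
        @wraps(method)
        async def wrapper(self, *args, **kwargs):
            if (
                JobCondition.RUNNING in conditions
                and self.core.state != CoreState.RUNNING
            ):
                return None  # Assumed behavior: skip quietly.
            return await method(self, *args, **kwargs)

        return wrapper

    return decorator


class Resolution:
    """Hypothetical manager whose autofix is gated on the running state."""

    def __init__(self, core: Core) -> None:
        self.core = core
        self.applied: list[str] = []

    @job(conditions=[JobCondition.RUNNING])
    async def run_autofix(self) -> bool:
        self.applied.append("execute_repair")
        return True
```

Called during setup, `run_autofix()` does nothing; once the state is `RUNNING`, the same call applies the fixup, which is why the build is already deferred past setup.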

A missing builder image (docker:<version>-cli) during a build-required app load aborted Supervisor setup entirely, leaving the system stuck in setup state where every subsequent operation was blocked by the not-healthy guard. Triggered in practice when the host's Docker patch version had no matching `-cli` tag published on Docker Hub.

Two issues compounded the failure: `images.pull` in `run_command` leaked a raw `aiodocker.DockerError` past the `@Job` decorator, which rewrapped it as `JobException` and bypassed the `suppress(DockerError, ...)` guard in `addon.load()`; and the load path treated all Docker errors the same whether the image was simply missing or the daemon itself was misbehaving.

Wrap the pull error in `run_command` so it propagates as Supervisor's `DockerError` (a `HassioError`) and is preserved by the decorator. Distinguish 404s in `attach()` and `check_image()` by raising `DockerNotFound`/`DockerAPIError` instead of generic `DockerError`.

In `addon.load()`, only the `DockerNotFound` path is treated as "image missing": for build-required apps we skip the inline build and surface a `MISSING_IMAGE` repair so the resolution autofix loop handles it off the critical path; for pull-based apps we still attempt install during load and create the repair on failure. Other `DockerError`s (daemon trouble or a failed internal install in `check_image`) are logged at CRITICAL (which the Sentry logging integration captures) and the addon is left detached rather than masked as a misleading missing-image repair.

In the autofix path, swallow `DockerBuildError`, `DockerNoSpaceOnDevice`, `DockerRegistryAuthError`, and `DockerRegistryRateLimitExceeded` as `ResolutionFixupError` so they don't generate Sentry events on every retry. The repair stays available for manual retry once the underlying cause (registry tag published, disk freed, credentials fixed, rate limit expired) is resolved.

The comment claimed "a future load will reattempt and surface a MISSING_IMAGE repair if appropriate", but App.load() is only called at Supervisor startup, on fresh install, and on backup restore — there is no automatic retry mechanism. Reword to match reality: the CRITICAL log captures the issue for diagnostics (Sentry), and the user can trigger a manual repair once the daemon is healthy.

mdegat01

Ok looks good, LGTM 👍

agners added the bugfix label May 6, 2026

home-assistant Bot added the cla-signed label May 6, 2026

sairon reviewed May 6, 2026

mdegat01 reviewed May 8, 2026

agners force-pushed the improve-startup-missing-container-image-handling branch from afc1165 to 183e66f Compare May 18, 2026 17:03

agners force-pushed the improve-startup-missing-container-image-handling branch from 9aa6665 to cbf75ba Compare May 18, 2026 17:25

agners requested a review from mdegat01 May 18, 2026 17:27

Clarify comment about user interaction

55a412e

mdegat01 approved these changes May 20, 2026

agners merged commit 0bcedf5 into main May 20, 2026
20 of 21 checks passed

agners deleted the improve-startup-missing-container-image-handling branch May 20, 2026 15:59
