Description
Let's use this issue to track the quirks inherent to booting an OCI image. Hopefully, by the time this issue closes, it will be possible to boot an image built with FROM fedora:40.
First, let's begin with some background on those quirks.
Background
Currently, an OSTree-based system can only boot an OSTree commit. An OSTree commit is essentially a serialization format for a filesystem tree, much like a tarball, with the added benefit that it can be deduplicated at the file level.
To make that directory bootable and memoryless ("without hysteresis"), the OSTree project contains a variety of setup steps in which, e.g., the initramfs is generated and placed in /usr/lib, /etc files are moved to /usr/etc, and so on.
These steps are performed by rpm-ostree's image generation backend, and currently no other tool can perform them. In addition, rpm-ostree ships a couple of systemd services that fix up OS quirks (e.g., generating /var from a location called var factory).
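As a rough illustration of one of these setup steps, the move of /etc files to /usr/etc could look like the sketch below. This is not rpm-ostree's code: the helper name move_etc_to_usr_etc and its overwrite-on-conflict behaviour are assumptions made for the example, and it ignores directory permissions, xattrs, and SELinux labels entirely.

```python
import os
import shutil
from pathlib import Path


def move_etc_to_usr_etc(rootfs: Path) -> list[str]:
    """Move every file under rootfs/etc to the same path under rootfs/usr/etc.

    Hypothetical helper for illustration only. Files that already exist in
    usr/etc are overwritten, and new directories get default permissions.
    Returns the sorted relative paths that were moved.
    """
    etc = rootfs / "etc"
    usr_etc = rootfs / "usr" / "etc"
    moved: list[str] = []
    if not etc.is_dir():
        return moved
    for dirpath, dirnames, filenames in os.walk(etc):
        rel = Path(dirpath).relative_to(etc)
        (usr_etc / rel).mkdir(parents=True, exist_ok=True)
        # Symlinks to directories are listed in dirnames; move the link itself
        # and stop os.walk from visiting it.
        links = [d for d in dirnames if (Path(dirpath) / d).is_symlink()]
        dirnames[:] = [d for d in dirnames if d not in links]
        for name in filenames + links:
            os.replace(Path(dirpath) / name, usr_etc / rel / name)
            moved.append(str(rel / name))
    # Only empty directories are left behind at this point.
    shutil.rmtree(etc)
    return sorted(moved)
```

Note that the sketch recreates directories with default modes, which is exactly the kind of silent attribute loss this issue is about.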
Then, the filesystem is wrapped into a commit and placed on an HTTP/2-enabled server, from which users download new system files when an update happens.
While revolutionary, this system had the following disadvantages:
- Cannot keep up with internet speeds. Regardless of whether HTTP/2 is used, performing random file requests against an HTTP host is CPU intensive.
- Not possible to extend
- The tree file format, while logical, is very hard to adopt.
OCI extension
Therefore, ostree-rs-ext was developed with a new serialization format, which converts an OSTree commit to an OCI image. This format embeds the OSTree commit as an OSTree repository with xattr format in the /sysroot/ostree directory. Then, as the commit is written to the tar stream, the OSTree files are hardlinked to the location they would have in the system (e.g., /usr/etc OSTree files are hardlinked to /etc).
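The hardlinking idea can be sketched with Python's tarfile module: each object is written once under the repository directory, and every place it appears in the filesystem becomes a hard-link entry pointing back at it. This is only a sketch. The object path layout, the use of a plain SHA-256 of the content as the object name, and the helper name add_object_with_link are assumptions for illustration and do not reproduce the real ostree-rs-ext format.

```python
import hashlib
import io
import tarfile


def add_object_with_link(tar: tarfile.TarFile, seen: set, content: bytes, dest: str) -> str:
    """Store content once as an object, then hard-link it to dest.

    Hypothetical helper: the object path and checksum scheme are simplified
    stand-ins, not the actual OSTree object naming.
    """
    digest = hashlib.sha256(content).hexdigest()
    obj_path = f"sysroot/ostree/repo/objects/{digest[:2]}/{digest[2:]}.file"
    if obj_path not in seen:
        # First occurrence: write the real file data into the stream.
        info = tarfile.TarInfo(obj_path)
        info.size = len(content)
        info.mode = 0o644
        tar.addfile(info, io.BytesIO(content))
        seen.add(obj_path)
    # Every occurrence: a hard-link entry that carries no data of its own.
    link = tarfile.TarInfo(dest)
    link.type = tarfile.LNKTYPE
    link.linkname = obj_path
    tar.addfile(link)
    return obj_path
```

Because link entries carry no payload, two files with identical content cost one object plus two small headers in the stream.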
The benefit of this format is that it makes it possible to run the result as a container and extend it.
This is why Bazzite is possible.
A trivial compression format splits this across 64 layers, which makes the image easier to download and makes some bandwidth savings possible.
When rpm-ostree receives that image, it first checks whether it is a commit that has not been extended. If it has not been extended, rpm-ostree imports it as usual. If it has been extended, rpm-ostree imports the OSTree layers as an original "base" commit. Directory permissions are also sourced from the commit, so they can differ (and in practice do differ) from those in the final container.
Then, for the extension layers, OSTree converts them to small commits on the fly, by using the base commit for SELinux labelling and moving /etc files to /usr/etc. This means that any extensions added over OCI have not been postprocessed and have quirks.
For example, the /etc/passwd file has drift. And since only the base commit is used for SELinux labelling, any package additions with custom SELinux rules break.
And of course, if there is no base commit, rpm-ostree will not load the image.
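The import decision described above can be modelled as a small planning function. Everything here is hypothetical: the Layer type, the is_ostree flag, and plan_import are invented for illustration, and how rpm-ostree actually tells base layers from extension layers is not covered by this sketch.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    digest: str
    # Hypothetical flag: True if the layer is part of the encapsulated commit.
    is_ostree: bool


def plan_import(layers: list[Layer]) -> dict:
    """Split an image into its base commit layers and its extension layers."""
    base = [layer.digest for layer in layers if layer.is_ostree]
    derived = [layer.digest for layer in layers if not layer.is_ostree]
    if not base:
        # Mirrors the text: without a base commit the image is not loaded.
        raise ValueError("no base OSTree commit found in image")
    return {
        "base_commit_layers": base,
        # Each of these would become a small commit, labelled using the base
        # commit's SELinux policy, with /etc files moved to /usr/etc.
        "derived_layers": derived,
        "extended": bool(derived),
    }
```

A plain FROM fedora:40 image would contain no layer flagged as is_ostree, so it falls into the error branch, which is the situation this issue is tracking.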
Bootc
Now, bootc comes along and formalizes the notion of OCI as OS images. Initially, it uses ostree-rs-ext to do the unencapsulation. However, soon it will use podman to pull and expand the container, which is then fed to OSTree (bootc-dev/bootc#215). This solves the SELinux issues but introduces a set of new ones.
The codebase of that PR was used as a reference when building rechunk and, surprisingly, the resulting image did not boot. Therefore, when that PR merges, bootc will stop being able to boot extended images.
Why?
A lot of minor reasons.
The OCI container might have wrong permissions on certain systemd directories, which makes it fail to boot. The container might have both an /etc and a /usr/etc directory, which OSTree does not like at all, although with the way rpm-ostree is implemented right now it works (/etc files are transparently merged into /usr/etc). The polkitd folder might have lost the polkitd group and broken. Podman rootless may break because newuidmap has broken capabilities. And so on (see https://github.com/hhd-dev/rechunk/blob/master/1_prune.sh), with even more quirks we do not know about.
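A few of the quirks listed above can be detected mechanically on an unpacked root filesystem. The sketch below is an assumption-heavy illustration, not what rechunk's 1_prune.sh does: the function names, the group file locations, and the rule that newuidmap needs either a security.capability xattr or the setuid bit are all choices made for this example.

```python
import os
import stat
from pathlib import Path

# Assumed locations of the group database inside the image.
GROUP_FILES = ("usr/lib/group", "usr/etc/group", "etc/group")


def lookup_gid(rootfs: Path, group: str):
    """Return the gid of a group as recorded inside the image, or None."""
    for rel in GROUP_FILES:
        path = rootfs / rel
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            fields = line.split(":")
            if len(fields) >= 3 and fields[0] == group:
                return int(fields[2])
    return None


def find_quirks(rootfs: Path) -> list[str]:
    """Report a few known boot-breaking quirks (hypothetical checks)."""
    problems = []
    if (rootfs / "etc").is_dir() and (rootfs / "usr/etc").is_dir():
        problems.append("both /etc and /usr/etc exist")
    gid = lookup_gid(rootfs, "polkitd")
    for rel in ("etc/polkit-1/rules.d", "usr/etc/polkit-1/rules.d"):
        rules = rootfs / rel
        if gid is not None and rules.is_dir() and rules.stat().st_gid != gid:
            problems.append(f"/{rel} lost the polkitd group")
    newuidmap = rootfs / "usr/bin/newuidmap"
    if newuidmap.is_file():
        setuid = bool(newuidmap.stat().st_mode & stat.S_ISUID)
        try:
            caps = os.getxattr(newuidmap, "security.capability")
        except (OSError, AttributeError):
            # No such xattr, or a platform without os.getxattr.
            caps = b""
        if not setuid and not caps:
            problems.append("/usr/bin/newuidmap has neither file capabilities nor setuid")
    return problems
```

Running find_quirks against the directory a container was expanded into gives a list of human-readable findings; an empty list only means none of these particular checks fired.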
TLDR
In order for FROM fedora:40 to be possible, the following needs to happen:
- The postprocessing applied by rpm-ostree needs to be documented
- The loss of attributes needs to be documented (file capabilities, xattrs, SELinux)
- In case attributes are missing, before the image is deployed it needs to be "quirked" to have correct permissions (e.g., adding polkitd to /usr/etc/polkit-1/rules.d)
- Both bootc (when deploying arbitrary images) and rechunk (when preparing OSTree commits) need to implement them so that there is no drift between the two implementations (e.g., users can skip rechunk when testing an image and deploy it straight with bootc).
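One way to start documenting attribute loss is to snapshot the attributes of a tree before and after a processing step and diff the two. The sketch below records only mode, uid, and gid; xattrs, file capabilities, and SELinux labels are deliberately left out to keep it portable. The names snapshot and diff_attrs are hypothetical.

```python
import os
import stat
from pathlib import Path


def snapshot(rootfs: Path) -> dict:
    """Map each relative path to (permission bits, uid, gid)."""
    out = {}
    for dirpath, dirnames, filenames in os.walk(rootfs):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            st = path.lstat()  # do not follow symlinks
            rel = str(path.relative_to(rootfs))
            out[rel] = (stat.S_IMODE(st.st_mode), st.st_uid, st.st_gid)
    return out


def diff_attrs(before: dict, after: dict) -> dict:
    """Return {path: (old, new)} for paths present in both snapshots
    whose attributes changed."""
    common = sorted(before.keys() & after.keys())
    return {p: (before[p], after[p]) for p in common if before[p] != after[p]}
```

Taking one snapshot inside the built container and another after the image has been imported would surface drift such as a directory that silently lost its group.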
Of course, there is still value in using ostree encapsulated commits in a bootc world:
- Maintains compat with rpm-ostree
- Lower layer invalidation means fewer lookups when updating AND fewer files committed to OSTree
- There is no need for unrolling the original image for SELinux labelling
- There is no need for quirking the original image, which takes time
- Composefs can be precomputed
- Users can still extend the image arbitrarily
For most users, who are not developers, it does not make sense to have to eat the update cost for the sake of distro maintainer DX, especially when rechunk can fix up the image in 7 min.
Tagging @cgwalters as the discussion with bootc-dev/bootc#215 affects bootc