
[PoC] Split rootfs#5470

Draft
OhmSpectator wants to merge 18 commits into lf-edge:master from OhmSpectator:feature/split-rootfs

Conversation

@OhmSpectator
Member

Description

This PR introduces the split-rootfs proof of concept for EVE. The build now produces two LinuxKit images: a minimal “bootstrap” rootfs with only boot-critical services, and a “pkgs” rootfs containing the non-critical services. It also exposes make targets to build and run them either independently or together. Pillar gains the extsloader agent, which discovers pkgs.img, mounts it, and starts its services via containerd so they appear in eve list. Documentation in docs/SPLIT-ROOTFS.md describes the architecture, workflows, and validation steps for the experiment.

PR dependencies

None.

How to test and validate this PR

  1. Build both rootfs images and a bootstrap live disk:
make pkgs eve multi_rootfs live-bootstrap

Confirm the artifacts appear under dist/amd64/<version>/installer/.

  2. Boot with pkgs injection:
make run-bootstrap-with-pkgs

Inside EVE, confirm /persist/pkgs is mounted, extsloader is running (logread | grep extsloader), and the pkgs services show up in eve status.

No automated tests cover this yet.

Changelog notes

Adds experimental split-rootfs tooling (bootstrap + pkgs images), developer run targets, and the Pillar external services loader that auto-starts pkgs.img contents.

PR Backports

  • 16.0: No, experimental development-only change.
  • 14.5-stable: No.
  • 13.4-stable: No.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR
  • I've checked the boxes above, or I've provided a good reason why I didn't check them.

@OhmSpectator OhmSpectator marked this pull request as draft December 2, 2025 15:21
@OhmSpectator
Member Author

I'm impressed I had only one Yetus warning =D

@deitch
Contributor

deitch commented Dec 2, 2025

I am quite in favour of this idea, at least in principle. @rene has raised it before as well.

I am not convinced this is the best way to solve the "300MB limit" issue; that is better handled by resizing partitions (which can be done safely, if done correctly). However, this does have the option of making the "common OS" (term created by Daniel Derksen) much more common, and then all of the "additional things" be added later or earlier. That is the real value IMO; reducing EVE's proliferation.

I have been using the term "boot+core" for the first part, and "system+app" for the second part:

  • boot+core = everything needed to get EVE to the point where it can communicate with a controller, download config and OS updates, and get the second parts
  • system+app = everything that is add on for managing the system (observability, logging, debug, etc.) or apps (k3s, runtimes, volume management, etc.)

Of course, we can be OK with any terminology.

I do have some specific disagreements with parts of the design.

  1. Is linuxkit the best way to make the "system+app" (other parts)? I am not convinced.
  2. Should "other parts" ("system+app") be a single monolithic squashfs? I don't think so. I think having it be container images is much more flexible, and allows us to pick and choose which pieces we want.
  3. We should store those in erofs format, so that we can mount them directly as-is.
  4. We should use fs-verity or dm-verity to verify them, so they cannot be changed even on disk.
  5. How does the core get access to system+app? I think it should be able to find them, but it also should be able to download them. More accurately, I think these should be designated by the controller, along with the expected digests, so that it is as immutable and verifiable as the system itself.

I have done a number of experiments on containers in erofs with verity, and it works quite well.

Summarizing: I like what you are doing, definitely bring @rene into it, as he has ideas. How we build and integrate with the "other parts" should be subject to discussion.

@OhmSpectator
Member Author

@deitch, it's just a PoC for one of the approaches that @rene asked me to explore by the end of this year. I really appreciate your comments, and I suspect you have a better vision of which tools should be used for this. There is a design doc that you can also comment on; ping @rene in a DM to get a link to it. For now I'm done with the task.

@deitch
Contributor

deitch commented Dec 2, 2025

Never assume I have a better vision; just a potentially different one.

@codecov

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (4522a94).
⚠️ Report is 294 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5470      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     

@eriknordmark
Contributor

I am not convinced this is the best way to solve the "300MB limit" issue; that is better handled by resizing partitions (which can be done safely, if done correctly). However, this does have the option of making the "common OS" (term created by Daniel Derksen) much more common, and then all of the "additional things" be added later or earlier. That is the real value IMO; reducing EVE's proliferation.

Even if we had the resizing today, I don't know how we would convince all users to perform that operation in production before they need the next set of CVE fixes (which are likely to push the current image above the 300MB limit). So I think we need this tool in our collection to be able to move forward (in addition to getting the resizing in place).

@eriknordmark
Contributor

I am not convinced this is the best way to solve the "300MB limit" issue; that is better handled by resizing partitions (which can be done safely, if done correctly).

I thought I added this comment from my phone earlier, but here we go again.

Even if we have good partition resizing support, a user probably needs to take some additional care (e.g., not power the device off, intentionally or accidentally, during the resize), which means they might want to pick a good time to do it.
And while they prepare for that, there might be another CVE (akin to the recent runc CVE) that grows the EVE rootfs image by a few more megabytes.
If I were a user/customer, I would be unhappy if I couldn't apply a fix for such a CVE until I had taken that larger maintenance window and put in the effort to do the resizing.

Thus I think we need this approach plus resizing (plus a larger default partition size at install).

@deitch
Contributor

deitch commented Dec 12, 2025

Thus I think we need this approach plus resizing (plus a larger default partition size at install).

Agreed. In the long run, the resize will cover us for many years to come, especially with the new sizes. But we need this.

@deitch
Contributor

deitch commented Dec 12, 2025

Even though I think, implementation-wise, it should be somewhat different.

@shjala
Member

shjala commented Dec 16, 2025

This approach (or any, for that matter), if not designed from the ground up to address the security implications it may bring, can completely undermine one of EVE’s foundational security guarantees: a device left unattended cannot be permanently backdoored (at both the EVE-OS level and the application level if an encrypted partition is used).

We should use fs-verity or dm-verity to verify those, so they cannot be changed even on disk

@deitch This can’t help without root hash signing (which is another can of worms by itself). An attacker with root access or, in our case, with physical disk access can simply modify the image and then update the entire hash tree.

There are possible solutions, but each has its own limitations and brings complexity:

  • We can use full-disk encryption with TPM-stored keys (this can include the run-time part of the OS), but that brings the risk of the device not booting at all because of PCR changes.
  • Build-time signing of run-time images is possible, but who owns the signing key? lf-edge? Zededa? The eve-os project?
  • We can extend our measured boot + remote attestation to the run-time images, but then what happens when measured boot fails to match the known-good state? We won't run part of the system? Then we may need to move all the diagnostic-related code to base-os.
  • Deliver extra run-time images signed by the controller, with the controller's public key embedded in the verified base-os.
  • Controller-based hash verification: base-os queries the controller to verify runtime image hashes at load time (no pre-signing needed; works offline with cached approvals).
  • etc.

@andrewd-zededa I'm not familiar with the design of eve-k, but the same issue may apply to it ☝🏼

@eriknordmark
Contributor

  • We can extend our measured boot + remote attestation to the run-time images, but then what happens when measured boot fails to match the known-good state? We won't run part of the system? Then we may need to move all the diagnostic-related code to base-os.

@shjala
If we treat the additional EVE images the same as the rootfs image, then can't we just measure them into the appropriate PCR before we start using them?
Today we let EVE run (and attest its PCRs to the controller), so that approach doesn't have to change just because we split out parts from the rootfs image (and we can potentially extend this to other "optional" images like longhorn down the road, but that is a different, related topic).

@deitch
Contributor

deitch commented Dec 19, 2025

We should use fs-verity or dm-verity to verify those, so they cannot be changed even on disk

This can’t help without root hash signing (which is another can of worms by itself). An attacker with root access or, in our case with physical disk access, can simply modify the image and then update the entire hash tree

@shjala why not?

  1. System boots
  2. Hardware root of trust in PCR unlocks root filesystem - any changes would cause failure to unlock
  3. Root filesystem downloads or otherwise accesses "extended root blocks" on read-write filesystem
  4. Root filesystem has embedded in it (or downloaded from controller) digest of those blocks
  5. Root uses fsverity/dmverity so the kernel loads those "extended root blocks" and verifies their content

With verity, any changes on the filesystem will cause the kernel to refuse to pass those through. The key is that:

  • the kernel is verified (as now)
  • the extended blocks reader is verified (as now, just part of the core root)
  • the extended blocks reader either has the digests of the extended blocks built into it (and therefore already verified) or downloads them from the controller (which it trusts)

With the above, the extended blocks reader is guaranteed that the contents of the extended blocks are never changed, or that any change is detected and blocked.

What did I miss?

Add dedicated bootstrap/pkgs LinuxKit templates, glue them into the
build graph, and expose make targets for producing each squashfs so the
new architecture can be built independently of the legacy rootfs.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Wire the split rootfs into developer workflows by adding live/run
targets, and documenting the new entry points in the help output.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Introduce the external-services loader that discovers pkgs.img, mounts
it, and starts services through containerd, and register the agent
inside the boot sequence.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Add SPLIT-ROOTFS.md outlining the motivation, build pipeline, and Pillar
integration for the bootstrap/pkgs experiment.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
- Makefile: installer-split target, eve-hv-supported metadata injection
- installer.yml.in: include rootfs-ext.img in installer
- pkg/installer/install: read HV from CONFIG, copy ext to persist
- pkg/grub: read eve-hv-type from CONFIG, write /run/eve-hv-type
- storage-init: select ext4/ZFS based on CONFIG eve-hv-type
- onboot.sh: propagate CONFIG eve-hv-type to runtime
- extsloader: skip kube service when HV != k
- rootfs_core/ext/universal.yml.in: split rootfs templates
- docs/SPLIT-ROOTFS-DESIGN.md: v1.6 with POC implementation status

Signed-off-by: Nikolay Martyanov <ohmspectator@gmail.com>
- gate zedkube startup in device-steps by eve-hv-type (/run/eve-hv-type, fallback /etc/eve-hv-type).
- subscribe to ENClusterAppStatus in zedmanager only for kube HV.
- update rootfs_ext service template to match the split/universal service set for this WIP.

Signed-off-by: Nikolay Martyanov <ohmspectator@gmail.com>
Exclude /persist/pkgs and /persist/eve-services from volumemgr large-file scan to avoid oversized DiskMetric payloads in split rootfs runs. Also tighten FindLargeFiles exclude matching to directory boundaries and add a regression test.

Signed-off-by: Nikolay Martyanov <ohmspectator@gmail.com>
Switch kube-only behavior from build-time assumptions to runtime checks for pillar-k running on non-k hypervisor types.

- gate PVC disk metrics and PVC format bootstrap scan by runtime HV type
- add shared kube runtime guard and apply it across exported kubeapi functions in //go:build k files
- add runtime fail-fast checks in zedkube and kubevirt hypervisor exported entry points
- reject kubevirt hypervisor selection when runtime HV type is not k
- make CSI handler constructor safely fall back when kube runtime is disabled
- keep node-drain behavior runtime-aware (NOTSUPPORTED on non-k)
- handle DeleteNAD call-site errors explicitly

This keeps split/universal images runtime-driven instead of relying on compile-time -k behavior.

Signed-off-by: Nikolay Martyanov <ohmspectator@gmail.com>
OhmSpectator and others added 4 commits March 17, 2026 14:12
Split-rootfs OTA update mechanism:
- Makefile: eve-split target rewritten to avoid recursive make (fixes
  dirty-timestamp and monolithic rootfs rebuild issues). Uses cp instead
  of symlinks so Docker COPY gets real files, not dangling symlinks.
- Dockerfile.in: conditional #SPLIT_ROOTFS_LABEL# for Extension disk label
- baseosmgr/worker.go: WriteExtensionToPersist after WriteToPartition
- baseosmgr/handleextension.go: CAS helper for Extension extraction
- cas/extension.go: FindAdditionalDiskBlob for OCI manifest walking
- extsloader: CAS self-heal via Puller.Pull with FilesTarget{Disks},
  pubsub ProcessChange for BaseOsStatus/ContentTreeStatus subscriptions
- types/locationconsts.go: shared Extension A/B path constants
- Roadmap doc updated with HV-agnostic Core limitation note

Eden test infrastructure:
- tests/eden/run.sh: split-rootfs test entry (test #8)
- tests/eden/prepare-split-rootfs-test.sh: build/push/setup/test script

Tested: monolith(15.11.0) -> split OTA, CAS self-heal extracts Extension,
dm-verity mount, debug/SSH container starts from Extension.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- types/extsloadertypes.go: new ExtsloaderStatus type with State enum
  (Starting/Ready/Failed), published via pubsub by extsloader
- extsloader: publish ExtsloaderStatus via pubsub instead of only writing
  /run/extsloader-state.json (kept as debug aid, marked for removal)
- nodeagent: subscribe to ExtsloaderStatus from extsloader, check Extension
  ready state before marking update TestComplete in testing window
  (handletimers.go:checkExtsloaderReady). If /etc/ext-verity-roothash exists
  but Extension not ready, testing window keeps waiting → eventual reboot →
  automatic rollback
- base/logobjecttypes.go: register ExtsloaderStatusLogType

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pkg/pillar/Dockerfile: add cryptsetup package (provides veritysetup)
  to the runtime PKGS. Required for dm-verity verification of Extension.
- extsloader: remove legacy read-only loop mount fallback. dm-verity is
  now mandatory for Extension mounting. If veritysetup is missing or
  roothash not found, mount fails → extsloader reports failed →
  nodeagent testing window holds → rollback.
- Remove "no-verity" PCR12 measurement path since verity is always used.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tests/eden/prepare-broken-split-image.sh: builds a split-rootfs EVE
  image with corrupted ext-verity-roothash for rollback testing
- tools/make-ext-verity.sh: dm-verity hash tree generation for Extension

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@OhmSpectator OhmSpectator force-pushed the feature/split-rootfs branch from db48c64 to 063e1b2 on March 17, 2026 19:36
OhmSpectator and others added 3 commits March 18, 2026 11:41
…sitions

Add CleanupUnusedExtension() to baseosmgr/handleextension.go. Called when
a partition transitions to "unused" state:
- After successful update (handleZbootTestComplete): other slot's ext cleaned
- After uninstall (doBaseOsUninstall): other partition marked unused, ext cleaned
- After failed version check (doBaseOsActivate): partition marked unused, ext cleaned

This prevents stale Extension images from accumulating on /persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
