DLPX-97456 use non-recursive bind for /domain0 in upgrade container #871
prakashsurya wants to merge 2 commits
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
I understand the problem, but I'm missing a bit of context. Why bother bind-mounting domain0 into the container at all if we're not mounting any other filesystem inside domain0 (because afaik, there are actually no files that live in the root domain0 filesystem)? Does the upgrade code explicitly mount other things in there? Why couldn't it do that using a regular directory instead of a bind-mounted domain0?
@sebroy you're likely correct.. quick test, in a container, without that bind mount: I'm not certain what creates the
The previous commit switched the "/domain0" bind to "norbind" to keep the tens of thousands of per-VDB child mounts from being inherited into the upgrade container, where they made systemd's "daemon-reload" (quadratic on Ubuntu 24.04 / systemd 255) take hours during the in-container package phase.

As Seb pointed out on the PR, the bind isn't necessary at all. The root "/domain0" filesystem holds no files of its own; the data the "upgrade-verify" JAR reads (e.g. the MDS database) lives in child datasets that are cloned and mounted into "/domain0" within the container's own mount namespace at verify time. Since zfs is available in the container (via the "/dev/zfs" bind), those mounts land in the container without "/domain0" needing to be exposed from the host.

So drop the "/domain0" bind entirely, which avoids both the per-VDB mounts and the cost of the bind itself.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
oh, but it's failing to start: I can read that file manually, though: hm..
looks like without the bind mount, the permissions on we could probably solve this with a
Problem
The upgrade container is created with Bind=/domain0 in its .nspawn file, and Bind= is recursive by default, so every host mount underneath /domain0 is inherited into the container's mount namespace. On a large engine that's tens of thousands of per-VDB mounts (ESCL-6013: ~14.4k), all tracked as mount units by the container's systemd.

Starting with Ubuntu 24.04 (systemd 255), daemon-reload is quadratic in the number of units, so each of the ~136 postinst-triggered reloads during the in-container package phase takes 80-636s; the verify spends ~6.4h of a ~6.8h job inside reloads. Further, while PID 1 is reloading it doesn't service D-Bus, so bus clients time out after 25s; this can fail the verify outright (e.g. systemctl enable in a postinst script), and blocks support from accessing the container (machinectl shell, systemd-run --machine).

Solution
The solution taken in this PR is to use the norbind option on the /domain0 bind, so that only the top-level /domain0 filesystem is mounted in the container, and the per-VDB child mounts are excluded. The norbind option dates to systemd v233, so it's supported by the nspawn on the oldest hosts we upgrade from (Ubuntu 20.04 / systemd 245).

Note the bind remains propagation-slave, so mounts created on the host after container start still propagate in; that's bounded by host churn during the verify window (tens of mounts, not 14k). It's not clear that's worth eliminating (e.g. by making the bind source private), so I've left it alone for now; we can always circle back if it proves to be a problem.

Also note, with this change, the per-VDB mountpoints appear as empty directories inside the container; whether the upgrade-verify JAR reads anything below those paths is being confirmed as part of DLPX-97456.
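To make the recursive versus non-recursive distinction concrete, here is a minimal Python sketch that models which host mountpoints a bind of /domain0 would expose at container start. It is an illustration only, not the upgrade-container code: the function name and the sample mountpoints are hypothetical, and the model ignores the propagation-slave behavior noted above (mounts created on the host after start). In nspawn, bind options such as norbind are given as a third colon-separated field (e.g. Bind=/domain0:/domain0:norbind); the exact line used in this PR is not reproduced here.

```python
from pathlib import PurePosixPath


def visible_in_container(host_mounts, bind_source, recursive):
    """Model which host mountpoints a bind of bind_source exposes.

    recursive=True  models the default (rbind) behavior.
    recursive=False models norbind: only the top-level filesystem.
    Hypothetical helper for illustration; it does not touch the system.
    """
    src = PurePosixPath(bind_source)
    visible = []
    for mount in host_mounts:
        path = PurePosixPath(mount)
        if path == src:
            # The bind source itself is always mounted.
            visible.append(mount)
        elif recursive and src in path.parents:
            # Child mounts are only carried along by a recursive bind.
            visible.append(mount)
    return visible


# Made-up host mount table standing in for per-VDB mounts.
host = ["/", "/domain0", "/domain0/vdb-1", "/domain0/vdb-2/data", "/domain0-other"]
rbind_view = visible_in_container(host, "/domain0", recursive=True)
norbind_view = visible_in_container(host, "/domain0", recursive=False)
```

With the default recursive bind every mount below /domain0 is carried into the container, while the norbind model keeps only /domain0 itself, which is what keeps the container's systemd from tracking one mount unit per VDB.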
Testing Done
Pre-push (with upgrade testing enabled; BUILD_ARTIFACTS=ALL, RUN_TESTS=True, TEST_UPGRADE_FROM_VERSION=2026.4.0.0):
https://selfservice-jenkins.eng-tools-prd.aws.delphixcloud.com/job/appliance-build-orchestrator-pre-push/14212/

Additionally, validated manually on a dlpx-2025.5.0.2 VM with 14k filesystems mounted on the host under /domain0, creating/starting an upgrade container with these scripts, before and after this change:

/domain0 itself is still mounted and readable inside the container after the change. For comparison, a dlpx-24.0.0.0 VM (systemd 245, where the reload cost is linear) reloads in ~6.4s at the same mount count, which matches the behavior seen in the customer's container journal prior to the in-container systemd upgrade.

Also verified through the upgrade API on a fresh dlpx-release 2026.3.0.0 engine: applied this change to /var/dlpx-update/latest/upgrade-container, created two child mounts under /domain0 to stand in for per-VDB mounts (a tmpfs at /domain0/norbind-test holding a MARKER file, plus the existing /domain0/repave/metadata/snapshot-based zfs mount), then kicked off a DEFERRED upgrade to 2026.4.0.0 via POST .../system/version/<ref>/apply. The verify phase created the container with the norbind bind, so comparing the host and container mount namespaces .. e.g.

/domain0 is mounted and readable in the container, while neither host child mount appears in the container's namespace; the MARKER file on the tmpfs is not visible inside the container either, confirming the child mount is excluded rather than just empty. The verify completed successfully (status: VERIFIED). The DEFERRED apply was then blocked only by the unrelated upgrade.platform.minimum.memory check (12GB required, the small test VM had ~7GB), which is independent of this change.

🤖 Generated with Claude Code
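The host versus container mount comparison described in the testing notes could be scripted roughly as follows. This is a sketch under stated assumptions, not the commands actually used for the PR: the function names are hypothetical, the sample text is made up, and the parser assumes the standard /proc/<pid>/mountinfo layout in which the fifth whitespace-separated field is the mount point (it does not decode octal escapes such as \040 in paths).

```python
def mounts_under(mountinfo_text, prefix):
    """Return mount points at or below prefix from mountinfo-formatted text."""
    prefix = prefix.rstrip("/")
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mount_point = fields[4]  # fifth field of mountinfo is the mount point
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            found.append(mount_point)
    return found


def missing_in_container(host_text, container_text, prefix):
    """Host mounts under prefix that do not appear in the container's view."""
    container = set(mounts_under(container_text, prefix))
    return [m for m in mounts_under(host_text, prefix) if m not in container]


# Made-up sample data shaped like /proc/<pid>/mountinfo.
HOST = """\
25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
40 25 0:35 / /domain0 rw shared:20 - zfs domain0 rw
41 40 0:36 / /domain0/norbind-test rw shared:21 - tmpfs tmpfs rw
"""
CONTAINER = """\
300 299 0:35 / /domain0 rw master:20 - zfs domain0 rw
"""

excluded = missing_in_container(HOST, CONTAINER, "/domain0")
```

In practice the two inputs would come from /proc/self/mountinfo on the host and from the mountinfo of the container's leader process; how that PID is obtained (for instance via machinectl) is left as an assumption here.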