
DLPX-97456 use non-recursive bind for /domain0 in upgrade container #871

Open
prakashsurya wants to merge 2 commits into develop from projects/norbind-domain0


prakashsurya commented Jun 10, 2026

Contributor

Problem

The upgrade container is created with Bind=/domain0 in its .nspawn file, and Bind= is recursive by default, so every host mount underneath /domain0 is inherited into the container's mount namespace. On a large engine that's tens of thousands of per-VDB mounts (ESCL-6013: ~14.4k), all tracked as mount units by the container's systemd.

Starting with Ubuntu 24.04 (systemd 255), daemon-reload is quadratic in the number of units, so each of the ~136 postinst-triggered reloads during the in-container package phase takes 80-636s; the verify spends ~6.4h of a ~6.8h job inside reloads. Further, while PID 1 is reloading it doesn't service D-Bus, so bus clients time out after 25s; this can fail the verify outright (e.g. systemctl enable in a postinst script), and blocks support from accessing the container (machinectl shell, systemd-run --machine).
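As a quick consistency check on the figures above, here is a minimal sketch of the arithmetic. The helper name is hypothetical, and the inputs are the approximate numbers from this description (~6.4h across ~136 reloads).

```shell
# Hypothetical helper: average seconds per daemon-reload, given the total
# time spent in reloads and the number of reloads (integer division).
avg_reload_seconds() {
    local total_seconds="$1" reloads="$2"
    echo $(( total_seconds / reloads ))
}

# ~6.4h expressed in seconds (64 * 3600 / 10 = 23040), spread over ~136 reloads.
avg_reload_seconds $(( 64 * 3600 / 10 )) 136   # prints 169
```

An average of roughly 169s per reload falls inside the observed 80-636s range.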

Solution

The solution taken in this PR is to use the norbind option on the /domain0 bind, so that only the top-level /domain0 filesystem is mounted in the container, and the per-VDB child mounts are excluded. The norbind option dates to systemd v233, so it's supported by the nspawn on the oldest hosts we upgrade from (Ubuntu 20.04 / systemd 245).
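For illustration only, the change amounts to rewriting the Bind= line for /domain0. The sketch below uses a hypothetical helper and a caller-supplied file path; how the .nspawn file is actually generated in this repository is not shown here, and GNU sed is assumed.

```shell
# Hypothetical helper: switch a recursive "Bind=/domain0" line in an .nspawn
# file to the non-recursive form. The file path is supplied by the caller.
set_norbind() {
    local nspawn_file="$1"
    # Only an exact "Bind=/domain0" line is rewritten; other Bind= lines
    # are left untouched. Assumes GNU sed for in-place editing (-i).
    sed -i 's|^Bind=/domain0$|Bind=/domain0:/domain0:norbind|' "$nspawn_file"
}
```

The resulting line, Bind=/domain0:/domain0:norbind, spells out source, destination, and mount options explicitly.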

Note the bind remains propagation-slave, so mounts created on the host after container start still propagate in; that's bounded by host churn during the verify window (tens of mounts, not 14k). It's not clear that's worth eliminating (e.g. by making the bind source private), so I've left it alone for now; we can always circle back if it proves to be a problem.

Also note, with this change, the per-VDB mountpoints appear as empty directories inside the container; whether the upgrade-verify JAR reads anything below those paths is being confirmed as part of DLPX-97456.

Testing Done

Pre-push (with upgrade testing enabled; BUILD_ARTIFACTS=ALL, RUN_TESTS=True, TEST_UPGRADE_FROM_VERSION=2026.4.0.0):
https://selfservice-jenkins.eng-tools-prd.aws.delphixcloud.com/job/appliance-build-orchestrator-pre-push/14212/

Additionally, validated manually on a dlpx-2025.5.0.2 VM with 14k filesystems mounted on the host under /domain0, creating/starting an upgrade container with these scripts, before and after this change:

                          Bind=/domain0 (before)     Bind=/domain0:/domain0:norbind (after)
mounts inside container:  14,051                     51
daemon-reload:            144.6s / 105.9s / 118.6s   0.45s / 0.28s / 0.27s
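A rough sketch of how mounts at or below a path could be counted from a mountinfo file. The helper name is hypothetical and this is not necessarily how the numbers above were gathered; note the table reports all mounts in the container, while this counts only those under a given prefix.

```shell
# Hypothetical helper: count mount entries at or below a mount point.
# Field 5 of each /proc/<pid>/mountinfo line is the mount point.
count_mounts_under() {
    local prefix="$1" mountinfo="${2:-/proc/self/mountinfo}"
    # Match the prefix itself or anything below "prefix/", so that a sibling
    # such as /domain0x is not counted.
    awk -v p="$prefix" \
        '$5 == p || index($5, p "/") == 1 { n++ } END { print n + 0 }' \
        "$mountinfo"
}
```

Run inside the container as `count_mounts_under /domain0`; the reload cost itself can be measured with `time systemctl daemon-reload`.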

/domain0 itself is still mounted and readable inside the container after the change. For comparison, a dlpx-24.0.0.0 VM (systemd 245, where the reload cost is linear) reloads in ~6.4s at the same mount count, which matches the behavior seen in the customer's container journal prior to the in-container systemd upgrade.

Also verified through the upgrade API on a fresh dlpx-release 2026.3.0.0 engine: applied this change to /var/dlpx-update/latest/upgrade-container, created two child mounts under /domain0 to stand in for per-VDB mounts (a tmpfs at /domain0/norbind-test holding a MARKER file, plus the existing /domain0/repave/metadata/snapshot-based zfs mount), then kicked off a DEFERRED upgrade to 2026.4.0.0 via POST .../system/version/<ref>/apply. The verify phase created the container with the norbind bind; comparing the host and container mount namespaces:

# host (/proc/1/mountinfo)              # upgrade container (/proc/<leader>/mountinfo)
/domain0                                /domain0
/domain0/repave/metadata/...            (repave + norbind-test children absent)
/domain0/norbind-test (tmpfs)

/domain0 is mounted and readable in the container, while neither host child mount appears in the container's namespace; the MARKER file on the tmpfs is not visible inside the container either, confirming the child mount is excluded rather than just empty. The verify completed successfully (status: VERIFIED). The DEFERRED apply was then blocked only by the unrelated upgrade.platform.minimum.memory check (12GB required, the small test VM had ~7GB), which is independent of this change.
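The host/container comparison can be sketched as a diff of mount points between two mountinfo files. The helper name is hypothetical, and this is an illustration rather than the exact commands used for the check above.

```shell
# Hypothetical helper: print mount points at or below a prefix that appear in
# the first mountinfo file (e.g. host) but not in the second (e.g. container).
mounts_only_in_first() {
    local prefix="$1" first="$2" second="$3"
    # The second file is read first to record its mount points (field 5);
    # lines from the first file are then printed only if not already seen.
    awk -v p="$prefix" '
        FILENAME == ARGV[1] { seen[$5] = 1; next }
        ($5 == p || index($5, p "/") == 1) && !($5 in seen) { print $5 }
    ' "$second" "$first" | sort -u
}
```

For example, `mounts_only_in_first /domain0 /proc/1/mountinfo /proc/<leader>/mountinfo` on the host, where the leader PID can likely be obtained with `machinectl show <name> --property=Leader --value`. An empty result would mean every host mount under /domain0 is also present in the container.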

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@prakashsurya prakashsurya enabled auto-merge (rebase) June 10, 2026 21:45

sebroy commented Jun 10, 2026

Contributor

I understand the problem, but I'm missing a bit of context.

Why bother bind-mounting domain0 into the container at all if we're not mounting any other filesystem inside domain0 (because afaik, there are actually no files that live in the root domain0 filesystem)? Does the upgrade code explicitly mount other things in there? Why couldn't it do that using a regular directory instead of a bind-mounted domain0?

@prakashsurya

Contributor Author

@sebroy you're likely correct.. quick test, in a container, without that bind mount:

# /opt/delphix/server/bin/dx_manage_pg clone -s MDS-test
Cloning dataset...
done.

# mount | grep domain0
domain0/MDS-test on /domain0/MDS-test type zfs (rw,nosuid,nodev,noexec,relatime,xattr,noacl,casesensitive)

I'm not certain what creates the /domain0 to begin with, but the container had it already when it was created.. so, this probably should work.. maybe the bind mount was never necessary to begin with..

The previous commit switched the "/domain0" bind to "norbind" to keep
the tens of thousands of per-VDB child mounts from being inherited into
the upgrade container, where they made systemd's "daemon-reload"
(quadratic on Ubuntu 24.04 / systemd 255) take hours during the
in-container package phase.

As Seb pointed out on the PR, the bind isn't necessary at all. The root
"/domain0" filesystem holds no files of its own; the data the
"upgrade-verify" JAR reads (e.g. the MDS database) lives in child
datasets that are cloned and mounted into "/domain0" within the
container's own mount namespace at verify time. Since zfs is available
in the container (via the "/dev/zfs" bind), those mounts land in the
container without "/domain0" needing to be exposed from the host.

So drop the "/domain0" bind entirely, which avoids both the per-VDB
mounts and the cost of the bind itself.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prakashsurya

Contributor Author

oh, but it's failing to start:

root@ip-10-110-232-69:~# /opt/delphix/server/bin/dx_manage_pg start -s MDS-test
Starting postgres on 54441...
renaming /domain0/MDS-test/db/postmaster.pid to /domain0/MDS-test/db/postmaster.pid.original
Job for delphix-postgres@MDS-test.service failed because the control process exited with error code.
See "systemctl status delphix-postgres@MDS-test.service" and "journalctl -xeu delphix-postgres@MDS-test.service" for details.

dx_manage_pg: failed to start delphix-postgres@MDS-test
Jun 10 22:32:02 ip-10-110-232-69 svc-postgres[3406]: Configuring pg data /domain0/MDS-test/db true
Jun 10 22:32:02 ip-10-110-232-69 svc-postgres[3410]: cat: /domain0/MDS-test/db/PG_VERSION: Permission denied
Jun 10 22:32:02 ip-10-110-232-69 svc-postgres[3408]: dx_pg_upgrade: could not read /domain0/MDS-test/db/PG_VERSION
Jun 10 22:32:02 ip-10-110-232-69 svc-postgres[3406]: dx_pg_pre_start: failed to upgrade postgres

I can read that file manually, though:

# cat /domain0/MDS-test/db/PG_VERSION
14

hm..

@prakashsurya

Contributor Author

looks like without the bind mount, the permissions on /domain0 in the container aren't right:

# ls -l / | grep domain0
drwxr-x---    3 root root   3 Jun 10 22:47 domain0

# sudo -u postgres cat /domain0/MDS-tests/db/PG_VERSION
cat: /domain0/MDS-tests/db/PG_VERSION: Permission denied

we could probably solve this with a chmod somewhere..
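A small sketch of the check behind that observation, assuming the postgres user is neither the owner of /domain0 nor a member of its group, so the "other" permission bits apply. The helper name is hypothetical and GNU stat is assumed.

```shell
# Hypothetical helper: succeed if users outside the owner and group can
# traverse a directory, i.e. the "other" execute (search) bit is set.
others_can_traverse() {
    local mode
    # GNU stat prints the octal permission bits, e.g. 750 or 755.
    mode=$(stat -c '%a' "$1") || return 2
    # Interpret the mode as an octal constant and test the lowest bit,
    # which is the "other" execute bit.
    [ $(( 0$mode & 1 )) -eq 1 ]
}
```

With the drwxr-x--- (750) mode shown above, `others_can_traverse /domain0` fails, which matches the Permission denied seen by postgres; `namei -l /domain0/MDS-test/db/PG_VERSION` would show the permissions of each path component. Which mode the directory should have in the container (for instance, matching the host's /domain0) is still an open question here.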
