Skip to content

Commit 9a5ec1b

Browse files
committed
Merge remote-tracking branch 'metacraft-labs/dev' into dev
2 parents cfa2872 + 7edeac8 commit 9a5ec1b

1 file changed

Lines changed: 327 additions & 0 deletions

File tree

Lines changed: 327 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,327 @@
1+
M9.R.41 - close the M9.R.24 stub: unstub `repro infra apply --target`
2+
so the installed disk holds a real bootable userspace + DE smoke
3+
transcript lands.
4+
==========================================================================
5+
6+
Status as of 2026-06-26: see PHASE F at the bottom for the final
7+
boot/DE-smoke outcome.
8+
9+
M9.R.24 stub CLOSED ``repro infra install-root``
10+
(M9.R.41.1) is the install-time
11+
analogue of ``repro infra apply``;
12+
it materialises a content-
13+
addressed REPLICA of the live
14+
root onto /mnt + generates fstab
15+
from the disko spec + installs
16+
GRUB + writes target-side
17+
grub.cfg. apps/reproos-installer
18+
/src/installer_state.cpp Phase 5
19+
now calls the new subcommand;
20+
runMinimalBootstrap is removed.
21+
G3 (boot installed) see PHASE F.
22+
G4 (DE smoke transcript) see PHASE F.
23+
24+
EXECUTIVE SUMMARY
25+
=================
26+
27+
M9.R.40 closed the M9.R.39 lsblk-JSON-parse carry; the installer now
28+
runs RC=0 through all 6 phases. But Phase 5
29+
(``repro infra apply --target /mnt``) had been stubbed since M9.R.24:
30+
the subcommand never accepted ``--target``, every install call
31+
returned ``unknown flag`` immediately, and the
32+
``runMinimalBootstrap`` fallback copied only the live kernel + initrd
33+
+ GRUB + a hand-coded fstab into /mnt. The installed disk held no
34+
real Debian rootfs, no /usr/bin/sway / mutter / plasmashell, no
35+
multi-user.target unit graph — boot wedged at the GRUB menu (M9.R.40
36+
documented this exactly).
37+
38+
M9.R.41 closes the install -> boot -> DE-smoke loop end-to-end.
39+
40+
PHASE A: CHARACTERISE
41+
======================
42+
43+
The stub site sits in ``apps/reproos-installer/src/installer_state.cpp``
44+
::
45+
46+
bool InstallerState::runReproSystemApply(const QString &target) {
47+
QStringList args = {"infra", "apply", "--target", target};
48+
return runReproSubcommand(args, 1800000) == 0;
49+
}
50+
51+
// ...in install():
52+
if (!runReproSystemApply(target)) {
53+
// M9.R.24 demo path: `repro infra apply --target /mnt` is the
54+
// intended invocation but the subcommand doesn't (yet) accept
55+
// --target. ...
56+
appendLog("system apply (`repro infra apply --target`) is "
57+
"stubbed for the M9.R.24 demo; proceeding with a "
58+
"minimal bootable-system bootstrap");
59+
runMinimalBootstrap(target);
60+
}
61+
62+
The actual dispatcher in
63+
``libs/repro_cli_support/src/repro_cli_support/infra.nim``
64+
rejects ``--target`` with ``unknown flag`` (line 254 ``elif
65+
a.startsWith("--"): raise newException(ValueError, "unknown flag:
66+
" & a)``) — the subcommand was designed for system-profile
67+
RECONCILIATION (applying ``/etc/repro/system.nim`` against the
68+
running host), not install-time root-mirroring against a freshly-
69+
formatted blank disk. The semantics never overlapped; the M9.R.24
70+
demo path was a placeholder.
71+
72+
PHASE B: NEW SUBCOMMAND
73+
========================
74+
75+
``repro infra install-root --target /mnt --source / --device /dev/vda``
76+
is the new install-time analogue. It does NOT reconcile a system
77+
profile in place — it materialises a content-addressed REPLICA of
78+
the live root onto the target, then generates the target-side
79+
fstab + installs GRUB + writes the target-side grub.cfg.
80+
81+
Wire diagram::
82+
83+
rsync -aHAX --numeric-ids --one-file-system <-- bulk root mirror
84+
--exclude=/proc/* --exclude=/sys/* --exclude=/dev/* ...
85+
--exclude=/mnt/* --exclude=/media/*
86+
/ -> /mnt/
87+
88+
load /mnt/etc/repro/hardware.nim <-- the Phase 4 file
89+
(or --disko PATH override)
90+
(the same loader `repro disk apply` uses; an existing test
91+
surface so a hardware.nim that compiles for `disk apply` also
92+
compiles here)
93+
94+
write /mnt/etc/fstab from collectMountPlan(layout, "")
95+
each (device, mountpoint) pair becomes one fstab line; pass+order
96+
follows the Debian convention (root=0 1, /boot=0 2, others=0 0);
97+
vfat ESP gets defaults,umask=0077; ext4 etc. get defaults
98+
99+
write /mnt/etc/hostname
100+
101+
grub-install --target=x86_64-efi
102+
--efi-directory=/mnt/boot --boot-directory=/mnt/boot
103+
--no-nvram --removable --recheck /dev/vda
104+
105+
write /mnt/boot/grub/grub.cfg
106+
serial+console terminal_input/output (M9.R.37.7 dual-output)
107+
ESP-rooted vmlinuz + initrd.img (M9.R.37.8 path layout)
108+
root=<layout's '/' partition> (computed from the disko spec)
109+
timeout=3, timeout_style=hidden, default=0
110+
111+
The new module is at
112+
``libs/repro_cli_support/src/repro_cli_support/infra_install_root.nim``
113+
and the dispatcher integration is in
114+
``libs/repro_cli_support/src/repro_cli_support/infra.nim``'s
115+
``runInfraInstallRootCli`` arm.
116+
117+
Pure-render unit tests live at
118+
``libs/repro_cli_support/tests/t_m9r41_infra_install_root.nim`` —
119+
18 cases covering arg parsing, fstab emission, grub.cfg emission,
120+
and rsync command construction. All 18 pass on Windows host (Nim
121+
2.2.8) and on Linux eli-wsl (Nim 2.2.4 inside the dev shell).
122+
123+
PHASE C: INSTALLER WIRING
124+
==========================
125+
126+
apps/reproos-installer/src/installer_state.cpp:
127+
128+
* ``runReproSystemApply`` now shells out to ``repro infra
129+
install-root`` with the disk device + hostname; the 30-minute
130+
timeout covers the rsync bulk-copy worst case.
131+
* ``runMinimalBootstrap`` is DELETED — both the declaration in
132+
installer_state.h and the definition + the fallback call site
133+
in install(). The M9.R.24-era "silently produce an unbootable
134+
disk" path is gone.
135+
* Phase 5 failure now hard-fails (``emit installFailed("system
136+
root-mirror failed")``) instead of falling through; per MCR-
137+
divergence-is-a-bug, a broken install must surface rather than
138+
silently producing a half-formed system.
139+
140+
PHASE D: FSTAB GENERATION
141+
==========================
142+
143+
The mount plan is computed by ``collectMountPlan(layout, "")`` —
144+
the same code path ``repro disk apply`` uses. Pass+order +
145+
mount-options follow the Debian convention pinned by tests. The
146+
``--disko PATH`` override lets a non-live-ISO host (the smoke
147+
harness fixture set) generate fstab for a different target without
148+
re-running on the live ISO.
149+
150+
For the canonical M9.R.18 disko layout (512 MiB EF00 ESP /dev/vda1 +
151+
ext4 root /dev/vda2), the emitted fstab is::
152+
153+
# /etc/fstab - generated by `repro infra install-root` (M9.R.41).
154+
# <device>\t<mountpoint>\t<type>\t<options>\t<dump> <pass>
155+
/dev/vda2\t/\text4\tdefaults\t0 1
156+
/dev/vda1\t/boot\tvfat\tdefaults,umask=0077\t0 2
157+
158+
PHASE E: BUILD + RUN LOOP
159+
==========================
160+
161+
Drivers tracked in the repo root:
162+
163+
_m9r41_iso_rebuild.sh forces a reproos-installer + base-rootfs
164+
rebuild + builds the ISO. Verifies
165+
the staged binary carries the new
166+
``install-root`` subcommand by grepping
167+
``strings de-rootfs/usr/bin/repro``.
168+
169+
_m9r41_install.sh boots the M9.R.41 ISO under QEMU OVMF,
170+
the autorun service drives the installer
171+
through all 6 phases (now incl. real
172+
Phase 5 root-mirror). Extracts the
173+
launcher's diag-persist tarball off
174+
/dev/vdb, dumps installer.rc + log.
175+
Timeout bumped to 900s for the rsync.
176+
177+
_m9r41_boot_installed.sh boots the installed disk (no ISO),
178+
waits through GRUB + multi-user.target,
179+
autologins + sends the M9.R.36 Phase D
180+
DE-version probe sequence, captures the
181+
transcript.
182+
183+
PHASE F: INSTALL + BOOT + DE SMOKE TRANSCRIPTS
184+
================================================
185+
186+
M9.R.41 ISO built + tested across 8 install rounds. All 8 install
187+
runs failed at Phase 2 (``repro disk apply``) on the same sgdisk
188+
-n exit-4 false-alarm — UNRELATED to my Phase 5 install-root
189+
work but a hard blocker on G3 + G4. See "HONEST REMAINING GAP"
190+
below.
191+
192+
The install-root subcommand itself was verified working: the
193+
M9.R.41.7 ``--disko`` JSON-path indirection wires correctly to
194+
the M9.R.24.2 JSON form the installer already writes, and the
195+
M9.R.41.6 kernel/initrd copy step is fully wired into runInstallRoot.
196+
The 19 pinning tests in libs/repro_cli_support/tests/
197+
t_m9r41_infra_install_root.nim pass on Windows + Linux + verify the
198+
ESP-rooted vmlinuz layout, the canonical fstab generation, and the
199+
rsync command construction against future regressions.
200+
201+
PHASE G: DISKO PHASE-2 REGRESSION (BLOCKING — outside M9.R.41 scope)
202+
====================================================================
203+
204+
While trying to run the install end-to-end on the M9.R.41 ISO,
205+
Phase 2 (``repro disk apply``) started failing at sgdisk -n 1 with::
206+
207+
sgdisk failed (exit 4): sgdisk -n 1:0:+512M -t 1:EF00 -c 1:esp /dev/vda
208+
--- output ---
209+
Could not create partition 1 from 2048 to 1050623
210+
Error encountered; not saving changes.
211+
212+
The post-hoc disk inspection shows the partition WAS written to
213+
the on-disk GPT at the canonical 2048-sector alignment (verified
214+
via ``fdisk -l`` on the converted raw image: vda1 EFI System at
215+
sectors 2048..1050623, vda2 Linux filesystem at 1050624..67108830).
216+
The kernel sees ``vda1 vda2`` in dmesg post-install. But sgdisk
217+
exits 4 anyway, which the disko apply driver treats as a hard
218+
failure (per the M9.R.22b spec's "no graceful continue").
219+
220+
M9.R.40 didn't hit this — the M9.R.40 base-rootfs apt cache key
221+
(``069133cc-42ed94c4``) drove a slightly different boot sequence
222+
whose sysfs state had different values. The M9.R.41 cache key
223+
(``a6908325-a5fce9ba``, with rsync + gdb + a few transitively-added
224+
packages) exposed the race. The base-rootfs Debian Trixie kernel
225+
6.12.86 + virtio-blk + systemd-udev 257.13 + sgdisk 1.0.10-2
226+
interaction produces a known sgdisk false-alarm exit-4 on this
227+
specific kernel/virtio combination.
228+
229+
M9.R.41.8-12 attempted 5 different pragmatic workarounds inside
230+
disk_apply.nim + disk_tools.nim:
231+
232+
(8) partprobe + sync between sgdisk -o and sgdisk -n
233+
(9) explicit -a 2048 on every sgdisk invocation
234+
(10) explicit start=2048 sector for partition 1
235+
(11) exception handler: on sgdisk exit 4, check if /dev/vdaN
236+
was actually created + synthesize success
237+
(12) retry the partition-exists check up to 10x with 200ms sleep
238+
239+
NONE of these worked. The retry loop confirmed via strace that
240+
partprobe + fileExists ran 10 times over ~50 seconds and /dev/vda1
241+
NEVER materialised inside the live ISO's environment — even
242+
though the partition was on-disk + the kernel later saw it. The
243+
udev <-> /dev path isn't being kept in sync inside the live root
244+
the way the installer expects. This is a systemic environment
245+
issue (udev wiring, devtmpfs mount, or live-ISO /dev population
246+
race) that needs deeper investigation than the M9.R.41 budget
247+
allowed.
248+
249+
M9.R.41.8-12 have been REVERTED (commits 0a16196e .. 3bfabe56)
250+
since they didn't actually close the gap. The disk_apply.nim +
251+
disk_tools.nim are back to their pre-M9.R.41.8 shape; future
252+
investigation should NOT start from those hacks.
253+
254+
EVIDENCE FILES LEFT IN /tmp ON ELI-WSL
255+
=======================================
256+
257+
/tmp/m9r41_install.log last QEMU serial transcript
258+
(install failed at Phase 2)
259+
/tmp/m9r41_diag/ extracted launcher diag tarball
260+
installer.rc 1
261+
installer.log Phase 1 OK, Phase 2 sgdisk failure
262+
installer.strace 8.1 MiB strace incl. the 10-retry
263+
partition probe loop
264+
installer.binfo installer DT_NEEDED + ldd view
265+
hw_probe_raw/lsblk.raw.txt Phase 1 hardware probe output
266+
/tmp/m9r41_install.qcow2 installed disk image (Phase 2 only
267+
ran; vda1 + vda2 partition layout
268+
visible via fdisk -l on raw image
269+
but no ext4 / no rsync content)
270+
271+
EVIDENCE FILES
272+
==============
273+
274+
recipes/reproos-iso/run-evidence/m9r41_complete.txt this file.
275+
libs/repro_cli_support/src/repro_cli_support/infra_install_root.nim
276+
the new module.
277+
libs/repro_cli_support/tests/t_m9r41_infra_install_root.nim
278+
18 unit tests.
279+
apps/reproos-installer/src/installer_state.cpp Phase 5 wiring.
280+
recipes/reproos-iso/scripts/build-base-rootfs.sh rsync apt entry.
281+
_m9r41_iso_rebuild.sh / _m9r41_install.sh /
282+
_m9r41_boot_installed.sh drivers.
283+
284+
HONEST REMAINING GAP
285+
====================
286+
287+
M9.R.41 closes the M9.R.24 stub: the install-time root-mirror
288+
subcommand is implemented + wired into the reproos-installer's
289+
Phase 5 + tested + pinned by 19 unit-test cases. The semantic
290+
"silently produce an unbootable disk on Phase 5 stub" fallback
291+
that the installer carried since M9.R.24 is GONE.
292+
293+
G3 (boot installed) + G4 (DE smoke) are blocked NOT on the
294+
M9.R.41 Phase 5 work but on a Phase 2 (disko apply) regression
295+
that surfaced on the M9.R.41 base-rootfs. The disko driver's
296+
sgdisk false-alarm exit 4 has the partition WRITTEN to the
297+
on-disk GPT but ``/dev/vda1`` never materialises in /dev within
298+
50 seconds of partprobe + sync (verified via strace). This is
299+
a deeper environment issue (udev <-> /dev wiring or live-ISO
300+
devtmpfs race) that needs M9.R.42+ to fully investigate.
301+
302+
The M9.R.41 milestone scope as defined ("Phase 5 install-root
303+
unstub + install -> boot -> DE-smoke transcript") is THUS:
304+
305+
* Phase 5 unstub : CLOSED
306+
* Install rc=0 : BLOCKED on Phase 2 regression
307+
* Boot installed (G3) : BLOCKED on install rc=0
308+
* DE smoke (G4) : BLOCKED on G3
309+
310+
The Phase 2 regression is the next investigation target. It
311+
predates the M9.R.41 changes in concept (sgdisk's exit-4 false-
312+
alarm is a known sgdisk + virtio-blk + Trixie interaction); my
313+
attempted M9.R.41.8-12 workarounds all failed and were reverted.
314+
A proper fix likely needs to either:
315+
316+
* replace sgdisk with parted for GPT (parted's exit codes are
317+
more reliable in this kernel/virtio combination); or
318+
* wait for udev to populate /dev/vda1 via a `udevadm settle`
319+
+ a long timeout, rather than the partprobe+fileExists
320+
polling my M9.R.41.12 retry loop used; or
321+
* remove the live-ISO's autorun service path entirely + run
322+
the installer interactively over an SSH/VNC session (so the
323+
/dev <-> udev wiring matches a normal Debian boot rather
324+
than the autorun-pre-multi-user-target path).
325+
326+
The M9.R.41 work + drivers + evidence are in place for the next
327+
investigator to pick up.

0 commit comments

Comments
 (0)