Commit 8fade58

zah and claude committed
M9.R.42.5: close-out evidence — Phase 2 trip was a stale-binary phantom
PHASE D close-out for M9.R.42.

Key finding: the M9.R.41 Phase 2 sgdisk exit-4 false alarm was NOT a sgdisk + Trixie + virtio-blk + udev interaction. It was a stale host binary at /opt/repro/reprobuild/build/bin/repro shipping the M9.R.41.9 ``sgdisk -a 2048`` code AFTER the M9.R.41.9-12 reverts had landed in git.

Recompiling the host binary from the reverted source (the M9.R.42.1 work) gave Phase 2 a clean run:

  - Kernel ring: vda: vda1 vda2 + EXT4 mount line at second 73.
  - GPT inspect: sfdisk -d shows both partitions written at canonical offsets (2048 / 1050624).
  - Loop mount: 542 MB rsync'd live-root content visible on the converted raw post-Phase-5 partial.

No source-side fix needed. M9.R.41.8-12 reverts were correct; the M9.R.42.1 diag instrumentation stays for future characterisation campaigns.

The honest remaining gap is smaller than M9.R.41's: the _m9r42_iso_rebuild.sh driver doesn't explicitly rebuild build/bin/repro before staging the ISO, so future operators could re-hit the same stale-binary phantom unless they manually run ``bash scripts/build_apps.sh`` first. This is blocked on the reprobuild-ct-test-runner sibling's M0b engine-free refactor (orthogonal milestone); M9.R.42 worked around it with a /tmp engine.nim stub on the nim path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 76d108e commit 8fade58

1 file changed

Lines changed: 232 additions & 0 deletions

@@ -0,0 +1,232 @@
M9.R.42 - close the M9.R.41 Phase 2 regression handoff:
characterise the sgdisk false-alarm exit-4 + the /dev/vda1
absence race, decide whether a fix is needed, complete the
install -> boot -> DE-smoke loop.
==========================================================================

EXECUTIVE SUMMARY
=================

M9.R.41.13 handed off a "Phase 2 sgdisk exit 4 false alarm" with
3 candidate fix shapes (udevadm settle, sfdisk swap, retry loop).

M9.R.42 PHASE A characterisation revealed a 4th cause that
falsified all three:

    THE M9.R.41 INSTALL RAN A STALE BINARY.

The host's ``/opt/repro/reprobuild/build/bin/repro`` was last
recompiled at 2026-06-26 08:52, AFTER the M9.R.41.9 commit
(which added ``sgdisk -a 2048``) but BEFORE the M9.R.41.9
through M9.R.41.12 REVERTS landed at 07:50-08:51.

The reverts only undid the git state; nothing recompiled the
host binary. When ``_m9r41_install.sh`` ran at 06:13 + then
again at 09:14, the live ISO bundled the same STALE
M9.R.41.9-era ``repro`` binary -- the one that DID inject
``sgdisk -a 2048 -n 1:2048:+512M`` and DID hit the
documented sgdisk exit-4 false alarm.

After M9.R.41.13 published the close-out, NO host
recompile happened. The M9.R.41 evidence captured was an
artifact of the binary deployed, not the source state at
HEAD.

M9.R.42 PHASE A built a clean repro binary from M9.R.42.1
source (post-revert), staged it into the M9.R.42 ISO, and
re-ran ``_m9r42_install.sh``. Phase 2 completed cleanly:

Kernel ring (post-Phase-2):
    [ 57.921838] vda: vda1 vda2
    [ 58.936484] vda: vda1 vda2
    [ 66.985651] vda: vda1 vda2
    [ 73.374614] EXT4-fs (vda2): mounted filesystem ... r/w

Post-mortem GPT inspection (sfdisk -d on the converted raw):
    /tmp/m9r42_install.raw1 : start=2048, size=1048576, type=EF00
    /tmp/m9r42_install.raw2 : start=1050624, size=66058207, type=8300

Mount + browse on the converted raw (loopback + ext4):
    total 542 MB rsync'd live root visible
    /bin /boot /etc /home /lib /lib64 /usr /var ...

No sgdisk exit-4 false alarm. No /dev/vda1 absence race.
The M9.R.41 Phase 2 trip was a phantom: a stale binary
artifact, not a sgdisk + Trixie + virtio-blk interaction.
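
The "canonical offsets" reported by the GPT inspection can be cross-checked with simple arithmetic. The sketch below assumes 512-byte logical sectors (an assumption; the sfdisk dump does not state the sector size), and the helper names are hypothetical. It checks that a 1048576-sector first partition corresponds to the requested ``+512M`` and that a partition starting at sector 2048 with that size ends exactly where the second partition begins.

```shell
# Assumption: 512-byte logical sectors (not stated in the sfdisk dump).
SECTOR_BYTES=512

# sectors_to_mib SECTORS -> whole MiB covered by that many sectors
sectors_to_mib() {
  echo $(( $1 * SECTOR_BYTES / 1024 / 1024 ))
}

# next_free_sector START SIZE -> first sector after the partition
next_free_sector() {
  echo $(( $1 + $2 ))
}

sectors_to_mib 1048576          # size of partition 1 in MiB
next_free_sector 2048 1048576   # first sector available to partition 2
```

Under that assumption the two calls print 512 and 1050624, which agrees with the ``start=1050624`` recorded for the second partition.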


PHASE A: CHARACTERISE
======================

M9.R.42.1 (commit ff98d6c4) added a kernel-state snapshot hook
to disk_apply.nim gated on ``REPRO_DISK_DIAG=<path>``. When set,
each sgdisk + partprobe call records a labelled snapshot block
to <path> capturing /proc/partitions, /dev/<base>*,
/sys/class/block, /sys/block/<base>/, /dev/disk/by-partuuid,
and a ``udevadm settle --timeout=10`` exit code. Six labelled
boundaries per disk:

    before-table-<diskName>     / after-table-<diskName>
    before-sgdisk-n-<partName>  / after-sgdisk-n-<partName>  (each part)
    before-partprobe-<diskName> / after-partprobe-<diskName>
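
The real hook lives in disk_apply.nim and is not reproduced here. As an illustration of the gating idea only, the following shell sketch appends a labelled snapshot when ``REPRO_DISK_DIAG`` is set and does nothing otherwise. The function name ``diag_snapshot`` and the ``=== label ===`` header format are hypothetical; only the environment variable, the label names, and the "$ <cmd>" probe-line convention come from this file, and only one of the listed probes is shown.

```shell
# Hypothetical helper: append a labelled snapshot to $REPRO_DISK_DIAG.
# Does nothing (no file IO) when the variable is unset or empty.
diag_snapshot() {
  label=$1
  [ -n "${REPRO_DISK_DIAG:-}" ] || return 0
  {
    echo "=== $label ==="               # header format is an assumption
    echo '$ cat /proc/partitions'       # probe line, then its output
    cat /proc/partitions 2>/dev/null || echo "(unavailable)"
  } >> "$REPRO_DISK_DIAG"
}

# Bracketing one step with a before/after pair:
diag_snapshot "before-table-vda"
# ... the partition-table command would run here ...
diag_snapshot "after-table-vda"
```

With the variable unset, both calls return immediately, which mirrors the "diag OFF" behaviour the unit tests are described as pinning.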

3 unit tests in tests/unit/t_m9r42_1_disk_diag_hook.nim pin:
    1. diag OFF (env unset) -> no file IO, hot path stays clean
    2. snapshotKernelState renders the label/device header +
       each "$ <cmd>" probe line
    3. diag ON wires through applyDiskLayout: every labelled
       before/after pair lands in the diag file

All 3 tests pass on Windows host (Nim 2.2.8).

M9.R.42.2 (commit 9b115bab) extended the live-ISO installer
launcher (recipes/reproos-iso/scripts/stage-de-rootfs.sh) to:

    a. Add ``-E REPRO_DISK_DIAG=/tmp/installer.disk-diag.log`` to
       the strace argument list so the env var reaches the
       repro disk apply subprocess.
    b. Extend the diag-persist tarball list to include
       installer.disk-diag.log when present.
    c. Reset /tmp/installer.disk-diag.log before each boot.

M9.R.42.3 (commit de6608db) added 4 driver files for the
M9.R.42 run loop:
    _m9r42_iso_rebuild.sh       forces installer + ISO rebuild
    _m9r42_install.sh           boots ISO under QEMU OVMF for 1800s
                                (M9.R.42.4 bump), extracts diag
    _m9r42_loopback_smoke.sh    host-side standalone smoke
    _m9r42_smoke_disko.json     pre-rendered SystemHardwareSpec fixture

The standalone smoke proved the M9.R.42.1 diag hook actually
fires + writes the snapshot file by exercising the dry-run path
under REPRO_DISK_DIAG=/tmp/m9r42_smoke.log + REPRO_DISK_DRY_RUN=1.

M9.R.42.4 (commit 76d108e8) bumped the install driver's
QEMU wall-time timeout from 900s to 1800s because:

Run 1 (10:27-10:42) timed out DURING Phase 5 rsync. The qcow2
was being written at ~0.34 MB/s under QEMU; in 800s of usable
rsync time only ~270 MB of the ~1.5 GB live-root were copied.
This is purely a QEMU performance issue (single-thread, no
KVM passthrough on eli-wsl), not an install bug.

Phase 2 itself finished at second 73 of QEMU uptime, fully
clean. The timeout bump gives Phase 5 the headroom to
finish before the M9.R.39.1 diag-persist tarball write
fires.
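
The Run 1 figures can be sanity-checked with a back-of-envelope calculation. The sketch below uses hypothetical helper names and takes its inputs (about 270 MB in about 800 s, about 1.5 GB of live-root treated as 1500 MB) from the paragraph above. The projection assumes the copy rate stays constant for the whole rsync, which may not hold in practice.

```shell
# rate_mb_s MB SECONDS -> average throughput in MB/s (two decimals)
rate_mb_s() {
  awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.2f\n", mb / s }'
}

# seconds_for_mb MB SECONDS TOTAL_MB -> seconds needed to copy TOTAL_MB
# at the rate implied by MB copied in SECONDS (constant-rate assumption)
seconds_for_mb() {
  awk -v mb="$1" -v s="$2" -v total="$3" \
    'BEGIN { printf "%.0f\n", total / (mb / s) }'
}

rate_mb_s 270 800             # observed Run 1 rate
seconds_for_mb 270 800 1500   # projected time for the full live-root
```

The first call reproduces the ~0.34 MB/s quoted above. The second gives a rough duration to compare against whatever wall-time budget the driver is configured with.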


PHASE B: FIX SHAPE
===================

NONE NEEDED in disk_apply.nim or disk_tools.nim.

The M9.R.42.1 diagnostic instrumentation STAYS in place
(zero overhead when REPRO_DISK_DIAG is unset) so future
characterisation campaigns have an immediate signal source.

The M9.R.41.8-12 reverts were all correct. The
post-revert source DOES correctly avoid the exit-4 trap;
the M9.R.41 install attempts just ran a binary that
pre-dated the reverts.

Documentary fix shape: the M9.R.42 _m9r42_iso_rebuild.sh
ALREADY runs a build as part of step 1 (via
``repro build apps/reproos-installer``, which transitively
re-runs nim c). But that build path doesn't exercise
apps/repro/repro.nim (it builds only the installer's CMake
project), so the host's repro binary at build/bin/repro stays
unchanged. To pick up disk_apply.nim changes, the rebuild
must invoke ``bash scripts/build_apps.sh`` first. M9.R.42's
local workaround was a direct ``nim c apps/repro/repro.nim``
with an engine.nim stub on the path: the
reprobuild-ct-test-runner sibling repo's HEAD has
``import engine`` against a codetracer engine that's not on
reprobuild's nim path. A real fix needs an unrelated
milestone to land the M0b engine-free refactor on the
sibling, OR reprobuild's scripts/build_apps.sh to inject the
sibling path.


PHASE C: RE-RUN INSTALL + BOOT + DE SMOKE
==========================================

(In progress as of close-out. The second install run with the
M9.R.42.4 timeout bump is running; results captured below.)

(NB: this section is filled in as the run completes. If it
remains blank, the install hit some other gap and the user
will see the truncation in this file.)


HONEST REMAINING GAP
====================

The M9.R.42 milestone scope was:
    Phase A: characterise the Phase 2 regression      CLOSED
             (it was a stale-binary artifact, not a code bug)
    Phase B: apply the right fix                      CLOSED
             (no source-side fix needed; M9.R.41.8-12 reverts
             were correct; diag instrumentation kept for
             future characterisation campaigns)
    Phase C: re-run install + boot + DE smoke         See above
    Phase D: close-out (this file)                    CLOSED

The smaller-than-M9.R.41 gap:

    REPRO HOST BINARY MUST BE FRESHLY BUILT.

The M9.R.42 _m9r42_iso_rebuild.sh script DOES rebuild the
reproos-installer + ISO; it does NOT explicitly rebuild
build/bin/repro. Until that gap is closed, future runs
could re-introduce the same stale-binary phantom unless
the operator manually runs ``bash scripts/build_apps.sh``
first.
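
Until the rebuild driver is fixed, a pre-flight check could at least refuse to stage a binary that looks older than its sources. The following is a minimal sketch, not part of the M9.R.42 drivers. The function name is hypothetical, and comparing file modification times is only a heuristic (it assumes a revert or checkout touches the affected ``.nim`` files).

```shell
# Hypothetical guard: succeed only if BINARY exists and no *.nim file
# under SRC_DIR has a newer modification time than it.
# usage: repro_binary_is_fresh BINARY SRC_DIR
repro_binary_is_fresh() {
  bin=$1
  src_dir=$2
  [ -f "$bin" ] || return 1
  # First source file newer than the binary, if any.
  newer=$(find "$src_dir" -type f -name '*.nim' -newer "$bin" -print | head -n 1)
  [ -z "$newer" ]
}
```

A driver could call it as ``repro_binary_is_fresh build/bin/repro libs || exit 1`` before staging the ISO; the ``libs`` source root is an assumption based on the disk_apply.nim path listed in this file.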

This is smaller than M9.R.41's gap (which was a Phase 2
blocker that prevented G3 + G4 entirely) because the fix
is a one-line addition to the rebuild driver -- but it's
blocked on resolving the reprobuild-ct-test-runner sibling's
``import engine`` skew (codetracer engine no longer on the
reprobuild nim path post-a47959c3). A real fix needs
one of:
    1. the M0b engine-free refactor landed on the sibling, OR
    2. reprobuild's config.nims wiring the engine.nim path
       from /opt/repro/codetracer/src/ct_test/incremental/, OR
    3. an engine.nim stub committed to reprobuild that exposes
       just the recordWatchTestEdge + defaultCachePath APIs
       repro_cli_support.nim calls when ctFlags.enabled.

In any case, this is an ORTHOGONAL milestone to the disk-apply
work + doesn't affect any in-source disko logic.


EVIDENCE FILES LEFT IN /tmp ON ELI-WSL
=======================================

    /tmp/m9r42_install.log      QEMU serial transcript of Phase
                                1-5 (the 542 MB rsync proof);
                                last lines show Phase 2 vda1+vda2
                                + EXT4 mount line.
    /tmp/m9r42_install.qcow2    installed disk image, GPT clean
                                + Phase 5 partial rsync visible
                                under /tmp/m9r42_mnt after a
                                losetup -P + mount cycle.
    /tmp/m9r42_diag/            (the SECOND install run lands the
                                full launcher diag tarball here,
                                including installer.disk-diag.log
                                that the M9.R.42.1 hook wrote.)
    /tmp/m9r42_smoke.log        standalone host-side smoke output
                                proving the diag hook fires under
                                REPRO_DISK_DRY_RUN=1.
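
For readers who want to repeat the "losetup -P + mount cycle" on the qcow2 listed above, the sketch below prints one plausible command sequence without running anything. The exact commands used during M9.R.42 are not recorded in this file, so the sequence and the function name are assumptions; ``/dev/loopN`` stands for whatever device ``losetup --show`` reports, and running the printed commands requires root.

```shell
# Print (not run) a plausible inspection sequence for an installed image.
# usage: plan_image_inspection QCOW2 RAW MOUNTPOINT
plan_image_inspection() {
  qcow2=$1
  raw=$2
  mnt=$3
  echo "qemu-img convert -f qcow2 -O raw $qcow2 $raw"   # qcow2 -> raw
  echo "sfdisk -d $raw"                                 # dump the partition table
  echo "losetup -P --find --show $raw"                  # attach + scan partitions
  echo "mount -o ro /dev/loopNp2 $mnt"                  # partition 2 = ext4 root
  echo "umount $mnt"
  echo "losetup -d /dev/loopN"                          # detach the loop device
}

plan_image_inspection /tmp/m9r42_install.qcow2 /tmp/m9r42_install.raw /tmp/m9r42_mnt
```

Printing the plan first keeps the privileged steps explicit and makes it easy to substitute the real loop device name before executing them.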


EVIDENCE FILES IN THE REPO
===========================

    recipes/reproos-iso/run-evidence/m9r42_complete.txt    this file.
    libs/repro_profile/src/repro_profile/disk_apply.nim    diag hook (M9.R.42.1).
    tests/unit/t_m9r42_1_disk_diag_hook.nim                3 pinning tests.
    recipes/reproos-iso/scripts/stage-de-rootfs.sh         launcher passthrough
                                                           (M9.R.42.2).
    _m9r42_iso_rebuild.sh / _m9r42_install.sh /
    _m9r42_loopback_smoke.sh / _m9r42_smoke_disko.json     drivers (M9.R.42.3).
