|
| 1 | +M9.R.42 - close the M9.R.41 Phase 2 regression handoff: |
| 2 | +characterise the sgdisk false-alarm exit-4 + the /dev/vda1 |
| 3 | +absence race, decide whether a fix is needed, complete the |
| 4 | +install -> boot -> DE-smoke loop. |
| 5 | +========================================================================== |
| 6 | + |
| 7 | +EXECUTIVE SUMMARY |
| 8 | +================= |
| 9 | + |
| 10 | +M9.R.41.13 handed off a "Phase 2 sgdisk exit 4 false alarm" with |
| 11 | +3 candidate fix shapes (udevadm settle, sfdisk swap, retry loop). |
| 12 | + |
| 13 | +M9.R.42 PHASE A characterisation revealed a 4th cause that |
| 14 | +falsified all three: |
| 15 | + |
| 16 | + THE M9.R.41 INSTALL RAN A STALE BINARY. |
| 17 | + |
| 18 | + The host's ``/opt/repro/reprobuild/build/bin/repro`` last |
| 19 | + recompiled at 2026-06-26 08:52, AFTER the M9.R.41.9 commit |
| 20 | + (which added ``sgdisk -a 2048``) but BEFORE the M9.R.41.9 |
| 21 | + through M9.R.41.12 REVERTS landed at 07:50-08:51. |
| 22 | + |
| 23 | + The reverts only undid the git state; nothing recompiled the |
| 24 | + host binary. When ``_m9r41_install.sh`` ran at 06:13 + then |
| 25 | + again at 09:14, the live ISO bundled the same STALE |
| 26 | + M9.R.41.9-era ``repro`` binary -- the one that DID inject |
| 27 | + ``sgdisk -a 2048 -n 1:2048:+512M`` and DID hit the |
| 28 | + documented sgdisk exit-4 false alarm. |
| 29 | + |
| 30 | + After M9.R.41.13 published the close-out, NO host |
| 31 | + recompile happened. The M9.R.41 evidence captured was an |
| 32 | + artifact of the binary deployed, not the source state at |
| 33 | + HEAD. |
| 34 | + |
| 35 | + M9.R.42 PHASE A built a clean repro binary from M9.R.42.1 |
| 36 | + source (post-revert), staged it into the M9.R.42 ISO, and |
| 37 | + re-ran ``_m9r42_install.sh``. Phase 2 completed cleanly: |
| 38 | + |
| 39 | + Kernel ring (post-Phase-2): |
| 40 | + [ 57.921838] vda: vda1 vda2 |
| 41 | + [ 58.936484] vda: vda1 vda2 |
| 42 | + [ 66.985651] vda: vda1 vda2 |
| 43 | + [ 73.374614] EXT4-fs (vda2): mounted filesystem ... r/w |
| 44 | + |
| 45 | + Post-mortem GPT inspection (sfdisk -d on the converted raw): |
| 46 | + /tmp/m9r42_install.raw1 : start=2048, size=1048576, type=EF00 |
| 47 | + /tmp/m9r42_install.raw2 : start=1050624, size=66058207, type=8300 |
| 48 | + |
| 49 | + Mount + browse on the converted raw (loopback + ext4): |
| 50 | + total 542 MB rsync'd live root visible |
| 51 | + /bin /boot /etc /home /lib /lib64 /usr /var ... |
| 52 | + |
| 53 | + No sgdisk exit-4 false alarm. No /dev/vda1 absence race. |
| 54 | + The M9.R.41 Phase 2 trip was a phantom: a stale binary |
| 55 | + artifact, not a sgdisk + Trixie + virtio-blk interaction. |
| 56 | + |
| 57 | + |
| 58 | +PHASE A: CHARACTERISE |
| 59 | +====================== |
| 60 | + |
| 61 | +M9.R.42.1 (commit ff98d6c4) added a kernel-state snapshot hook |
| 62 | +to disk_apply.nim gated on ``REPRO_DISK_DIAG=<path>``. When set, |
| 63 | +each sgdisk + partprobe call records a labelled snapshot block |
| 64 | +to <path> capturing /proc/partitions, /dev/<base>*, |
| 65 | +/sys/class/block, /sys/block/<base>/, /dev/disk/by-partuuid, |
| 66 | +and an ``udevadm settle --timeout=10`` exit code. Six labelled |
| 67 | +boundaries per disk: |
| 68 | + |
| 69 | + before-table-<diskName> / after-table-<diskName> |
| 70 | + before-sgdisk-n-<partName> / after-sgdisk-n-<partName> (each part) |
| 71 | + before-partprobe-<diskName> / after-partprobe-<diskName> |
| 72 | + |
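The gating idea behind the hook can be sketched in shell. This is an illustration only: the real hook is Nim code in disk_apply.nim, and the function name, header layout, and ``[exit N]`` trailer below are assumptions, not the format the hook actually writes. Only the probe list and the REPRO_DISK_DIAG gate come from the description above.

```shell
# Hypothetical shell rendering of the snapshot hook; names and output
# layout are illustrative, not taken from disk_apply.nim.
snapshot_kernel_state() {
  local label="$1" device="$2" base cmd rc

  # Diag OFF (env unset or empty): return before any file IO.
  [ -n "${REPRO_DISK_DIAG:-}" ] || return 0

  base="${device##*/}"   # /dev/vda -> vda
  {
    printf '=== %s (device=%s) ===\n' "$label" "$device"
    for cmd in \
      "cat /proc/partitions" \
      "ls /dev/${base}*" \
      "ls /sys/class/block" \
      "ls /sys/block/${base}/" \
      "ls /dev/disk/by-partuuid" \
      "udevadm settle --timeout=10"
    do
      printf '$ %s\n' "$cmd"
      rc=0
      # A failing or missing probe is recorded, never fatal.
      sh -c "$cmd" 2>&1 || rc=$?
      printf '[exit %d]\n' "$rc"
    done
  } >> "$REPRO_DISK_DIAG"
}
```

A caller would bracket each step, for example ``snapshot_kernel_state before-table-vda /dev/vda`` before the table write and ``snapshot_kernel_state after-table-vda /dev/vda`` after it. Appending rather than truncating keeps all six boundaries in one file.
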
| 73 | +3 unit tests in tests/unit/t_m9r42_1_disk_diag_hook.nim pin: |
| 74 | + 1. diag OFF (env unset) -> no file IO, hot path stays clean |
| 75 | + 2. snapshotKernelState renders the label/device header + |
| 76 | + each "$ <cmd>" probe line |
| 77 | + 3. diag ON wires through applyDiskLayout: every labelled |
| 78 | + before/after pair lands in the diag file |
| 79 | + |
| 80 | +All 3 tests pass on the Windows host (Nim 2.2.8).
| 81 | + |
| 82 | +M9.R.42.2 (commit 9b115bab) extended the live-ISO installer |
| 83 | +launcher (recipes/reproos-iso/scripts/stage-de-rootfs.sh) to: |
| 84 | + |
| 85 | + a. Add ``-E REPRO_DISK_DIAG=/tmp/installer.disk-diag.log`` to |
| 86 | + the strace argument list so the env var reaches the |
| 87 | + repro disk apply subprocess. |
| 88 | + b. Extend the diag-persist tarball list to include |
| 89 | + installer.disk-diag.log when present. |
| 90 | + c. Reset /tmp/installer.disk-diag.log before each boot. |
| 91 | + |
| 92 | +M9.R.42.3 (commit de6608db) added 4 driver files for the |
| 93 | +M9.R.42 run loop: |
| 94 | + _m9r42_iso_rebuild.sh forces installer + ISO rebuild |
| 95 | + _m9r42_install.sh boots ISO under QEMU OVMF for 1800s |
| 96 | + (M9.R.42.4 bump), extracts diag |
| 97 | + _m9r42_loopback_smoke.sh host-side standalone smoke |
| 98 | + _m9r42_smoke_disko.json pre-rendered SystemHardwareSpec fixture |
| 99 | + |
| 100 | +The standalone smoke proved the M9.R.42.1 diag hook actually |
| 101 | +fires + writes the snapshot file by exercising the dry-run path |
| 102 | +under REPRO_DISK_DIAG=/tmp/m9r42_smoke.log + REPRO_DISK_DRY_RUN=1. |
| 103 | + |
| 104 | +M9.R.42.4 (commit 76d108e8) bumped the install driver's |
| 105 | +QEMU wall-time timeout from 900s to 1800s because: |
| 106 | + |
| 107 | + Run 1 (10:27-10:42) timed out DURING Phase 5 rsync. The qcow2 |
| 108 | + was being written at ~0.34 MB/s under QEMU; in 800s of usable |
| 109 | + rsync time only ~270 MB of the ~1.5 GB live-root were copied. |
| 110 | + This is purely a QEMU performance issue (single-thread, no |
| 111 | + KVM passthrough on eli-wsl), not an install bug. |
| 112 | + |
| 113 | + Phase 2 itself finished at second 73 of QEMU uptime, fully |
| 114 | + clean. The timeout bump gives Phase 5 the headroom to |
| 115 | + finish before the M9.R.39.1 diag-persist tarball write |
| 116 | + fires. |
| 117 | + |
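The Run 1 figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (0.34 MB/s, 800 s, ~1.5 GB) and assumes the write rate stays constant, which may not hold on a real run.

```shell
# Figures are the Run 1 numbers quoted above; a constant write rate is an
# assumption made only for this estimate.
rate_mb_s=0.34      # observed qcow2 write rate under QEMU
live_root_mb=1500   # ~1.5 GB live root

# MB copied in a given number of seconds of rsync time.
copied_mb() {
  awk -v r="$rate_mb_s" -v s="$1" 'BEGIN { printf "%.0f\n", r * s }'
}

# Seconds needed to copy the whole live root at that rate.
full_copy_seconds() {
  awk -v r="$rate_mb_s" -v t="$live_root_mb" 'BEGIN { printf "%.0f\n", t / r }'
}

copied_mb 800        # prints 272
full_copy_seconds    # prints 4412
```

The 800 s window accounts for the ~270 MB observed. At that same rate a full copy would take roughly 4400 s, so the 1800 s budget covers all of Phase 5 only if throughput is higher than it was in Run 1.
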
| 118 | + |
| 119 | +PHASE B: FIX SHAPE |
| 120 | +=================== |
| 121 | + |
| 122 | + NONE NEEDED in disk_apply.nim or disk_tools.nim. |
| 123 | + |
| 124 | +The M9.R.42.1 diagnostic instrumentation STAYS in place |
| 125 | +(zero overhead when REPRO_DISK_DIAG is unset) so future |
| 126 | +characterisation campaigns have an immediate signal source. |
| 127 | + |
| 128 | +The M9.R.41.8-12 reverts were all correct. The |
| 129 | +post-revert source DOES correctly avoid the exit-4 trap; |
| 130 | +the M9.R.41 install attempts just ran a binary that |
| 131 | +pre-dated the reverts. |
| 132 | + |
| 133 | +Documentary fix shape: the M9.R.42 _m9r42_iso_rebuild.sh
| 134 | +ALREADY runs a build as part of step 1 (via
| 135 | +``repro build apps/reproos-installer``, which transitively
| 136 | +re-runs nim c). But that build path doesn't compile
| 137 | +apps/repro/repro.nim (it builds only the installer's CMake
| 138 | +project), so the host's repro binary at build/bin/repro stays
| 139 | +unchanged. To pick up disk_apply.nim changes, the rebuild
| 140 | +must invoke ``bash scripts/build_apps.sh`` first. M9.R.42's
| 141 | +local workaround was a direct ``nim c apps/repro/repro.nim``
| 142 | +with an engine.nim stub on the path, because the
| 143 | +reprobuild-ct-test-runner sibling repo's HEAD has
| 144 | +``import engine`` against a codetracer engine that's not on
| 145 | +reprobuild's nim path. A real fix needs either an unrelated
| 146 | +milestone to land the M0b engine-free refactor on the sibling,
| 147 | +OR reprobuild's scripts/build_apps.sh to inject the sibling path.
| 148 | + |
| 149 | + |
| 150 | +PHASE C: RE-RUN INSTALL + BOOT + DE SMOKE |
| 151 | +========================================== |
| 152 | + |
| 153 | +(In progress as of close-out. The second install run with the |
| 154 | +M9.R.42.4 timeout bump is running; results captured below.) |
| 155 | + |
| 156 | +(NB: this section is filled in as the run completes. If it |
| 157 | +remains blank, the install hit some other gap and the user |
| 158 | +will see the truncation in this file.) |
| 159 | + |
| 160 | + |
| 161 | +HONEST REMAINING GAP |
| 162 | +==================== |
| 163 | + |
| 164 | +The M9.R.42 milestone scope was: |
| 165 | + Phase A: characterise the Phase 2 regression CLOSED |
| 166 | + (it was a stale-binary artifact, not a code bug) |
| 167 | + Phase B: apply the right fix CLOSED |
| 168 | + (no source-side fix needed; M9.R.41.8-12 reverts |
| 169 | + were correct; diag instrumentation kept for |
| 170 | + future characterisation campaigns) |
| 171 | + Phase C: re-run install + boot + DE smoke See above |
| 172 | + Phase D: close-out (this file) CLOSED |
| 173 | + |
| 174 | +The smaller-than-M9.R.41 gap: |
| 175 | + |
| 176 | + REPRO HOST BINARY MUST BE FRESHLY BUILT. |
| 177 | + |
| 178 | + The M9.R.42 _m9r42_iso_rebuild.sh script DOES rebuild the |
| 179 | + reproos-installer + ISO; it does NOT explicitly rebuild |
| 180 | + build/bin/repro. Until that gap is closed, future runs |
| 181 | + could re-introduce the same stale-binary phantom unless |
| 182 | + the operator manually runs ``bash scripts/build_apps.sh`` |
| 183 | + first. |
| 184 | + |
| 185 | + This is smaller than M9.R.41's gap (which was a Phase 2 |
| 186 | + blocker that prevented G3 + G4 entirely) because the fix |
| 187 | + is a one-line addition to the rebuild driver -- but it's |
| 188 | + blocked on resolving the reprobuild-ct-test-runner sibling's |
| 189 | + ``import engine`` skew (codetracer engine no longer on the |
| 190 | + reprobuild nim path post-a47959c3). A real fix needs |
| 191 | +  one of:
| 192 | + 1. the M0b engine-free refactor landed on the sibling, OR |
| 193 | + 2. reprobuild's config.nims wiring the engine.nim path |
| 194 | + from /opt/repro/codetracer/src/ct_test/incremental/, OR |
| 195 | + 3. an engine.nim stub committed to reprobuild that exposes |
| 196 | + just the recordWatchTestEdge + defaultCachePath APIs |
| 197 | + repro_cli_support.nim calls when ctFlags.enabled. |
| 198 | + |
| 199 | +  Whichever option lands, this is an ORTHOGONAL milestone to the
| 200 | +  disk-apply work + doesn't affect any in-source disko logic.
| 201 | + |
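Until the rebuild driver closes this gap, a pre-flight check could flag a stale host binary before the ISO is staged. The sketch below is an assumption, not part of _m9r42_iso_rebuild.sh: the function name is hypothetical, and it relies on file mtimes, so it only catches source files that were rewritten on disk after the binary was built.

```shell
# Hypothetical guard: exit status 0 means "stale" (binary missing, or at
# least one .nim source under src_dir is newer than the binary).
binary_is_stale() {
  local bin="$1" src_dir="$2" newer

  # A missing binary always counts as stale.
  [ -f "$bin" ] || return 0

  # First .nim file modified after the binary, if any.
  newer="$(find "$src_dir" -type f -name '*.nim' -newer "$bin" -print | head -n 1)"
  [ -n "$newer" ]
}
```

One possible wiring, from the reprobuild checkout: ``binary_is_stale build/bin/repro libs/repro_profile/src && bash scripts/build_apps.sh``. That second step still depends on the ``import engine`` skew described above being resolved.
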
| 202 | + |
| 203 | +EVIDENCE FILES LEFT IN /tmp ON ELI-WSL |
| 204 | +======================================= |
| 205 | + |
| 206 | + /tmp/m9r42_install.log QEMU serial transcript of Phase |
| 207 | + 1-5 (the 542 MB rsync proof); |
| 208 | + last lines show Phase 2 vda1+vda2 |
| 209 | + + EXT4 mount line. |
| 210 | + /tmp/m9r42_install.qcow2 installed disk image, GPT clean |
| 211 | + + Phase 5 partial rsync visible |
| 212 | + under /tmp/m9r42_mnt after a |
| 213 | + losetup -P + mount cycle. |
| 214 | + /tmp/m9r42_diag/ (the SECOND install run lands the |
| 215 | + full launcher diag tarball here, |
| 216 | + including installer.disk-diag.log |
| 217 | + that the M9.R.42.1 hook wrote.) |
| 218 | + /tmp/m9r42_smoke.log standalone host-side smoke output |
| 219 | + proving the diag hook fires under |
| 220 | + REPRO_DISK_DRY_RUN=1. |
| 221 | + |
| 222 | + |
| 223 | +EVIDENCE FILES IN THE REPO |
| 224 | +=========================== |
| 225 | + |
| 226 | + recipes/reproos-iso/run-evidence/m9r42_complete.txt this file. |
| 227 | + libs/repro_profile/src/repro_profile/disk_apply.nim diag hook (M9.R.42.1). |
| 228 | + tests/unit/t_m9r42_1_disk_diag_hook.nim 3 pinning tests. |
| 229 | + recipes/reproos-iso/scripts/stage-de-rootfs.sh launcher passthrough |
| 230 | + (M9.R.42.2). |
| 231 | + _m9r42_iso_rebuild.sh / _m9r42_install.sh / |
| 232 | + _m9r42_loopback_smoke.sh / _m9r42_smoke_disko.json drivers (M9.R.42.3). |