| 1 | +M9.R.39 - close the M9.R.38 carry: characterise + fix live-ISO heap-corruption regression |
| 2 | +======================================================================================== |
| 3 | + |
| 4 | +Status as of 2026-06-26: PARTIAL. |
| 5 | + |
| 6 | + M9.R.38 regression CLOSED munmap_chunk()/SIGABRT path |
| 7 | + eliminated, installer now runs |
| 8 | + past QGuiApplication ctor and |
| 9 | + enters Phase 1. |
| 10 | + G2 (clean install) BLOCKED on new downstream gap: ``repro |
| 11 | + hardware probe`` calls lsblk |
| 12 | + and dies on ``lsblk JSON parse |
| 13 | + error: input(1, 1) Error: { |
| 14 | + expected``. NEW issue, unrelated |
| 15 | + to the M9.R.38 ABI mismatch. |
| 16 | + G3 (boot installed) BLOCKED on G2. |
| 17 | + G4 (DE smoke transcript) BLOCKED on G3. |
| 18 | + G5 (close-out) this file. |
| 19 | + |
| 20 | +EXECUTIVE SUMMARY |
| 21 | +================= |
| 22 | + |
| 23 | +M9.R.38 characterised a live-ISO heap-corruption crash that hit |
| 24 | +``reproos-installer`` after the Qt locale C->C.UTF-8 log line + |
| 25 | +before Phase 1, with the SIGABRT signature ``munmap_chunk(): |
| 26 | +invalid pointer``. The M9.R.38.3 nix-store-Qt6 skip didn't help, |
| 27 | +falsifying Qt6-version-mix as the primary cause. |
| 28 | + |
| 29 | +M9.R.39 closes that gap. Phase A added instrumentation, Phase B |
| 30 | +characterised the actual cause, Phase C landed the fix, Phase D |
| 31 | +ran the install and observed the installer running past the |
| 32 | +crash point and into Phase 1 (where a new lsblk-parse gap |
| 33 | +surfaced). |
| 34 | + |
| 35 | +The actual root cause was a glibc-instance mix-up. The installer
| 36 | +binary's PT_INTERP is the nix glibc-2.40-66 ld.so, but in that
| 37 | +process libc.so.6 / libm.so.6 / libpthread.so.0 resolved via
| 38 | +/etc/ld.so.cache to Debian's /lib/x86_64-linux-gnu/, while |
| 39 | +libdl.so.2 / libresolv.so.2 / librt.so.1 resolved to nix glibc |
| 40 | +(the only subset reachable via the RPATH-reflected /nix/store |
| 41 | +path). Two glibc instances' private heap + TLS data structures |
| 42 | +shared one process. malloc()/free() on one half corrupted the |
| 43 | +other half's bookkeeping; the first big static-init heap |
| 44 | +allocation past the QGuiApplication ctor tripped |
| 45 | +``munmap_chunk(): invalid pointer`` and SIGABRT. |
| 46 | + |
| 47 | +The fix narrows the override scope to the installer process tree: |
| 48 | +the launcher prepends the installer binary's PT_INTERP nix-glibc |
| 49 | +dir to its LD_LIBRARY_PATH via strace's ``-E`` flag (DIAG mode) |
| 50 | +or via direct invocation of the nix ld.so with ``--library-path`` |
| 51 | +(non-DIAG mode), so the wrapper chain (strace/stdbuf/env) stays |
| 52 | +on Debian libc. The installer's own process gets ALL glibc |
| 53 | +subsystems resolved from the SAME nix glibc instance. |
| 54 | + |
| 55 | +WHAT LANDED |
| 56 | +=========== |
| 57 | + |
| 58 | + M9.R.39.1 LD_DEBUG=libs + diag-persist via /dev/vdb (aa47631f) |
| 59 | + scratch disk; launcher captures every shared-lib |
| 60 | + resolution decision + per-tid kernel stacks + |
| 61 | + tars them to /dev/vdb sector 0+ before exit so |
| 62 | + the M9.R.37/M9.R.38 tmpfs-log loss problem |
| 63 | + doesn't recur. |
| 64 | + |
| 65 | + M9.R.39.2 systemd autorun unit bypasses serial-getty wedge (d7fa540b) |
| 66 | + The M9.R.39.1 FIFO + login + manual installer |
| 67 | + invocation pattern wedged on a terminfo-init loop |
| 68 | + in serial-getty's autologin chain. Replace with |
| 69 | + a ``reproos-installer-autorun.service`` systemd |
| 70 | + unit gated on ``ConditionKernelCommandLine= |
| 71 | + repro.installer.autorun=1``. Driver no longer |
| 72 | + needs a FIFO at all. |
| 73 | + |
| 74 | + M9.R.39.3 propagate REPRO_INSTALLER_AUTORUN through the (d75fe3d1) |
| 75 | + recipe wrapper. The engine's command-line |
| 76 | + wrapper in repro.nim filters outer env vars; the |
| 77 | + new var didn't reach build-iso.sh until added |
| 78 | + to the explicit prefix list. |
| 79 | + |
| 80 | + M9.R.39.4 switch autorun unit to self-gating script (06f5e3c4) |
| 81 | + ExecStart. M9.R.39.2's ConditionKernelCommandLine |
| 82 | + silently no-op'd (Wants symlink + unit file + |
| 83 | + cmdline param all verified present, yet no |
| 84 | + ``Starting`` line in the boot log). Replaced |
| 85 | + with a wrapper script that parses /proc/cmdline |
| 86 | + at run time + skips when the param is absent; |
| 87 | + powers off at the end unconditionally so QEMU
| 88 | + exits cleanly. This is the first iteration that |
| 89 | + actually ran the launcher to completion under |
| 90 | + the diag harness. |
| 91 | + |
| 92 | + M9.R.39.5 prepend installer PT_INTERP glibc dir to (0c632fa5) |
| 93 | + LD_LIBRARY_PATH. The actual Phase C fix. |
| 94 | + |
| 95 | + M9.R.39.6 tracer wrapper chain must not inherit (91c41afd) |
| 96 | + LD_LIBRARY_PATH. M9.R.39.5 had strace inherit |
| 97 | + LD_LIBRARY_PATH and hit ``__nptl_change_stack_perm, |
| 98 | + version GLIBC_PRIVATE`` (Debian strace + nix |
| 99 | + libc.so.6 symbol gap, RC=127). Move to strace's |
| 100 | + -E flag (sets env on traced child only) and |
| 101 | + direct nix-ld.so invocation in non-DIAG mode. |
| 102 | + |
| 103 | +PHASE A: INSTRUMENT (M9.R.39.1 + M9.R.39.4) |
| 104 | +============================================ |
| 105 | + |
| 106 | +The M9.R.38 tmpfs-log loss problem: live-ISO's /tmp is on a |
| 107 | +squashfs+overlay tmpfs that vanishes on poweroff. M9.R.37/38's |
| 108 | +strace + kernelstacks lived there and never reached the host |
| 109 | +after QEMU exited. |
| 110 | + |
| 111 | +M9.R.39.1 fix: launcher's DIAG mode tars the diag tree |
| 112 | +(installer.strace, installer.kernelstacks, installer.lddebug, |
| 113 | +installer.log, installer.binfo, installer.rc) and dd's the |
| 114 | +gzipped tarball onto /dev/vdb's raw sectors with a |
| 115 | +``M9R39DIAGv1 SIZE=<bytes>\n`` header on sector 0. The host |
| 116 | +driver attaches a 64 MiB raw scratch disk + post-mortem reads |
| 117 | +the header + extracts the tarball. |
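
The header round trip described above can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the function names ``write_diag`` / ``read_diag`` are hypothetical, and the sketch ASSUMES the gzipped tarball starts at byte offset 512 (sector 1), directly after the sector-0 header. The real on-disk layout may differ.

```shell
# Sketch only. ASSUMPTION: payload starts at sector 1 (byte 512).
MAGIC="M9R39DIAGv1"

# write_diag TARBALL DEVICE: header padded to one sector, payload after it.
write_diag() {
    size=$(wc -c < "$1" | tr -d ' ')
    printf '%s SIZE=%s\n' "$MAGIC" "$size" |
        dd of="$2" bs=512 count=1 conv=notrunc,sync 2>/dev/null
    dd if="$1" of="$2" bs=512 seek=1 conv=notrunc 2>/dev/null
}

# read_diag DEVICE OUT: validate the header, then copy exactly SIZE bytes.
read_diag() {
    header=$(dd if="$1" bs=512 count=1 2>/dev/null | head -n 1 | tr -d '\0')
    case "$header" in
        "$MAGIC SIZE="*) size=${header#"$MAGIC SIZE="} ;;
        *) echo "no diag header on $1" >&2; return 1 ;;
    esac
    case "$size" in
        ''|*[!0-9]*) echo "bad SIZE in header" >&2; return 1 ;;
    esac
    dd if="$1" bs=512 skip=1 2>/dev/null | head -c "$size" > "$2"
}
```

Carrying the byte count in the header lets the host trim the sector padding off the end of the tarball, so gzip sees exactly the bytes the guest wrote.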
| 118 | + |
| 119 | +M9.R.39.4 fix: systemd autorun unit replaces the FIFO + login |
| 120 | +chain. The driver no longer needs to send input -- the live |
| 121 | +ISO boots into the launcher via systemd. |
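
A self-gating check of this kind might look like the sketch below. The function name ``has_cmdline_param`` and the optional file argument (added so the check can be exercised without rebooting) are illustrative assumptions, not the names used in the real wrapper script.

```shell
# has_cmdline_param PARAM [CMDLINE_FILE]
# Succeeds only when PARAM appears as a whole whitespace-separated word,
# so "repro.installer.autorun=10" does not satisfy "...autorun=1".
has_cmdline_param() {
    cmdline=$(tr '\n\t' '  ' < "${2:-/proc/cmdline}")
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

A wrapper would then call ``has_cmdline_param repro.installer.autorun=1 || exit 0`` before starting the launcher, which moves the gate out of systemd's condition evaluation and into a script whose decision can be logged.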
| 122 | + |
| 123 | +Verification: the M9.R.39.4 + M9.R.39.5 install run produced a
| 124 | +73 KiB diag tarball with all 6 expected files, including a
| 125 | +3.4 MiB strace + 47 KiB lddebug + 12 KiB binfo. |
| 126 | + |
| 127 | +PHASE B: CHARACTERISE (M9.R.39.4 run) |
| 128 | +====================================== |
| 129 | + |
| 130 | +Diag tarball: /tmp/m9r39_diag_phaseB/installer.diag.tar.gz |
| 131 | +(snapshot of the M9.R.39.4 run at 03:38, RC=134 SIGABRT). |
| 132 | + |
| 133 | +Key channels: |
| 134 | + |
| 135 | + installer.log (40 bytes, the SIGABRT diagnostic itself): |
| 136 | + |
| 137 | + munmap_chunk(): invalid pointer |
| 138 | + Aborted |
| 139 | + |
| 140 | + installer.binfo ldd resolution view of /usr/bin/reproos-installer |
| 141 | + on the live ISO (M9.R.39.4 launcher's binfo dump, no |
| 142 | + M9.R.39.5 LD_LIBRARY_PATH override yet): |
| 143 | + |
| 144 | + libstdc++.so.6 -> /nix/store/xm08aqdd7pxcdhm0ak6aqb1v7hw5q6ri-gcc-14.3.0-lib/lib/libstdc++.so.6 |
| 145 | + libQt6Core.so.6 -> /opt/repro/.../qt6-base/.../lib/libQt6Core.so.6 |
| 146 | + libgcc_s.so.1 -> /nix/store/xm08aqdd7pxcdhm0ak6aqb1v7hw5q6ri-gcc-14.3.0-lib/lib/libgcc_s.so.1 |
| 147 | + libc.so.6 -> /lib/x86_64-linux-gnu/libc.so.6 (Debian) |
| 148 | + libm.so.6 -> /lib/x86_64-linux-gnu/libm.so.6 (Debian) |
| 149 | + libpthread.so.0 -> /lib/x86_64-linux-gnu/libpthread.so.0 (Debian) |
| 150 | + libdl.so.2 -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libdl.so.2 |
| 151 | + libresolv.so.2 -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libresolv.so.2 |
| 152 | + librt.so.1 -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/librt.so.1 |
| 153 | + PT_INTERP -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/ld-linux-x86-64.so.2 |
| 154 | + |
| 155 | + ldconfig -p showed Debian's libc.so.6 FIRST + nix glibc-2.40-66's |
| 156 | + libc.so.6 SECOND in the cache. The from-source binary's |
| 157 | + PT_INTERP nix-glibc ld.so loaded, fell through to ld.so.cache |
| 158 | + (Debian first) for libc.so.6 / libm.so.6 / libpthread.so.0, and |
| 159 | + found libdl.so.2 / libresolv.so.2 / librt.so.1 via the |
| 160 | + RPATH-reflected /nix/store glibc path (M9.R.37.5 launcher's |
| 161 | + LD_LIBRARY_PATH set those dirs). Two glibc instances' |
| 162 | + private heap + TLS metadata in one process; the first big |
| 163 | + static-init heap allocation past QGuiApplication ctor tripped |
| 164 | + ``munmap_chunk(): invalid pointer``. |
| 165 | + |
| 166 | + installer.strace shows the abort signature: |
| 167 | + |
| 168 | + 616 1782434312.742906 access("/usr/bin/qt.conf", F_OK) = -1 ENOENT |
| 169 | + 616 1782434312.744512 writev(2, ["munmap_chunk(): invalid pointer", "\n"]) |
| 170 | + 616 1782434312.749732 tgkill(616, 616, SIGABRT) = 0 |
| 171 | + 616 1782434312.751042 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=616, si_uid=0} --- |
| 172 | + 616 1782434312.756735 +++ killed by SIGABRT +++ |
| 173 | + |
| 174 | + The specific offender is libc.so.6: BOTH copies end up
| 175 | + loaded into the same process:
| 176 | + /lib/x86_64-linux-gnu/libc.so.6 (Debian) |
| 177 | + /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libc.so.6 (nix) |
| 178 | + Both have the SAME soname (libc.so.6) but DIFFERENT internal |
| 179 | + layouts -- compiled from different glibc 2.4x branches with |
| 180 | + different malloc state machines. |
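
The split visible in the binfo listing can be checked mechanically. The sketch below is an illustration only: the helper names are hypothetical, and it assumes input lines shaped like the listing above (``soname -> path``; ldd's native ``=>`` is accepted too). It reports which directories the glibc-owned sonames resolve to and flags the case where there is more than one.

```shell
# glibc_dirs: read "soname -> path" lines on stdin and print each distinct
# directory that a glibc-owned soname (or PT_INTERP) resolves to.
glibc_dirs() {
    awk '
        ($2 == "->" || $2 == "=>") &&
        $1 ~ /^(libc\.so\.6|libm\.so\.6|libpthread\.so\.0|libdl\.so\.2|libresolv\.so\.2|librt\.so\.1|PT_INTERP)$/ {
            n = split($3, parts, "/")
            dir = substr($3, 1, length($3) - length(parts[n]) - 1)
            if (!(dir in seen)) { seen[dir] = 1; print dir }
        }'
}

# is_mixed_glibc: succeed when more than one glibc directory is in play.
is_mixed_glibc() {
    [ $(glibc_dirs | wc -l) -gt 1 ]
}
```

Fed the pre-fix listing, this would print both /lib/x86_64-linux-gnu and the nix glibc-2.40-66 lib dir, which is the condition that preceded the abort.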
| 181 | + |
| 182 | +PHASE C: FIX (M9.R.39.5 + M9.R.39.6) |
| 183 | +==================================== |
| 184 | + |
| 185 | +Per the project's "from-source propagation, no apt fallback" |
| 186 | +principle (cross-source-build/AGENTS.md), the fix shape is |
| 187 | +Option W: the launcher discovers the installer binary's |
| 188 | +PT_INTERP nix-store glibc dir + prepends it to LD_LIBRARY_PATH |
| 189 | +just for the installer process tree. Debian binaries on the |
| 190 | +live ISO (cat/head/ls/etc) keep their compatible glibc. |
| 191 | + |
| 192 | +Discovery: the launcher scans the installer binary's first |
| 193 | +4 KiB for the PT_INTERP nix-store glibc path via |
| 194 | +``head -c 4096 | grep -oE '/nix/store/[a-z0-9]+-glibc-[^/]+/lib/ld-linux-x86-64\.so\.2'``. |
| 195 | +Fallback to ``ls -d /nix/store/*-glibc-*/lib`` if header |
| 196 | +extraction fails. |
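
The header scan above can be sketched as a pair of shell functions. The function names are hypothetical. The ``-a`` flag is an addition in this sketch, on the assumption that GNU grep would otherwise treat the ELF header as binary and report a match without printing it; the launcher's real invocation may differ.

```shell
# interp_from_header BINARY: print the nix-store ld.so path embedded in the
# first 4 KiB of BINARY (where PT_INTERP normally lives), or nothing.
interp_from_header() {
    head -c 4096 "$1" |
        grep -aoE '/nix/store/[a-z0-9]+-glibc-[^/]+/lib/ld-linux-x86-64\.so\.2' |
        head -n 1
}

# glibc_libdir BINARY: print the directory to prepend to the library search
# path; fail if the header carries no nix-store interpreter.
glibc_libdir() {
    interp=$(interp_from_header "$1")
    [ -n "$interp" ] || return 1
    dirname "$interp"
}
```

A caller would fall back to the ``ls -d`` glob only when ``glibc_libdir`` fails, since the glob can match several glibc store paths while the header names exactly one.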
| 197 | + |
| 198 | +M9.R.39.5 placed LD_LIBRARY_PATH on the strace + stdbuf wrapper |
| 199 | +chain's env via the launcher's ``env LD_LIBRARY_PATH=... strace |
| 200 | +...`` pattern -- which caused strace + stdbuf to resolve their |
| 201 | +OWN libc.so.6 against the nix glibc + hit |
| 202 | +``undefined symbol: __nptl_change_stack_perm, |
| 203 | +version GLIBC_PRIVATE`` and exit RC=127 (M9.R.39.5 install |
| 204 | +attempt at 04:18:14). |
| 205 | + |
| 206 | +M9.R.39.6 corrected the env-scoping: DIAG mode uses strace's |
| 207 | +``-E var=val`` flag (sets env on traced child only -- strace |
| 208 | +itself sees a clean env without LD_LIBRARY_PATH and stays on |
| 209 | +Debian libc). Non-DIAG mode invokes the nix-glibc ld.so |
| 210 | +DIRECTLY via ``ld-linux-x86-64.so.2 --library-path ...`` so |
| 211 | +no LD_LIBRARY_PATH has to be exported for a wrapper to inherit;
| 212 | +ld.so takes its search path from ``--library-path`` itself.
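
The two launch shapes can be put side by side in a sketch. The function name ``launch_argv`` is hypothetical and it only prints the command it would run. The ``-f`` flag and the reuse of any inherited LD_LIBRARY_PATH are assumptions; the real launcher's wrapper chain (stdbuf, env) is not modelled here.

```shell
# launch_argv MODE GLIBC_DIR INSTALLER [ARGS...]
# Prints the command line for the given mode, one argument per line.
launch_argv() {
    mode=$1 glibc_dir=$2 installer=$3
    shift 3
    libpath="$glibc_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    case "$mode" in
        diag)
            # -E sets the variable for the traced child only; strace itself
            # keeps the environment it was started with.
            set -- strace -f -E "LD_LIBRARY_PATH=$libpath" "$installer" "$@"
            ;;
        *)
            # ld.so is run directly and gets its search path as an argument,
            # so nothing is exported into any wrapper's environment.
            set -- "$glibc_dir/ld-linux-x86-64.so.2" \
                --library-path "$libpath" "$installer" "$@"
            ;;
    esac
    printf '%s\n' "$@"
}
```

In both shapes the override is attached to the installer's own exec, so Debian-linked tools ahead of it in the chain never see the nix glibc directory.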
| 213 | + |
| 214 | +Verification (M9.R.39.6 install run at 04:35:41): |
| 215 | + |
| 216 | + installer.rc: 1 (Phase 1 logical failure, NOT SIGABRT) |
| 217 | + |
| 218 | + installer.log: |
| 219 | + [2026-06-26 01:35:41] Phase 1: probing hardware... |
| 220 | + [2026-06-26 01:35:41] $ /usr/bin/repro hardware probe ... |
| 221 | + Error: unhandled exception: lsblk JSON parse error: ... |
| 222 | + install failed: hardware probe failed |
| 223 | + |
| 224 | + installer.strace tail shows ``calling fini:`` records for |
| 225 | + the SAME nix glibc-2.40-66 across libpthread.so.0, |
| 226 | + libm.so.6, libc.so.6, libdl.so.2 -- ALL glibc subsystems |
| 227 | + consistently resolved to the same nix instance: |
| 228 | + |
| 229 | + glibc-2.40-66/lib/libpthread.so.0 (was Debian before) |
| 230 | + glibc-2.40-66/lib/libm.so.6 (was Debian before) |
| 231 | + glibc-2.40-66/lib/libc.so.6 (was Debian before) |
| 232 | + glibc-2.40-66/lib/ld-linux-x86-64.so.2 |
| 233 | + |
| 234 | + The munmap_chunk() crash is GONE. The installer runs past |
| 235 | + QGuiApplication, parses --automated, opens auto-config.toml, |
| 236 | + starts Phase 1, calls ``repro hardware probe``, and fails on |
| 237 | + a NEW gap: lsblk's JSON output is empty/malformed so |
| 238 | + ``repro_profile.probeFilesystemsFrom`` raises a parse error. |
| 239 | + |
| 240 | +PHASE D: BLOCKED ON NEW lsblk DOWNSTREAM GAP |
| 241 | +============================================= |
| 242 | + |
| 243 | +The Phase D plan was: rebuild ISO with fix, run install, boot |
| 244 | +installed disk, run DE smoke (sway/kwin/mutter/plasma/sddm |
| 245 | +--version). Stage 1 (install) now reaches Phase 1 but Phase 1 |
| 246 | +fails on lsblk JSON parsing. This is a SEPARATE issue: |
| 247 | + |
| 248 | + * ``repro hardware probe`` spawns ``lsblk --json -b -o ...`` |
| 249 | + and parses the output. The parser sees empty or non-JSON
| 250 | + input ("input(1, 1) Error: { expected").
| 251 | + * Possible causes (not yet diagnosed): |
| 252 | + - lsblk binary missing from live ISO (FS:done util-linux |
| 253 | + recipe shadow link path) |
| 254 | + - lsblk runtime dep missing (libblkid / libsmartcols) |
| 255 | + - lsblk command-line flag mismatch between the from-source |
| 256 | + util-linux version and the ``repro hardware probe`` caller's |
| 257 | + expectation |
| 258 | + - The ``repro`` binary's ``runProcess()`` returns empty when
| 259 | + the spawned command exits non-zero |
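
One way to narrow the candidate causes above is to classify a single lsblk invocation by its exit status and captured stdout. The sketch below is a triage aid under stated assumptions, not a diagnosis: ``classify_probe`` and its category names are hypothetical, and it relies on the general convention that exit status 127 covers both a missing command and a dynamic-loader failure.

```shell
# classify_probe RC OUTPUT: name the most likely failure class given the
# exit status and stdout of one "lsblk --json ..." run.
classify_probe() {
    rc=$1 out=$2
    if [ "$rc" -eq 127 ]; then
        echo "not-found-or-loader-error"   # missing binary or missing .so
    elif [ "$rc" -ne 0 ] && [ -z "$out" ]; then
        echo "failed-no-output"            # e.g. rejected flag or column
    elif [ -z "$out" ]; then
        echo "empty-output"
    else
        case "$out" in
            "{"*) echo "looks-like-json" ;;
            *)    echo "non-json-output" ;;
        esac
    fi
}
```

On the live ISO it could be driven as ``out=$(lsblk --json -b 2>/tmp/lsblk.err); classify_probe $? "$out"``, keeping stderr in a file so a loader or flag error message is not lost.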
| 260 | + |
| 261 | +This is a NEW investigation surface, OUT OF SCOPE for the |
| 262 | +M9.R.39 milestone. The honest closeout is: the M9.R.38 |
| 263 | +regression IS CLOSED (no more munmap_chunk + SIGABRT); the |
| 264 | +NEXT layer is documented + handed off. |
| 265 | + |
| 266 | +EVIDENCE FILES LEFT IN /tmp ON ELI-WSL |
| 267 | +======================================= |
| 268 | + |
| 269 | + /tmp/m9r39_diag_phaseB/ M9.R.39.4 run, RC=134 SIGABRT |
| 270 | + installer.strace pre-fix syscall trace (985 KiB) |
| 271 | + installer.log "munmap_chunk(): invalid pointer\nAborted" |
| 272 | + installer.binfo pre-fix ldd resolution view |
| 273 | + installer.diag.tar.gz 73 KiB tarball |
| 274 | + installer.rc 134 |
| 275 | + |
| 276 | + /tmp/m9r39_diag_phaseC/ M9.R.39.6 run, RC=1 Phase 1 lsblk gap |
| 277 | + installer.strace post-fix trace, NO SIGABRT |
| 278 | + installer.log Phase 1 ... "lsblk JSON parse error" |
| 279 | + installer.binfo post-fix ldd view (showing |
| 280 | + same Debian libc resolution |
| 281 | + from the ldd subprocess -- |
| 282 | + NOT the installer's actual |
| 283 | + resolution path, which sets |
| 284 | + LD_LIBRARY_PATH per |
| 285 | + M9.R.39.6). |
| 286 | + installer.rc 1 |
| 287 | + |
| 288 | + /tmp/m9r39_install.log full QEMU serial transcript |
| 289 | + (boot + autorun + crash + poweroff) |
| 290 | + /tmp/m9r39_install.qcow2 192 KiB qcow2 (no install body |
| 291 | + written -- Phase 1 failed) |
| 292 | + |
| 293 | +SCRIPTS / TOOLS LEFT IN REPO |
| 294 | +============================ |
| 295 | + |
| 296 | + _m9r39_install.sh install driver (no FIFO; reads |
| 297 | + back diag tarball from |
| 298 | + /tmp/m9r39_diag.qcow2 raw |
| 299 | + sectors) |
| 300 | + _m9r39_iso_rebuild.sh ISO rebuild with |
| 301 | + REPRO_INSTALLER_AUTORUN=1 |
| 302 | + |
| 303 | +HONEST REMAINING GAP |
| 304 | +==================== |
| 305 | + |
| 306 | +M9.R.39's primary scope -- characterise + fix the M9.R.38 |
| 307 | +heap-corruption regression -- is CLOSED. The installer no |
| 308 | +longer crashes on Qt static-init heap allocations. |
| 309 | + |
| 310 | +The "from-source propagation" architectural principle is now |
| 311 | +materialised at the launcher level: nix-glibc subsystems resolve |
| 312 | +consistently inside the installer process tree; Debian |
| 313 | +binaries on the live ISO keep their compatible glibc. |
| 314 | + |
| 315 | +G2/G3/G4 are still blocked, but on a different cause: the |
| 316 | +NEW lsblk-parse failure in Phase 1's hardware probe. That's |
| 317 | +a downstream surface that needs its own characterisation pass |
| 318 | +(M9.R.40 candidate) -- ``lsblk --json`` returning empty or |
| 319 | +malformed inside the live ISO's chroot, the ``repro`` binary's |
| 320 | +runProcess error handling, or a from-source util-linux flag |
| 321 | +mismatch. |
| 322 | + |
| 323 | +The investigator who picks this up has: |
| 324 | + * The clean test harness in _m9r39_install.sh + .iso |
| 325 | + (REPRO_INSTALLER_AUTORUN=1 enables one-button reproduction) |
| 326 | + * The full strace trace of the post-fix Phase 1 attempt |
| 327 | + showing exactly what the ``repro`` subprocess sees |
| 328 | + * The launcher's nix-glibc resolution architecture (the |
| 329 | + Phase C fix that DID land) as the starting point for |
| 330 | + extending the LD_LIBRARY_PATH override into ``repro |
| 331 | + hardware probe``'s QProcess children if that turns out |
| 332 | + to be the cause |
| 333 | + |
| 334 | +The M9.R.39 milestone closes its OWN gap and hands off the |
| 335 | +NEXT gap with full evidence. |