Commit 86fdd7f

zah and claude committed
M9.R.39.7: close-out evidence (regression closed, new lsblk gap handed off)
The M9.R.38 ``munmap_chunk(): invalid pointer`` regression on the live ISO is closed. Phase B characterised it as a glibc-instance mix-up (Debian libc.so.6 + nix glibc libdl.so.2 in one process, private heap + TLS data structures get clobbered). Phase C (the launcher prepending the installer's PT_INTERP nix-glibc dir to LD_LIBRARY_PATH, narrowing the override scope to the installer process tree via strace's -E flag and direct nix-ld.so invocation) eliminates the SIGABRT path.

Phase D's install + boot + DE smoke remains blocked, now on a separate downstream surface: ``repro hardware probe`` calls lsblk and dies on a JSON parse error. This is a NEW investigation -- the M9.R.38 regression is no longer the gating issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8c3cab1 commit 86fdd7f

1 file changed

Lines changed: 335 additions & 0 deletions

@@ -0,0 +1,335 @@
M9.R.39 - close the M9.R.38 carry: characterise + fix live-ISO heap-corruption regression
========================================================================================

Status as of 2026-06-26: PARTIAL.

  M9.R.38 regression        CLOSED    munmap_chunk()/SIGABRT path
                                      eliminated, installer now runs
                                      past QGuiApplication ctor and
                                      enters Phase 1.
  G2 (clean install)        BLOCKED   on new downstream gap: ``repro
                                      hardware probe`` calls lsblk
                                      and dies on ``lsblk JSON parse
                                      error: input(1, 1) Error: {
                                      expected``. NEW issue, unrelated
                                      to the M9.R.38 ABI mismatch.
  G3 (boot installed)       BLOCKED   on G2.
  G4 (DE smoke transcript)  BLOCKED   on G3.
  G5 (close-out)            this file.

EXECUTIVE SUMMARY
=================

M9.R.38 characterised a live-ISO heap-corruption crash that hit
``reproos-installer`` after the Qt locale C->C.UTF-8 log line +
before Phase 1, with the SIGABRT signature ``munmap_chunk():
invalid pointer``. The M9.R.38.3 nix-store-Qt6 skip didn't help,
falsifying Qt6-version-mix as the primary cause.

M9.R.39 closes that gap. Phase A added instrumentation, Phase B
characterised the actual cause, Phase C landed the fix, Phase D
ran the install and observed the installer running past the
crash point and into Phase 1 (where a new lsblk-parse gap
surfaced).

The actual root cause was a glibc-instance mix-up: the installer
binary's PT_INTERP nix-glibc-2.40-66 ld.so was loaded into a
process whose libc.so.6 / libm.so.6 / libpthread.so.0 resolved via
/etc/ld.so.cache to Debian's /lib/x86_64-linux-gnu/, while
libdl.so.2 / libresolv.so.2 / librt.so.1 resolved to nix glibc
(the only subset reachable via the RPATH-reflected /nix/store
path). Two glibc instances' private heap + TLS data structures
shared one process. malloc()/free() on one half corrupted the
other half's bookkeeping; the first big static-init heap
allocation past the QGuiApplication ctor tripped
``munmap_chunk(): invalid pointer`` and SIGABRT.

The fix narrows the override scope to the installer process tree:
the launcher puts the installer binary's PT_INTERP nix-glibc dir
at the front of the installer's library search path, either via
strace's ``-E`` flag setting LD_LIBRARY_PATH (DIAG mode) or via
direct invocation of the nix ld.so with ``--library-path``
(non-DIAG mode), so the wrapper chain (strace/stdbuf/env) stays
on Debian libc. The installer's own process gets ALL glibc
subsystems resolved from the SAME nix glibc instance.

WHAT LANDED
===========

M9.R.39.1  LD_DEBUG=libs + diag-persist via /dev/vdb         (aa47631f)
           scratch disk; launcher captures every shared-lib
           resolution decision + per-tid kernel stacks +
           tars them to /dev/vdb sector 0+ before exit so
           the M9.R.37/M9.R.38 tmpfs-log loss problem
           doesn't recur.

M9.R.39.2  systemd autorun unit bypasses serial-getty wedge  (d7fa540b)
           The M9.R.39.1 FIFO + login + manual installer
           invocation pattern wedged on a terminfo-init loop
           in serial-getty's autologin chain. Replace with
           a ``reproos-installer-autorun.service`` systemd
           unit gated on
           ``ConditionKernelCommandLine=repro.installer.autorun=1``.
           Driver no longer needs a FIFO at all.

M9.R.39.3  propagate REPRO_INSTALLER_AUTORUN through the     (d75fe3d1)
           recipe wrapper. The engine's command-line
           wrapper in repro.nim filters outer env vars; the
           new var didn't reach build-iso.sh until added
           to the explicit prefix list.

M9.R.39.4  switch autorun unit to self-gating script         (06f5e3c4)
           ExecStart. M9.R.39.2's ConditionKernelCommandLine
           silently no-op'd (Wants symlink + unit file +
           cmdline param all verified present, yet no
           ``Starting`` line in the boot log). Replaced
           with a wrapper script that parses /proc/cmdline
           at run time + skips when the param is absent;
           powers off at the end unconditionally so QEMU
           exits cleanly. This is the first iteration that
           actually ran the launcher to completion under
           the diag harness.

M9.R.39.5  prepend installer PT_INTERP glibc dir to          (0c632fa5)
           LD_LIBRARY_PATH. The actual Phase C fix.

M9.R.39.6  tracer wrapper chain must not inherit             (91c41afd)
           LD_LIBRARY_PATH. M9.R.39.5 had strace inherit
           LD_LIBRARY_PATH and hit ``__nptl_change_stack_perm,
           version GLIBC_PRIVATE`` (Debian strace + nix
           libc.so.6 symbol gap, RC=127). Move to strace's
           -E flag (sets env on traced child only) and
           direct nix-ld.so invocation in non-DIAG mode.
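
The self-gating check described under M9.R.39.4 can be sketched as
follows. This is an illustration only: the real wrapper script is
not reproduced in this file, and the function names
``autorun_requested`` and ``read_cmdline`` are hypothetical. Only
the parameter name ``repro.installer.autorun=1`` and the
/proc/cmdline source come from the notes above.

```python
def autorun_requested(cmdline: str,
                      param: str = "repro.installer.autorun",
                      want: str = "1") -> bool:
    """Return True when the kernel command line carries param=want.

    Tokens are compared whole, so a longer parameter that merely
    ends with the same text does not count as a match.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == param and value == want:
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line (hypothetical helper)."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return handle.read()
```

A wrapper built on this would call
``autorun_requested(read_cmdline())`` and exit early when it returns
False, which keeps the gating decision visible in the unit's own
log instead of inside a systemd condition.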

PHASE A: INSTRUMENT (M9.R.39.1 + M9.R.39.4)
============================================

The M9.R.38 tmpfs-log loss problem: live-ISO's /tmp is on a
squashfs+overlay tmpfs that vanishes on poweroff. M9.R.37/38's
strace + kernelstacks lived there and never reached the host
after QEMU exited.

M9.R.39.1 fix: launcher's DIAG mode tars the diag tree
(installer.strace, installer.kernelstacks, installer.lddebug,
installer.log, installer.binfo, installer.rc) and dd's the
gzipped tarball onto /dev/vdb's raw sectors with a
``M9R39DIAGv1 SIZE=<bytes>\n`` header on sector 0. The host
driver attaches a 64 MiB raw scratch disk + post-mortem reads
the header + extracts the tarball.
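
A minimal sketch of the host-side read-back, under stated
assumptions. The header text ``M9R39DIAGv1 SIZE=<bytes>\n`` is taken
from the paragraph above; the payload offset is NOT stated in these
notes, so the sketch ASSUMES the gzipped tarball starts at sector 1
(byte 512). ``parse_diag_header`` and ``list_diag_members`` are
hypothetical names, not the driver's actual functions.

```python
import io
import tarfile

SECTOR = 512                 # assumed sector size of the scratch disk
MAGIC = b"M9R39DIAGv1"       # header magic quoted in the notes


def parse_diag_header(sector0: bytes) -> int:
    """Return the payload size announced by the sector-0 header."""
    line, newline, _ = sector0.partition(b"\n")
    if not newline:
        raise ValueError("no newline-terminated header in sector 0")
    fields = line.split()
    if (len(fields) != 2 or fields[0] != MAGIC
            or not fields[1].startswith(b"SIZE=")):
        raise ValueError(f"unexpected header: {line!r}")
    return int(fields[1][len(b"SIZE="):].decode("ascii"))


def list_diag_members(image: bytes, payload_offset: int = SECTOR) -> list[str]:
    """List the tarball members stored after the header.

    payload_offset is an ASSUMPTION (sector 1); adjust it to match
    wherever the launcher's dd actually places the tarball.
    """
    size = parse_diag_header(image[:SECTOR])
    blob = image[payload_offset:payload_offset + size]
    if len(blob) != size:
        raise ValueError("image is shorter than the SIZE in the header")
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return sorted(tar.getnames())
```

Carrying an explicit SIZE in the header is what lets the reader
ignore the unused tail of the 64 MiB disk instead of guessing where
the gzip stream ends.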

M9.R.39.4 fix: systemd autorun unit replaces the FIFO + login
chain. The driver no longer needs to send input -- the live
ISO boots into the launcher via systemd.

Verification: M9.R.39.4 + M9.R.39.5 install run produced a
73 KiB diag tarball with all 6 expected files, including a
3.4 MiB strace + 47 KiB lddebug + 12 KiB binfo.

PHASE B: CHARACTERISE (M9.R.39.4 run)
======================================

Diag tarball: /tmp/m9r39_diag_phaseB/installer.diag.tar.gz
(snapshot of the M9.R.39.4 run at 03:38, RC=134 SIGABRT).

Key channels:

installer.log (40 bytes, the SIGABRT diagnostic itself):

    munmap_chunk(): invalid pointer
    Aborted

installer.binfo ldd resolution view of /usr/bin/reproos-installer
on the live ISO (M9.R.39.4 launcher's binfo dump, no
M9.R.39.5 LD_LIBRARY_PATH override yet):

    libstdc++.so.6  -> /nix/store/xm08aqdd7pxcdhm0ak6aqb1v7hw5q6ri-gcc-14.3.0-lib/lib/libstdc++.so.6
    libQt6Core.so.6 -> /opt/repro/.../qt6-base/.../lib/libQt6Core.so.6
    libgcc_s.so.1   -> /nix/store/xm08aqdd7pxcdhm0ak6aqb1v7hw5q6ri-gcc-14.3.0-lib/lib/libgcc_s.so.1
    libc.so.6       -> /lib/x86_64-linux-gnu/libc.so.6 (Debian)
    libm.so.6       -> /lib/x86_64-linux-gnu/libm.so.6 (Debian)
    libpthread.so.0 -> /lib/x86_64-linux-gnu/libpthread.so.0 (Debian)
    libdl.so.2      -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libdl.so.2
    libresolv.so.2  -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libresolv.so.2
    librt.so.1      -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/librt.so.1
    PT_INTERP       -> /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/ld-linux-x86-64.so.2

ldconfig -p showed Debian's libc.so.6 FIRST + nix glibc-2.40-66's
libc.so.6 SECOND in the cache. The from-source binary's
PT_INTERP nix-glibc ld.so loaded, fell through to ld.so.cache
(Debian first) for libc.so.6 / libm.so.6 / libpthread.so.0, and
found libdl.so.2 / libresolv.so.2 / librt.so.1 via the
RPATH-reflected /nix/store glibc path (M9.R.37.5 launcher's
LD_LIBRARY_PATH set those dirs). Two glibc instances'
private heap + TLS metadata in one process; the first big
static-init heap allocation past QGuiApplication ctor tripped
``munmap_chunk(): invalid pointer``.

installer.strace shows the abort signature:

    616 1782434312.742906 access("/usr/bin/qt.conf", F_OK) = -1 ENOENT
    616 1782434312.744512 writev(2, ["munmap_chunk(): invalid pointer", "\n"])
    616 1782434312.749732 tgkill(616, 616, SIGABRT) = 0
    616 1782434312.751042 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=616, si_uid=0} ---
    616 1782434312.756735 +++ killed by SIGABRT +++

The specific offending library is libc.so.6 -- BOTH copies
are loaded into the same process:
    /lib/x86_64-linux-gnu/libc.so.6 (Debian)
    /nix/store/xx7cm72qy2c0643cm1ipngd87aqwkcdp-glibc-2.40-66/lib/libc.so.6 (nix)
Both have the SAME soname (libc.so.6) but DIFFERENT internal
layouts -- compiled from different glibc 2.4x branches with
different malloc state machines.
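
The mix-up is visible mechanically in a binfo-style listing: the
glibc-family sonames resolve to more than one directory. The sketch
below shows one way to flag that, assuming input lines of the form
``soname -> path`` as in the installer.binfo excerpt. The helper
names ``glibc_dirs`` and ``is_mixed`` are hypothetical and are not
part of the launcher.

```python
import posixpath

# glibc-family sonames named in the Phase B listing.
GLIBC_SONAMES = {
    "libc.so.6", "libm.so.6", "libpthread.so.0",
    "libdl.so.2", "libresolv.so.2", "librt.so.1",
}


def glibc_dirs(resolution_lines):
    """Map each directory to the glibc sonames resolved from it."""
    dirs = {}
    for line in resolution_lines:
        soname, arrow, rest = line.partition("->")
        tokens = rest.split()
        soname = soname.strip()
        if not arrow or not tokens or soname not in GLIBC_SONAMES:
            continue  # not a glibc component, or not a resolution line
        directory = posixpath.dirname(tokens[0])
        dirs.setdefault(directory, set()).add(soname)
    return dirs


def is_mixed(resolution_lines):
    """True when glibc components come from more than one directory."""
    return len(glibc_dirs(resolution_lines)) > 1
```

Run against the pre-fix listing this groups libc/libm/libpthread
under /lib/x86_64-linux-gnu and libdl/libresolv/librt under the
nix glibc-2.40-66 lib dir, which is the two-instance condition
described above.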

PHASE C: FIX (M9.R.39.5 + M9.R.39.6)
====================================

Per the project's "from-source propagation, no apt fallback"
principle (cross-source-build/AGENTS.md), the fix shape is
Option W: the launcher discovers the installer binary's
PT_INTERP nix-store glibc dir + prepends it to LD_LIBRARY_PATH
just for the installer process tree. Debian binaries on the
live ISO (cat/head/ls/etc) keep their compatible glibc.

Discovery: the launcher scans the installer binary's first
4 KiB for the PT_INTERP nix-store glibc path via
``head -c 4096 | grep -oE '/nix/store/[a-z0-9]+-glibc-[^/]+/lib/ld-linux-x86-64\.so\.2'``.
It falls back to ``ls -d /nix/store/*-glibc-*/lib`` if header
extraction fails.
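
The discovery step can be sketched outside the shell as below. The
regular expression is the one quoted in the grep command above; the
function name ``nix_glibc_libdir`` is hypothetical, and the sketch
covers only the header scan, not the ``ls -d`` fallback.

```python
import os
import posixpath
import re

# Same pattern as the launcher's grep -oE, applied to raw bytes.
INTERP_RE = re.compile(
    rb"/nix/store/[a-z0-9]+-glibc-[^/]+/lib/ld-linux-x86-64\.so\.2"
)


def nix_glibc_libdir(header: bytes, limit: int = 4096):
    """Return the lib dir of the nix glibc named in PT_INTERP, or None.

    Only the first `limit` bytes are searched, mirroring head -c 4096.
    """
    match = INTERP_RE.search(header[:limit])
    if match is None:
        return None
    return posixpath.dirname(os.fsdecode(match.group(0)))
```

Returning None rather than raising leaves room for the caller to
try the fallback path, as the launcher does.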

M9.R.39.5 placed LD_LIBRARY_PATH on the strace + stdbuf wrapper
chain's env via the launcher's ``env LD_LIBRARY_PATH=... strace
...`` pattern -- which caused strace + stdbuf to resolve their
OWN libc.so.6 against the nix glibc + hit
``undefined symbol: __nptl_change_stack_perm, version GLIBC_PRIVATE``
and exit RC=127 (M9.R.39.5 install attempt at 04:18:14).

M9.R.39.6 corrected the env-scoping: DIAG mode uses strace's
``-E var=val`` flag (sets env on traced child only -- strace
itself sees a clean env without LD_LIBRARY_PATH and stays on
Debian libc). Non-DIAG mode invokes the nix-glibc ld.so
DIRECTLY via ``ld-linux-x86-64.so.2 --library-path ...``, so
nothing has to be exported through LD_LIBRARY_PATH: the kernel
never performs the binary's PT_INTERP lookup, and ld.so takes
the search path from ``--library-path`` itself.
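
The two invocation shapes can be written out as argument vectors.
This is a sketch under assumptions: ``-E`` for strace and
``--library-path`` for ld.so are the flags named above, while the
``-f`` and ``-o`` strace options, the function names, and the trace
file path are illustrative choices and not taken from the launcher.

```python
def prepend_path(libdir, inherited=""):
    """Put libdir first, keeping any inherited search path after it."""
    return f"{libdir}:{inherited}" if inherited else libdir


def diag_argv(libdir, installer, args, trace_file, inherited=""):
    """DIAG mode: strace keeps its own env; only the traced child
    receives LD_LIBRARY_PATH through -E."""
    value = prepend_path(libdir, inherited)
    return ["strace", "-f", "-o", trace_file,
            "-E", f"LD_LIBRARY_PATH={value}",
            installer, *args]


def plain_argv(libdir, installer, args, inherited=""):
    """Non-DIAG mode: run the nix ld.so directly and hand it the
    search path, so no environment variable is involved."""
    loader = f"{libdir}/ld-linux-x86-64.so.2"
    return [loader, "--library-path", prepend_path(libdir, inherited),
            installer, *args]
```

In both shapes the wrapper process itself never sees the nix glibc
dir in its environment, which is the property M9.R.39.5 lacked.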

Verification (M9.R.39.6 install run at 04:35:41):

installer.rc: 1 (Phase 1 logical failure, NOT SIGABRT)

installer.log:
    [2026-06-26 01:35:41] Phase 1: probing hardware...
    [2026-06-26 01:35:41] $ /usr/bin/repro hardware probe ...
    Error: unhandled exception: lsblk JSON parse error: ...
    install failed: hardware probe failed

installer.strace tail shows ``calling fini:`` records for
the SAME nix glibc-2.40-66 across libpthread.so.0,
libm.so.6, libc.so.6, libdl.so.2 -- ALL glibc subsystems
consistently resolved to the same nix instance:

    glibc-2.40-66/lib/libpthread.so.0       (was Debian before)
    glibc-2.40-66/lib/libm.so.6             (was Debian before)
    glibc-2.40-66/lib/libc.so.6             (was Debian before)
    glibc-2.40-66/lib/ld-linux-x86-64.so.2

The munmap_chunk() crash is GONE. The installer runs past
QGuiApplication, parses --automated, opens auto-config.toml,
starts Phase 1, calls ``repro hardware probe``, and fails on
a NEW gap: lsblk's JSON output is empty/malformed so
``repro_profile.probeFilesystemsFrom`` raises a parse error.

PHASE D: BLOCKED ON NEW lsblk DOWNSTREAM GAP
=============================================

The Phase D plan was: rebuild ISO with fix, run install, boot
installed disk, run DE smoke (sway/kwin/mutter/plasma/sddm
--version). Stage 1 (install) now reaches Phase 1 but Phase 1
fails on lsblk JSON parsing. This is a SEPARATE issue:

  * ``repro hardware probe`` spawns ``lsblk --json -b -o ...``
    and parses the output. The parser fails at the very first
    character (``input(1, 1) Error: { expected``), so the
    output it receives is empty or does not start with JSON.
  * Possible causes (not yet diagnosed):
      - lsblk binary missing from live ISO (FS:done util-linux
        recipe shadow link path)
      - lsblk runtime dep missing (libblkid / libsmartcols)
      - lsblk command-line flag mismatch between the from-source
        util-linux version and the ``repro hardware probe`` caller's
        expectation
      - The ``repro`` binary's ``runProcess()`` returns empty when
        the spawned command exits non-zero
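
One way to tell these candidate causes apart is to classify the
probe's exit status and output before any parsing is attempted. The
sketch below is illustrative and not diagnosed fact: the function
name and category labels are hypothetical, exit status 127 is
treated per the usual shell convention for a command that could not
be found or loaded, and the only lsblk behaviour relied on is that
``--json`` output is an object with a ``blockdevices`` key.

```python
import json


def classify_probe_output(returncode: int, stdout: str) -> str:
    """Sort a captured lsblk run into a coarse failure category."""
    if returncode == 127:
        # Conventional status for "not found" or a loader failure,
        # e.g. a missing binary or an unresolved shared library.
        return "not-runnable"
    if returncode != 0:
        # lsblk ran but rejected something, e.g. an unknown flag.
        return "nonzero-exit"
    if not stdout.strip():
        return "empty-output"
    try:
        document = json.loads(stdout)
    except json.JSONDecodeError:
        return "malformed-json"
    if not isinstance(document, dict) or "blockdevices" not in document:
        return "unexpected-shape"
    return "ok"
```

Logging the category next to the captured stderr would separate a
missing binary or library from a flag mismatch, and either of those
from a caller that drops output on non-zero exit.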

This is a NEW investigation surface, OUT OF SCOPE for the
M9.R.39 milestone. The honest closeout is: the M9.R.38
regression IS CLOSED (no more munmap_chunk + SIGABRT); the
NEXT layer is documented + handed off.

EVIDENCE FILES LEFT IN /tmp ON ELI-WSL
=======================================

/tmp/m9r39_diag_phaseB/       M9.R.39.4 run, RC=134 SIGABRT
  installer.strace            pre-fix syscall trace (985 KiB)
  installer.log               "munmap_chunk(): invalid pointer\nAborted"
  installer.binfo             pre-fix ldd resolution view
  installer.diag.tar.gz       73 KiB tarball
  installer.rc                134

/tmp/m9r39_diag_phaseC/       M9.R.39.6 run, RC=1 Phase 1 lsblk gap
  installer.strace            post-fix trace, NO SIGABRT
  installer.log               Phase 1 ... "lsblk JSON parse error"
  installer.binfo             post-fix ldd view (showing
                              same Debian libc resolution
                              from the ldd subprocess --
                              NOT the installer's actual
                              resolution path, which sets
                              LD_LIBRARY_PATH per
                              M9.R.39.6).
  installer.rc                1

/tmp/m9r39_install.log        full QEMU serial transcript
                              (boot + autorun + crash + poweroff)
/tmp/m9r39_install.qcow2      192 KiB qcow2 (no install body
                              written -- Phase 1 failed)

SCRIPTS / TOOLS LEFT IN REPO
============================

_m9r39_install.sh             install driver (no FIFO; reads
                              back diag tarball from
                              /tmp/m9r39_diag.qcow2 raw
                              sectors)
_m9r39_iso_rebuild.sh         ISO rebuild with
                              REPRO_INSTALLER_AUTORUN=1

HONEST REMAINING GAP
====================

M9.R.39's primary scope -- characterise + fix the M9.R.38
heap-corruption regression -- is CLOSED. The installer no
longer crashes on Qt static-init heap allocations.

The "from-source propagation" architectural principle is now
materialised at the launcher level: nix-glibc subsystems resolve
consistently inside the installer process tree; Debian
binaries on the live ISO keep their compatible glibc.

G2/G3/G4 are still blocked, but on a different cause: the
NEW lsblk-parse failure in Phase 1's hardware probe. That's
a downstream surface that needs its own characterisation pass
(M9.R.40 candidate) -- ``lsblk --json`` returning empty or
malformed inside the live ISO's chroot, the ``repro`` binary's
runProcess error handling, or a from-source util-linux flag
mismatch.

The investigator who picks this up has:
  * The clean test harness in _m9r39_install.sh + .iso
    (REPRO_INSTALLER_AUTORUN=1 enables one-button reproduction)
  * The full strace trace of the post-fix Phase 1 attempt
    showing exactly what the ``repro`` subprocess sees
  * The launcher's nix-glibc resolution architecture (the
    Phase C fix that DID land) as the starting point for
    extending the LD_LIBRARY_PATH override into ``repro
    hardware probe``'s QProcess children if that turns out
    to be the cause

The M9.R.39 milestone closes its OWN gap and hands off the
NEXT gap with full evidence.
