
0.63.1: PAC failure in TerminalController.v2EnsureHandleRef when a v2 socket command races startup state restore #2751

@bondijois


cmux version

0.63.1 (build 78)

macOS version

macOS 26.4 (25E246)

Mac chip

Apple Silicon (M1/M2/M3/M4)

Installation method

Homebrew

Can you reproduce this on cmux NIGHTLY?

Yes, it still reproduces on NIGHTLY

Bug description

cmux 0.63.1 (build 78) crashed with EXC_BAD_ACCESS (SIGSEGV) and a pointer-authentication failure inside TerminalController.v2EnsureHandleRef(kind:uuid:), called from TerminalController.v2RefreshKnownRefs(). The crash happened ~1.13s after process launch while a CLI client was processing a v2 command on the cmux Unix socket — Thread 23 had dispatch.sync-ed onto the main thread to run the refresh, and the main thread faulted while iterating handle UUIDs. A second client (Thread 22) was already connected and idle in read() at the same time, so this is at minimum a 2-client scenario during state restore.

Apple's crash reporter annotates the fault as KERN_INVALID_ADDRESS at 0x8000000000000010 -> 0x0000000000000010 (possible pointer authentication failure). The faulting instruction is LDR x8, [x25, #0x10] (28 0b 40 f9); x25 was 0x8000000000000000 (a pointer with only the high bit set, which is what the crash reporter flags as a possible PAC failure), and far = x25 + 0x10 is the dereference target. Stripping the high bit gives 0x10, i.e. cmux is reading the field at offset 0x10 of an object whose pointer was invalid. The most likely causes are a use-after-free where the freed slot was reused with garbage, or an uninitialized pointer being read out of v2RefByUUID / v2UUIDByRef before the handle table was fully populated.
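The decoding above can be double-checked with plain shell arithmetic. This is only a sketch: the field layout is the standard A64 encoding for the 64-bit unsigned-offset form of LDR, the variable names are mine, and the register value is copied from the crash report.

```shell
# Instruction bytes "28 0b 40 f9" are little-endian, so the 32-bit word is 0xf9400b28.
insn=$(( 0xf9400b28 ))

# Field layout of LDR Xt, [Xn, #imm] (64-bit, unsigned-offset form):
#   bits 4..0 = Rt, bits 9..5 = Rn, bits 21..10 = imm12 (scaled by 8 bytes)
rt=$(( insn & 0x1f ))
rn=$(( (insn >> 5) & 0x1f ))
offset=$(( ((insn >> 10) & 0xfff) * 8 ))
printf 'LDR x%d, [x%d, #0x%x]\n' "$rt" "$rn" "$offset"

# x25 held 0x8000000000000000 (bit 63 only). Shell arithmetic is signed 64-bit,
# so that value is written here as the most negative integer.
x25=$(( -0x7fffffffffffffff - 1 ))
far=$(( x25 + offset ))

# Clearing bit 63 leaves just the field offset.
stripped=$(( far & 0x7fffffffffffffff ))
printf 'far with bit 63 cleared: 0x%x\n' "$stripped"
```

The first line printed is `LDR x8, [x25, #0x10]`, matching the disassembly, and the second is `0x10`, matching the address the crash reporter shows after the arrow.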

I have one crash report with this exact signature so far. The 1.1s launch-to-crash delta plus the dispatch chain (processV2Command → dispatch.sync → v2RefreshKnownRefs) make a startup-race interpretation more plausible than a deterministic bug, but a one-off transient corruption isn't ruled out.

Expected behavior

A v2 socket command arriving immediately after cmux launch should not be able to fault the main thread.

Two parallel-fix patterns already exist in this code area and either would address it:

  1. Wait for handle-table readiness, the way send_key does for missing surfaces (see fix: wait for terminal surface in readTerminalTextBase64 #2006: send_key calls waitForTerminalSurface(), while read_text was crashing/erroring instead).
  2. Catch and gracefully degrade, the way fix: gracefully handle TabManager unavailable in claude-hook stop #1935 made claude-hook stop survive a nil tabManager during teardown.

A heavier fix would be to defer acceptLoop (or just v2 command processing) until v2RefByUUID / v2UUIDByRef are fully populated from persisted session state, so no v2 command can be dispatched against a half-built handle table.

Steps to reproduce

I have not been able to deterministically reproduce this on demand — it's happened once for me in a week of regular use. The best repro pattern I can describe:

  1. Have any shell init script call a v2 cmux command immediately on startup. Minimal example for ~/.zshrc:
    if [[ -n "$CMUX_WORKSPACE_ID" ]]; then
      cmux list-workspaces 2>/dev/null
    fi
  2. Quit cmux fully (Cmd+Q) so the next launch performs full state restore.
  3. Relaunch cmux. The bundled shell sources .zshrc, and cmux list-workspaces hits the socket within the first ~1s of cmux being alive.
  4. Repeat across many launches over several days. The race surfaces intermittently.

What likely matters more than the exact CLI command is (a) there is persisted session state to restore on launch, (b) at least one v2 socket client connects within the first second, and (c) the v2 handle table is being built up at the same time the v2 command is running.
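If someone wants to widen the window while hunting for this, one option is to fire several clients at once from the shell init instead of a single call. The helper below is a hypothetical sketch (fire_clients is not a cmux command), and I have not verified that more concurrent clients actually raises the hit rate.

```shell
# Hypothetical helper: start the same command N times concurrently and wait
# for all of them. The function body is a subshell, so nothing leaks into
# the interactive shell and no job notices are printed.
fire_clients() (
  count="$1"
  shift
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" >/dev/null 2>&1 &
    i=$(( i + 1 ))
  done
  wait
)

# In ~/.zshrc: two concurrent v2 clients right at shell startup.
if [ -n "${CMUX_WORKSPACE_ID:-}" ]; then
  fire_clients 2 cmux list-workspaces
fi
```

Two clients is the smallest count that matches the thread picture in the crash report; the number is otherwise arbitrary.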

Shell and environment

zsh (the bundled shell that cmux launches via its command = config), oh-my-zsh, starship prompt, zsh-autosuggestions and fast-syntax-highlighting loaded. The trigger client is cmux list-workspaces invoked from .zshrc.

Relevant logs or crash reports

### Build / binary identity

So you can resolve symbols against the matching dSYM:

- cmux short version: `0.63.1`
- cmux build version: `78`
- cmux Mach-O UUID: `e9384773-36fe-3706-9f29-aa90e319e3c6`
- cmux load address: `0x102200000`

### Exception


Exception Type:    EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x8000000000000010
                   -> 0x0000000000000010 (possible pointer authentication failure)
Termination Reason: Namespace SIGNAL, Code 11, Segmentation fault: 11


### Faulting thread (main thread)


Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   cmux                   TerminalController.v2EnsureHandleRef(kind:uuid:) + 124
1   cmux                   TerminalController.v2RefreshKnownRefs() + 1088
2   libswiftDispatch.dylib partial apply for thunk for @callee_guaranteed () -> (@out A, @error @owned Error) + 28
3   libswiftDispatch.dylib partial apply for thunk for @callee_guaranteed () -> (@out A, @error @owned Error) + 16
4   libswiftDispatch.dylib closure #1 in closure #1 in OS_dispatch_queue._syncHelper<A>(fn:execute:rescue:) + 192
5   libswiftDispatch.dylib partial apply for thunk for @callee_guaranteed () -> () + 28
6   libswiftDispatch.dylib thunk for @escaping @callee_guaranteed () -> () + 28
7   libdispatch.dylib      _dispatch_client_callout + 16
8   libdispatch.dylib      _dispatch_async_and_wait_invoke + 84
9   libdispatch.dylib      _dispatch_client_callout + 16
10  libdispatch.dylib      _dispatch_main_queue_drain.cold.6 + 832
11  libdispatch.dylib      _dispatch_main_queue_drain + 176
12  libdispatch.dylib      _dispatch_main_queue_callback_4CF + 44
... [Apple framework runloop frames]


### Faulting instruction


PC:   0x102555d34 = TerminalController.v2EnsureHandleRef(kind:uuid:) + 124
LR:   0x102555d30 = TerminalController.v2EnsureHandleRef(kind:uuid:) + 120
Bytes at PC: 28 0b 40 f9   →   LDR x8, [x25, #0x10]
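For anyone re-symbolicating, the image-relative offset follows directly from the PC and the load address listed under "Build / binary identity". A small sketch of that arithmetic; the atos invocation in the trailing comment uses a placeholder dSYM path that I made up for illustration.

```shell
# Addresses copied from this report.
load_addr=$(( 0x102200000 ))   # cmux load address
pc=$(( 0x102555d34 ))          # faulting PC
func_off=124                   # "+ 124" from the top stack frame

# Offset of the faulting instruction inside the cmux image.
image_off=$(( pc - load_addr ))
# Runtime address where v2EnsureHandleRef(kind:uuid:) starts.
func_start=$(( pc - func_off ))

printf 'image offset:   0x%x\n' "$image_off"
printf 'function start: 0x%x\n' "$func_start"

# With the matching dSYM, atos can map the runtime address to a source line.
# The path below is a placeholder, not a real location:
#   atos -o /path/to/cmux.dSYM/Contents/Resources/DWARF/cmux -l 0x102200000 0x102555d34
```

This prints an image offset of `0x355d34` and a function start of `0x102555cb8`.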


### Register state at fault


x25 = 0x8000000000000000   ← high bit set; PAC verify failed
x24 = 0x00000001fc7faf20   value witness table for UUID
x26 = 0x00000001fc7faf88
x27 = 0x0000000b38de4920
x28 = 0x0000000b3bab7c60
x10 = 0x0000000103aa7b08   value witness table for PaneID
x11 = x12 = 0x0000000000185093
far = 0x8000000000000010   ← x25 + 0x10
esr = 0x92000006           Data Abort, byte read, Translation fault


So the function is loading 8 bytes from a `[reference + 0x10]` field of an object whose pointer came out of memory with PAC bits already corrupted. The two value witness tables in the surrounding registers (`UUID` and `PaneID`) suggest the iteration is over `[PaneID: <something>]` or `[UUID: <something>]` — i.e. the v2 handle dictionaries directly. This lines up with #2192's description of `v2RefByUUID[.surface]` / `v2UUIDByRef[.surface]` being the long-lived state in this area.

### Triggering CLI client (Thread 23)


Thread 23
0   libsystem_kernel.dylib __ulock_wait + 8
1   libdispatch.dylib      _dispatch_thread_main_event_wait_slow + 76
2   libdispatch.dylib      __DISPATCH_WAIT_FOR_QUEUE__ + 464
3   libdispatch.dylib      _dispatch_sync_f_slow + 140
4   libswiftDispatch.dylib OS_dispatch_queue.asyncAndWait<A>(execute:) + 144
5   libswiftDispatch.dylib OS_dispatch_queue.sync<A>(execute:) + 64
6   cmux                   TerminalController.processV2Command(_:) + 2320
7   cmux                   TerminalController.processCommand(_:) + 228
8   cmux                   TerminalController.handleClient(_:peerPid:) + 1008
9   cmux                   closure #5 in TerminalController.acceptLoop(listenerSocket:generation:) + 92


### Second client connected at the same time (Thread 22)


Thread 22
0   libsystem_kernel.dylib read + 8
1   cmux                   TerminalController.handleClient(_:peerPid:) + 768
2   cmux                   closure #5 in TerminalController.acceptLoop(listenerSocket:generation:) + 92


So at the moment of crash, two socket clients are alive: Thread 22 idle in `read()` waiting for the next command from one client, Thread 23 deep in `processV2Command` for another. Both arrived within the first ~1s of cmux launch. The race is at minimum a multi-client one.

### Crash timing across the three reports I have locally

| date | launch +N | top frame | likely separate bug? |
|---|---|---|---|
| 2026-04-04 | +0.33s | `NSFileHandleOperationException` from `CMUXTermMain.main()` | yes — different signature |
| 2026-04-05 | +0.35s | same `NSFileHandleOperationException` | yes — same as above |
| **2026-04-09** | **+1.13s** | `TerminalController.v2EnsureHandleRef + 124` | **this report** |

All three are within ~1.2s of process launch, but only the Apr 9 one is the v2EnsureHandleRef PAC failure being reported here. The Apr 4/5 NSFileHandle crashes look like a separate bug class that probably warrants its own report.

The full `.ips` for the Apr 9 crash is attached to this issue (sanitized — per-Mac identifiers, per-boot UUIDs, and Apple submission IDs stripped; all stack frames, register state, instruction bytes, and image UUIDs preserved).

Screenshots or screen recordings

No response

Additional context

Closely related issues — same bug class

This isn't a one-off; cmux has had a recurring pattern of two adjacent bug classes that converge here.

(A) CLI commands racing app-state lifecycle:

  • fix: gracefully handle TabManager unavailable in claude-hook stop #1935: claude-hook stop was failing because TerminalController.tabManager was set to nil during teardown before the stop hook fired. Same shape: a CLI command hits a v2 handler whose backing state isn't there. It was fixed by catching/logging instead of propagating. Our crash is the inverse-time variant: the backing state isn't there yet (during startup), and the handler doesn't catch; it dereferences the bad pointer and crashes.
  • fix: wait for terminal surface in readTerminalTextBase64 #2006: readTerminalTextBase64 was crashing/erroring on a nil surface during display-sleep reparenting, while send_key already handled this correctly via waitForTerminalSurface(). The same asymmetry probably exists between v2 commands that wait for handle-table readiness and the ones that don't. v2RefreshKnownRefs clearly doesn't.

(B) PAC failures in workspace/tab object lifecycle:

(C) The data structure being faulted on:

The pattern: cmux has had repeated PAC failures whenever workspace/tab state lifecycle races a UI or socket access. UI-trigger paths got fixes; the TerminalController.v2* socket path is the next manifestation.

Workaround on my side

I deferred the offending CLI call into a backgrounded subshell with a 2s sleep, so cmux has time to populate its v2 handle table before any commands arrive. That eliminates the trigger from my side, but the underlying issue is still there for any third-party shell init, AI agent, status-bar tool, or automation script that talks to the cmux socket immediately on launch.
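A minimal sketch of that workaround, assuming a POSIX-compatible shell. The helper name cmux_deferred is hypothetical and not part of cmux, and the 2-second delay is simply the value I picked, not a guaranteed-safe bound.

```shell
# Hypothetical helper (not a cmux command): run a command after a delay,
# detached from the interactive shell so startup is not blocked.
cmux_deferred() (
  delay="$1"
  shift
  # The group runs in the background; all of its output is discarded.
  { sleep "$delay"; "$@"; } >/dev/null 2>&1 &
)

# In ~/.zshrc: only fire inside a cmux-launched shell, 2s after startup.
if [ -n "${CMUX_WORKSPACE_ID:-}" ]; then
  cmux_deferred 2 cmux list-workspaces
fi
```

Because the function body is itself a subshell, the background job is not registered with the interactive shell, so zsh prints no job notices. The trade-off is that the command's output is thrown away, unlike the original .zshrc snippet, which only silenced stderr.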

Why I can't test NIGHTLY

I'm pinned to 0.63.1 because of the drag-select auto-scroll regression in 0.63.2 (tracked separately). Installing nightly would re-introduce that bug for me. If nightly already has a fix for this race — point me at the commit and I'll re-symbolicate this .ips against it and confirm.


Full sanitized crash report:

cmux-2026-04-09-023322.redacted.ips.txt
