gvisor/kvm crash on Arm64 #12917

@haoyifan

Description

Summary

The KVM platform on ARM64 crashes with SIGSEGV when the sentry's Go runtime calls VDSO functions (specifically __kernel_getrandom) inside the KVM guest. I suspect the crash is caused by a mismatch in ARM Pointer Authentication (PAC) state between Guest EL1 (where paciasp signs the return address) and Host EL0 (where autiasp verifies it after sigreturn). The evidence is laid out below.

The crash is 100% reproducible with workloads that perform ≥100 sequential fork/exec operations. It does not occur on x86, or on ARM64 with the systrap or ptrace platforms.

I suspect this is a bug in how gVisor handles PAC across the guest EL1 / host EL0 boundary. Patching the VDSO's PAC instructions to NOP makes the crash go away, but I don't think that is the right fix, so I'm opening this issue. The workaround is described in detail below.

Crash chain

  1. The sentry's Go runtime at Guest EL1 calls runtime.vgetrandom1,
    which invokes the VDSO's __kernel_getrandom function.

  2. The VDSO's prologue executes paciasp — this signs the return
    address (R30/LR) using the current PAC key and SP as context. However,
    gVisor does not enable PAC for the KVM guest (neither
    KVM_ARM_VCPU_PTRAUTH_ADDRESS nor KVM_ARM_VCPU_PTRAUTH_GENERIC are
    set in vcpuInit.features). The behavior of paciasp at Guest EL1
    without PAC enabled depends on HCR_EL2.API:

    • If HCR_EL2.API=0: PAC instructions trap to EL2, or are treated as
      HINT (NOP) depending on the CPU.

    In either case, R30 is not properly signed with the host's PAC key.

  3. The VDSO saves the (incorrectly signed or unsigned) R30 to its stack
    frame, then continues executing. Eventually it executes a SVC
    instruction for the actual getrandom syscall.

  4. The SVC at Guest EL1 triggers El1_sync → KERNEL_ENTRY_FROM_EL1,
    which saves the current registers (including the VDSO's PC and SP) into
    c.CPU.Registers(). Then KernelSyscallHalt → KVM exit.

  5. bluepillArchExit copies c.CPU.Registers() (containing the VDSO's
    PC, SP, and other sentry registers) into the host's UContext64 signal
    context. sigreturn restores these registers.

  6. The host thread resumes execution at the VDSO's SVC instruction at
    Host EL0. The SVC completes as a real host getrandom syscall.
    Execution continues through the VDSO epilogue.

  7. The VDSO epilogue loads R30 from the stack (ldp x29, x30, [sp], #128)
    and executes autiasp — which verifies R30's PAC tag using the
    host's PAC key. The tag was computed with the guest's PAC state
    (or not computed at all). Verification fails. The CPU raises
    an exception, delivered as SIGILL.

  8. gVisor's sighandler catches all SIGILLs (it's installed for the
    bluepill mechanism). It calls bluepillArchEnter, which reads R8 from
    the signal context as a *vCPU pointer. But R8 contains whatever the
    VDSO had in R8 at the time of the fault — not a valid vCPU pointer.
    Dereferencing it causes SIGSEGV.

Why x86 Is Not Affected

There is no equivalent of VDSO PAC because x86 does not have pointer authentication.

Why It Correlates With Fork/Exec Count

The Go runtime calls runtime.vgetrandom1 for random number generation (goroutine IDs, map hash seeds, stack randomization). More fork/exec → more goroutines → more vgetrandom1 calls → higher probability that a VDSO getrandom SVC is the specific syscall captured in CPU_REGISTERS at VM exit time.

Evidence

1. Crash PC is always VDSO autiasp

Four independent gdb catches (attaching to the sandbox process with a
SIGSEGV catchpoint) all show the SIGILL's faulting PC at the same VDSO
offset:

ctx.PC = 0xe2833773bc3c   (VDSO base + 0xc3c)
ctx.PC = 0xe93b93ccdc3c   (VDSO base + 0xc3c)
ctx.PC = 0xf33f80d43c3c   (VDSO base + 0xc3c)
ctx.PC = 0xfd33596b5c3c   (VDSO base + 0xc3c)

VDSO objdump confirms autiasp at offset 0xc3c, inside
__kernel_getrandom:

$ objdump -d /tmp/vdso.so | grep -E "paciasp|autiasp"
 a68: d503233f  paciasp        ← __kernel_getrandom entry
 c3c: d50323bf  autiasp        ← crash here

2. Crash PC matches the sandbox's own VDSO address

Same-run comparison (gdb attached to sandbox process, reading ctx.PC
from UContext64, and /proc/PID/maps from the same process):

VDSO base:     0xefc8da221000  (from /proc/$SANDBOX/maps)
VDSO + 0xc3c:  0xefc8da221c3c
Crash ctx.PC:  0x0000efc8da221c3c  ← identical
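
The same-run comparison above can be scripted. The following is a minimal sketch, assuming the contents of /proc/PID/maps have already been read into a string; the helper name `vdsoBase` and the end address in the sample line are illustrative and not taken from gVisor or from the original capture.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// vdsoBase scans the text of /proc/PID/maps and returns the start address
// of the [vdso] mapping. The helper name is hypothetical.
func vdsoBase(maps string) (uint64, bool) {
	sc := bufio.NewScanner(strings.NewReader(maps))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Format: start-end perms offset dev inode [pathname]
		if len(fields) < 6 || fields[5] != "[vdso]" {
			continue
		}
		start, _, ok := strings.Cut(fields[0], "-")
		if !ok {
			continue
		}
		base, err := strconv.ParseUint(start, 16, 64)
		if err != nil {
			continue
		}
		return base, true
	}
	return 0, false
}

func main() {
	// Base address from the report; the end address is made up for the example.
	maps := "efc8da221000-efc8da223000 r-xp 00000000 00:00 0    [vdso]\n"
	crashPC := uint64(0x0000efc8da221c3c)

	base, ok := vdsoBase(maps)
	if !ok {
		fmt.Println("no [vdso] mapping found")
		return
	}
	fmt.Printf("crash offset into VDSO: %#x\n", crashPC-base)
}
```

With the values from this report the offset works out to 0xc3c, the autiasp location in the objdump listing.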

3. R30 on the VDSO stack has no valid PAC tag

gdb stack dump at crash:
  [SP-128+8] = 0x0000000000096c0c  ← R30, NO PAC bits

pauth_cmask = 0x007f000000000000   ← PAC bits should be in [54:48]
R30 & mask  = 0x0000000000000000   ← all zero = unsigned
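
The mask check above is simple bit arithmetic. As a small sketch (the function names `maskRange` and `hasPACBits` are illustrative, not gVisor or kernel APIs), the bit range covered by a PAC mask and the test for a non-zero PAC field can be computed like this:

```go
package main

import (
	"fmt"
	"math/bits"
)

// maskRange returns the highest and lowest bit positions set in mask.
// It assumes mask is non-zero and contiguous, as a PAC code mask is here.
func maskRange(mask uint64) (hi, lo int) {
	return bits.Len64(mask) - 1, bits.TrailingZeros64(mask)
}

// hasPACBits reports whether any bit of the PAC field is set in ptr.
// A zero field is consistent with an unsigned pointer, but is not proof:
// a genuine PAC can occasionally be all zeros.
func hasPACBits(ptr, mask uint64) bool {
	return ptr&mask != 0
}

func main() {
	const pauthCmask = 0x007f000000000000 // value from the gdb session above
	const savedR30 = 0x0000000000096c0c   // R30 read from the VDSO stack frame

	hi, lo := maskRange(pauthCmask)
	fmt.Printf("PAC field: bits [%d:%d]\n", hi, lo)
	fmt.Printf("R30 carries PAC bits: %v\n", hasPACBits(savedR30, pauthCmask))
}
```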

4. c.CPU.Registers().Pc contains VDSO SVC address at exit

gdb breakpoint at the instruction in bluepillArchExit that reads
c.CPU.Registers().Pc:

Pc=0x96a8c  R8=0x62  ← runtime.futex SVC (normal, safe)
Pc=0x15b7c  R8=0xe9  ← runtime.mmap SVC (normal, safe)
Pc=0xf7fe34e7ddc4  R8=0x116  ← VDSO getrandom SVC (CRASH follows)

5. runsc has zero PAC instructions; only VDSO uses PAC

$ objdump -d /usr/local/bin/runsc | grep -c paciasp
0

$ objdump -d /tmp/vdso.so | grep -c paciasp
3

6. gVisor does not enable PAC for the KVM guest

// machine_arm64_unsafe.go:80
vcpuInit.features[0] |= (1 << _KVM_ARM_VCPU_PSCI_0_2)
// No KVM_ARM_VCPU_PTRAUTH_ADDRESS (5) or KVM_ARM_VCPU_PTRAUTH_GENERIC (6)
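
For illustration only, the sketch below shows how the feature word would differ if the two pointer-authentication bits were requested. The constant and function names are hypothetical; the bit numbers (2, 5, 6) are the ones quoted above. Real code would also need to check the corresponding KVM capabilities first. KVM expects both PTRAUTH features to be requested together, and setting them does not by itself make the guest's keys match the host's.

```go
package main

import "fmt"

// Feature bit positions as quoted above; the Go names are illustrative.
const (
	kvmARMVCPUPSCI02         = 2
	kvmARMVCPUPtrauthAddress = 5
	kvmARMVCPUPtrauthGeneric = 6
)

// vcpuFeatures builds features[0] for the vCPU init call. With enablePAC
// set, both pointer-authentication features are requested together.
func vcpuFeatures(enablePAC bool) uint32 {
	f := uint32(1) << kvmARMVCPUPSCI02
	if enablePAC {
		f |= 1<<kvmARMVCPUPtrauthAddress | 1<<kvmARMVCPUPtrauthGeneric
	}
	return f
}

func main() {
	fmt.Printf("current:  %#x\n", vcpuFeatures(false))
	fmt.Printf("with PAC: %#x\n", vcpuFeatures(true))
}
```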

How I worked around it

At KVM platform init, mprotect the VDSO pages writable, replace every paciasp/autiasp with NOP, then restore the pages to read+exec. This disables PAC for VDSO functions only, and only within the sandbox process.

Some potential fixes (take with a grain of salt)

  1. Use the same approach as x86 (synthetic signal frame)
  2. Properly synchronize PAC keys between host and guest

Steps to reproduce

100% crash rate on ARM64 with PAC-capable CPU:

runsc --platform=kvm do /bin/bash -c \
  "for i in \$(seq 1 500); do /bin/echo x > /dev/null; done; echo DONE"

Exit code: 139 (SIGSEGV)

0% crash rate on same hardware with systrap:

runsc --platform=systrap do /bin/bash -c \
  "for i in \$(seq 1 500); do /bin/echo x > /dev/null; done; echo DONE"

Exit code: 0

runsc version

runsc release-20260223.0

Environment: NVIDIA Grace (ARMv9), kernel 6.17.0-14-generic (4K pages).
CPU features include `paca pacg` (PAC).
