
Kernel GPF in dbuf_lightweight_bp during concurrent writes to encrypted RAIDZ2 (ZFS 2.4.0, kernel 6.17.9) #18253

@DevJake

Description


Environment

  • ZFS: zfs-kmod-2.4.0 (distribution kernel package)
  • Kernel: 6.17.9 (PREEMPT_VOLUNTARY, x86_64)
  • Pool: RAIDZ2, 4x 18TB HDD (Seagate Exos ST18000NM000J), ashift=12
  • Dataset: native encryption (aes-256-gcm), recordsize=128K
  • No SLOG, no L2ARC

Workload

Two independent SMB (Samba) clients writing concurrently to the same encrypted dataset:

  • Client A: rclone bulk file copy (large files, sequential)
  • Client B: data recovery tool writing recovered files (mixed sizes, somewhat random)

Both clients sustained heavy writes for 10+ hours before the crash.

Crash Details

General protection fault in dbuf_lightweight_bp, triggered from the z_wr_iss taskq thread during zio ready processing:

[10724.925720] Oops: general protection fault, probably for non-canonical address 0x34c0768bf1ac340c: 0000 [#1] SMP NOPTI
[10724.925732] CPU: 12 UID: 0 PID: 116192 Comm: z_wr_iss Tainted: P           O        6.17.9-1-pve #1 PREEMPT(voluntary)
[10724.925735] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[10724.925737] Hardware name: ASUS System Product Name/PRIME B660M-A D4, BIOS 3801 05/14/2025
[10724.925739] RIP: 0010:dbuf_lightweight_bp+0x1f/0x1b0 [zfs]
[10724.925927] Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 08 4c 8b 6f 38 49 8b 45 60 <80> 78 02 01 0f 84 9f 00 00 00 48 8b 47 40 45 0f b6 65 73 4c 8b 70
[10724.925931] RSP: 0018:ffffced175c17c80 EFLAGS: 00010282
[10724.925933] RAX: 34c0768bf1ac340a RBX: ffff8a801e209c00 RCX: 0000000000000000
[10724.925936] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a801e209c00
[10724.925939] RBP: ffffced175c17ca8 R08: 0000000000000000 R09: 0000000000000000
[10724.925941] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a709f815f10
[10724.925943] R13: ffff8a709f815f10 R14: ffff8a7ffa32f980 R15: ffff8a61a36e8358
[10724.925945] FS:  0000000000000000(0000) GS:ffff8a80df986000(0000) knlGS:0000000000000000
[10724.925947] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10724.925949] CR2: 00007ce5a4afe000 CR3: 00000001e41b9006 CR4: 0000000000f72ef0
[10724.925951] PKRU: 55555554
[10724.925953] Call Trace:
[10724.925955]  <TASK>
[10724.925958]  dbuf_lightweight_ready+0x46/0x2b0 [zfs]
[10724.926058]  zio_ready+0x54/0x440 [zfs]
[10724.926161]  zio_execute+0x8f/0x140 [zfs]
[10724.926268]  taskq_thread+0x349/0x720 [spl]
[10724.926276]  ? __pfx_default_wake_function+0x10/0x10
[10724.926280]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[10724.926380]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[10724.926387]  kthread+0x108/0x220
[10724.926389]  ? __pfx_kthread+0x10/0x10
[10724.926392]  ret_from_fork+0x205/0x240
[10724.926395]  ? __pfx_kthread+0x10/0x10
[10724.926398]  ret_from_fork_asm+0x1a/0x30
[10724.926401]  </TASK>

Post-crash behaviour

  • Pool remained ONLINE; zpool status was responsive
  • SMB worker processes hung (likely D-state waiting on ZFS I/O)
  • sync command hung indefinitely (write path broken)
  • Clean reboot was slow (blocked by the hung sync) but completed after a ~25 minute wait
  • No new data errors detected after reboot; scrub in progress
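
For reference, hung writers like the SMB workers above can be confirmed as uninterruptible (D-state) sleepers with standard tooling; a minimal check (nothing ZFS-specific, process names will vary):

```shell
# List tasks stuck in uninterruptible sleep (STAT starting with "D"),
# which on this box would include the blocked SMB workers and sync.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# For a specific hung task, the kernel stack shows where it is blocked
# (requires root):
#   cat /proc/<pid>/stack
```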

Analysis

  • The non-canonical address in RAX (0x34c0768bf1ac340a) strongly suggests use-after-free or a stale pointer dereference -- the dirty record or parent dnode appears to have been freed or recycled while the zio was still in flight.
  • dbuf_lightweight_bp is a zio callback registered during txg sync for lightweight dirty records. It dereferences the dbuf's db_dnode_handle to reach the dnode, and the faulting instruction is consistent with following a poisoned or freed pointer from that chain.
  • The lightweight write path (dbuf_dirty_lightweight()) was originally designed for sequential write-only workloads (primarily zfs receive), but concurrent random writes via the VFS/Samba path also appear to exercise this code.
  • Encryption (aes-256-gcm) extends the zio pipeline with additional async stages (encrypt -> checksum -> write), potentially widening a race window between dirty record lifetime management and zio completion callbacks.
  • zfs_dirty_data_max was set to 34GB at the time of the crash (auto-tuned), which may have increased dirty record pressure and extended the window for a race on HDD-backed RAIDZ2 where write latency is high.

Possibly related issues

Workaround

Reduced zfs_dirty_data_max to 2GB, zfs_txg_timeout to 3 seconds, and serialised write workloads (one heavy writer at a time). No recurrence since applying these changes, but this is not confirmed as a fix -- it may simply reduce the probability of hitting the race.
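
For anyone wanting to try the same mitigation, the equivalent persistent module options would look like the following (the file name is the conventional one; adjust to your distribution):

```
# /etc/modprobe.d/zfs.conf -- persists the workaround across reboots
# 2 GiB = 2147483648 bytes
options zfs zfs_dirty_data_max=2147483648
options zfs zfs_txg_timeout=3
```

The same values can be applied at runtime by writing to /sys/module/zfs/parameters/zfs_dirty_data_max and /sys/module/zfs/parameters/zfs_txg_timeout.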
