Kernel GPF in dbuf_lightweight_bp during concurrent writes to encrypted RAIDZ2 (ZFS 2.4.0, kernel 6.17.9)
Environment
- ZFS: zfs-kmod-2.4.0 (distribution kernel package)
- Kernel: 6.17.9 (PREEMPT_VOLUNTARY, x86_64)
- Pool: RAIDZ2, 4x 18TB HDD (Seagate Exos ST18000NM000J), ashift=12
- Dataset: native encryption (aes-256-gcm), recordsize=128K
- No SLOG, no L2ARC
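For anyone trying to reproduce, the environment summary above can be collected with standard tooling; `tank` and `tank/share` below are placeholder pool/dataset names, not the real ones from this system (the `->` comments show the values reported above):

```shell
#!/bin/sh
# Hypothetical names: replace "tank" / "tank/share" with the real pool/dataset.
if command -v zfs >/dev/null 2>&1; then
    zfs version                                  # -> zfs-kmod-2.4.0
    uname -r                                     # -> 6.17.9
    zpool get -H -o value ashift tank            # -> 12
    zfs get -H -o value encryption tank/share    # -> aes-256-gcm
    zfs get -H -o value recordsize tank/share    # -> 128K
    STATUS=collected
else
    STATUS=no-zfs-tools
fi
echo "$STATUS"
```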
Workload
Two independent SMB (Samba) clients writing concurrently to the same encrypted dataset:
- Client A: rclone bulk file copy (large files, sequential)
- Client B: data recovery tool writing recovered files (mixed sizes, somewhat random)
Both clients sustained heavy writes for over ten hours before the crash.
Crash Details
General protection fault in `dbuf_lightweight_bp`, triggered from a `z_wr_iss` taskq thread during zio ready processing:
[10724.925720] Oops: general protection fault, probably for non-canonical address 0x34c0768bf1ac340c: 0000 [#1] SMP NOPTI
[10724.925732] CPU: 12 UID: 0 PID: 116192 Comm: z_wr_iss Tainted: P O 6.17.9-1-pve #1 PREEMPT(voluntary)
[10724.925735] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[10724.925737] Hardware name: ASUS System Product Name/PRIME B660M-A D4, BIOS 3801 05/14/2025
[10724.925739] RIP: 0010:dbuf_lightweight_bp+0x1f/0x1b0 [zfs]
[10724.925927] Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 48 83 ec 08 4c 8b 6f 38 49 8b 45 60 <80> 78 02 01 0f 84 9f 00 00 00 48 8b 47 40 45 0f b6 65 73 4c 8b 70
[10724.925931] RSP: 0018:ffffced175c17c80 EFLAGS: 00010282
[10724.925933] RAX: 34c0768bf1ac340a RBX: ffff8a801e209c00 RCX: 0000000000000000
[10724.925936] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8a801e209c00
[10724.925939] RBP: ffffced175c17ca8 R08: 0000000000000000 R09: 0000000000000000
[10724.925941] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a709f815f10
[10724.925943] R13: ffff8a709f815f10 R14: ffff8a7ffa32f980 R15: ffff8a61a36e8358
[10724.925945] FS: 0000000000000000(0000) GS:ffff8a80df986000(0000) knlGS:0000000000000000
[10724.925947] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10724.925949] CR2: 00007ce5a4afe000 CR3: 00000001e41b9006 CR4: 0000000000f72ef0
[10724.925951] PKRU: 55555554
[10724.925953] Call Trace:
[10724.925955] <TASK>
[10724.925958] dbuf_lightweight_ready+0x46/0x2b0 [zfs]
[10724.926058] zio_ready+0x54/0x440 [zfs]
[10724.926161] zio_execute+0x8f/0x140 [zfs]
[10724.926268] taskq_thread+0x349/0x720 [spl]
[10724.926276] ? __pfx_default_wake_function+0x10/0x10
[10724.926280] ? __pfx_zio_execute+0x10/0x10 [zfs]
[10724.926380] ? __pfx_taskq_thread+0x10/0x10 [spl]
[10724.926387] kthread+0x108/0x220
[10724.926389] ? __pfx_kthread+0x10/0x10
[10724.926392] ret_from_fork+0x205/0x240
[10724.926395] ? __pfx_kthread+0x10/0x10
[10724.926398] ret_from_fork_asm+0x1a/0x30
[10724.926401] </TASK>
Post-crash behaviour
- Pool remained ONLINE; `zpool status` was responsive
- SMB worker processes hung (likely D-state, waiting on ZFS I/O)
- `sync` hung indefinitely (write path broken)
- Clean reboot was slow (blocked by the hung sync) but eventually completed after a ~25 minute wait
- No new data errors detected after reboot; scrub in progress
Analysis
- The non-canonical address in RAX (`0x34c0768bf1ac340a`) strongly suggests a use-after-free or stale pointer dereference -- the dirty record or parent dnode appears to have been freed or recycled while the zio was still in flight.
- `dbuf_lightweight_bp` is a zio callback registered during txg sync for lightweight dirty records. It dereferences the dbuf's `db_dnode_handle` to reach the dnode, and the faulting instruction is consistent with following a poisoned or freed pointer from that chain.
- The lightweight write path (`dbuf_dirty_lightweight()`) was originally designed for sequential, write-only workloads (primarily `zfs receive`), but concurrent random writes via the VFS/Samba path also appear to exercise this code.
- Encryption (aes-256-gcm) extends the zio pipeline with additional async stages (encrypt -> checksum -> write), potentially widening the race window between dirty-record lifetime management and zio completion callbacks.
- `zfs_dirty_data_max` was set to 34 GB at the time of the crash (auto-tuned), which may have increased dirty-record pressure and extended the race window on HDD-backed RAIDZ2, where write latency is high.
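To illustrate the last point, a back-of-envelope calculation shows how long a fully dirtied 34 GiB can keep zios in flight during a single txg sync. The 34 GiB figure is from this report; the ~400 MiB/s aggregate throughput is a hypothetical estimate for a 4-disk HDD RAIDZ2, not a measured value:

```shell
#!/bin/sh
# Rough worst-case txg sync window: dirty data ceiling / aggregate throughput.
# 34 GiB is from this report; 400 MiB/s is a hypothetical RAIDZ2 HDD estimate.
DIRTY_MIB=$((34 * 1024))
THROUGHPUT_MIB_S=400
SYNC_SECS=$((DIRTY_MIB / THROUGHPUT_MIB_S))
echo "approx. worst-case txg sync window: ~${SYNC_SECS}s"
```

A sync window on the order of a minute and a half gives in-flight zios far longer to race against dirty-record teardown than the sub-second syncs typical of SSD pools with default tuning.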
Possibly related issues
- [2.1] Fix raw receive with different indirect block size. #15073 -- `dbuf_dirty_lightweight` assertion failure
- Encryption causes kernel panics (with and without compression) within an hour of high write I/O load #10570 -- encryption + heavy writes kernel panic
- GPF for non-canonical address in `dmu_zfetch_fini` #16895 -- GPF / use-after-free in dnode cleanup
- 2.3.2 causing kernel panic and I/O hangs, 2.3.1 works on same dataset #17307 -- kernel panic in write path
Workaround
Reduced `zfs_dirty_data_max` to 2 GB and `zfs_txg_timeout` to 3 seconds, and serialised write workloads (one heavy writer at a time). No recurrence since applying these changes, but this is not confirmed as a fix -- it may simply reduce the probability of hitting the race.
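For reference, the two tunables can be applied at runtime via the ZFS module parameter files (standard sysfs paths on Linux; the values are the ones used in this workaround, and the writes are skipped if the ZFS module is not loaded):

```shell
#!/bin/sh
# Workaround tunables from this report. zfs_dirty_data_max is in bytes,
# zfs_txg_timeout in seconds. Writes are guarded so the script is a no-op
# on systems without the ZFS module loaded.
DIRTY_MAX=$((2 * 1024 * 1024 * 1024))   # 2 GiB
TXG_TIMEOUT=3
P=/sys/module/zfs/parameters
[ -w "$P/zfs_dirty_data_max" ] && echo "$DIRTY_MAX"   > "$P/zfs_dirty_data_max"
[ -w "$P/zfs_txg_timeout" ]   && echo "$TXG_TIMEOUT" > "$P/zfs_txg_timeout"
echo "requested zfs_dirty_data_max=$DIRTY_MAX zfs_txg_timeout=$TXG_TIMEOUT"
```

Settings made this way do not survive a reboot; to persist them, the same names can go in `/etc/modprobe.d/` as `options zfs zfs_dirty_data_max=... zfs_txg_timeout=...`.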