Closed
Labels: Type: Defect (incorrect behavior, e.g. crash, hang)
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu |
| Distribution Version | 22.04.5 LTS |
| Kernel Version | 5.15.0-170-generic |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.5-1ubuntu6~22.04.6 |
Describe the problem you're observing
It looks like we've caught a deadlock on one of our systems between txg_sync and agents (zed?), which was calling zfs_ioc_vdev_attach(). Here are the stacks:
txg_sync:
[Fri Feb 13 09:43:51 2026] task:txg_sync state:D stack: 0 pid: 8607 ppid: 2 flags:0x00004000
[Fri Feb 13 09:43:51 2026] Call Trace:
[Fri Feb 13 09:43:51 2026] <TASK>
[Fri Feb 13 09:43:51 2026] __schedule+0x24e/0x590
[Fri Feb 13 09:43:51 2026] schedule+0x69/0x110
[Fri Feb 13 09:43:51 2026] cv_wait_common+0xf8/0x130 [spl]
[Fri Feb 13 09:43:51 2026] ? wait_woken+0x70/0x70
[Fri Feb 13 09:43:51 2026] __cv_wait+0x15/0x20 [spl]
[Fri Feb 13 09:43:51 2026] spa_config_enter+0xf9/0x120 [zfs]
[Fri Feb 13 09:43:51 2026] spa_sync+0x6d/0x5b0 [zfs]
[Fri Feb 13 09:43:51 2026] txg_sync_thread+0x266/0x2f0 [zfs]
[Fri Feb 13 09:43:51 2026] ? txg_dispatch_callbacks+0x100/0x100 [zfs]
[Fri Feb 13 09:43:51 2026] thread_generic_wrapper+0x64/0x80 [spl]
[Fri Feb 13 09:43:51 2026] ? __thread_exit+0x20/0x20 [spl]
[Fri Feb 13 09:43:51 2026] kthread+0x12a/0x150
[Fri Feb 13 09:43:51 2026] ? set_kthread_struct+0x50/0x50
[Fri Feb 13 09:43:51 2026] ret_from_fork+0x22/0x30
[Fri Feb 13 09:43:51 2026] </TASK>
agents:
[Fri Feb 13 09:43:51 2026] INFO: task agents:11306 blocked for more than 120 seconds.
[Fri Feb 13 09:43:51 2026] Tainted: P W O L 5.15.0-161-generic #171-Ubuntu
[Fri Feb 13 09:43:51 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Feb 13 09:43:51 2026] task:agents state:D stack: 0 pid:11306 ppid: 1 flags:0x00004002
[Fri Feb 13 09:43:51 2026] Call Trace:
[Fri Feb 13 09:43:51 2026] <TASK>
[Fri Feb 13 09:43:51 2026] __schedule+0x24e/0x590
[Fri Feb 13 09:43:51 2026] ? default_wake_function+0x1a/0x40
[Fri Feb 13 09:43:51 2026] schedule+0x69/0x110
[Fri Feb 13 09:43:51 2026] cv_wait_common+0xf8/0x130 [spl]
[Fri Feb 13 09:43:51 2026] ? wait_woken+0x70/0x70
[Fri Feb 13 09:43:51 2026] __cv_wait+0x15/0x20 [spl]
[Fri Feb 13 09:43:51 2026] dmu_tx_wait+0x8e/0x1e0 [zfs]
[Fri Feb 13 09:43:51 2026] dmu_tx_assign+0x49/0x80 [zfs]
[Fri Feb 13 09:43:51 2026] vdev_rebuild_initiate+0x39/0xc0 [zfs]
[Fri Feb 13 09:43:51 2026] vdev_rebuild+0x84/0x90 [zfs]
[Fri Feb 13 09:43:51 2026] spa_vdev_attach+0x305/0x680 [zfs]
[Fri Feb 13 09:43:51 2026] zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]
[Fri Feb 13 09:43:51 2026] zfsdev_ioctl_common+0x683/0x740 [zfs]
[Fri Feb 13 09:43:51 2026] ? __check_object_size.part.0+0x4a/0x150
[Fri Feb 13 09:43:51 2026] ? _copy_from_user+0x31/0x70
[Fri Feb 13 09:43:51 2026] zfsdev_ioctl+0x57/0xf0 [zfs]
[Fri Feb 13 09:43:51 2026] __x64_sys_ioctl+0x95/0xd0
[Fri Feb 13 09:43:51 2026] x64_sys_call+0x1e5f/0x1fa0
[Fri Feb 13 09:43:51 2026] do_syscall_64+0x56/0xb0
[Fri Feb 13 09:43:51 2026] ? handle_mm_fault+0xd8/0x2c0
[Fri Feb 13 09:43:51 2026] ? do_user_addr_fault+0x1e7/0x640
[Fri Feb 13 09:43:51 2026] ? syscall_exit_to_user_mode+0x6a/0x80
[Fri Feb 13 09:43:51 2026] ? arch_exit_to_user_mode_prepare.constprop.0+0x1e/0xc0
[Fri Feb 13 09:43:51 2026] ? irqentry_exit_to_user_mode+0x25/0x50
[Fri Feb 13 09:43:51 2026] ? irqentry_exit+0x1d/0x30
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[Fri Feb 13 09:43:51 2026] RIP: 0033:0x7f2f256139bf
[Fri Feb 13 09:43:51 2026] RSP: 002b:00007f2f1fff90d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Fri Feb 13 09:43:51 2026] RAX: ffffffffffffffda RBX: 0000560f561577d0 RCX: 00007f2f256139bf
[Fri Feb 13 09:43:51 2026] RDX: 00007f2f1fff95a0 RSI: 0000000000005a0e RDI: 000000000000000d
[Fri Feb 13 09:43:51 2026] RBP: 00007f2f1fffcb90 R08: 0000000000000000 R09: 00007f2f18088870
[Fri Feb 13 09:43:51 2026] R10: 00007f2f18000650 R11: 0000000000000246 R12: 00007f2f18029b20
[Fri Feb 13 09:43:51 2026] R13: 00007f2f18040c70 R14: 00007f2f1fff95a0 R15: 00007f2f1fff91a0
[Fri Feb 13 09:43:51 2026] </TASK>
Here's the deadlock:
- spa_vdev_attach() takes spa_config_lock and holds it while vdev_rebuild_initiate() waits in dmu_tx_assign() for the txg to advance
- txg_sync blocks in spa_sync() trying to take spa_config_lock, so it cannot advance the txg
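The cycle described above can be modeled with a small toy program. This is only a sketch under simplifying assumptions: a single plain lock stands in for spa_config_lock, an event stands in for "the txg has advanced", and short timeouts replace the indefinite waits so the program terminates. The names `attach_thread`, `sync_thread`, and `run_model` are made up for illustration and are not ZFS functions.

```python
import threading

def run_model(attach_wait=0.5, sync_wait=0.2):
    """Toy model of the reported cycle; not real ZFS locking."""
    config_lock = threading.Lock()      # stands in for spa_config_lock
    txg_advanced = threading.Event()    # stands in for the txg advancing
    holding = threading.Event()         # set once the attach side holds the lock
    result = {}

    def attach_thread():
        # Models spa_vdev_attach(): take the config lock, then wait for the
        # txg to advance (dmu_tx_assign() -> dmu_tx_wait() in the trace).
        with config_lock:
            holding.set()
            result["attach_saw_txg_advance"] = txg_advanced.wait(attach_wait)

    def sync_thread():
        # Models txg_sync: spa_sync() needs the config lock before it can
        # finish the txg and wake the waiter.
        holding.wait()
        got = config_lock.acquire(timeout=sync_wait)
        result["sync_got_lock"] = got
        if got:
            txg_advanced.set()
            config_lock.release()

    threads = [threading.Thread(target=attach_thread),
               threading.Thread(target=sync_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

if __name__ == "__main__":
    print(run_model())
```

With the default timeouts neither side makes progress: the sync side gives up on the lock while the attach side is still holding it, so the attach side never sees the txg advance. In the real system there are no timeouts, so both threads stay in state D.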
This may be a known issue, but I could not find an existing report. As far as I can tell, the latest code in master is still exposed to this problem.
Describe how to reproduce the problem
A few minutes before the stacks started appearing, there were I/O errors on one of the disks, and zed probably tried to attach a dRAID hot spare to it to start a rebuild. The system was under normal but heavy I/O load.
Include any warning/errors/backtraces from the system logs
There were other threads trying to acquire spa_config_lock. I'm not sure whether they are relevant, but here are a couple of their stacks as well:
[Fri Feb 13 09:43:51 2026] INFO: task mmp:8608 blocked for more than 120 seconds.
[Fri Feb 13 09:43:51 2026] Tainted: P W O L 5.15.0-161-generic #171-Ubuntu
[Fri Feb 13 09:43:51 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Feb 13 09:43:51 2026] task:mmp state:D stack: 0 pid: 8608 ppid: 2 flags:0x00004000
[Fri Feb 13 09:43:51 2026] Call Trace:
[Fri Feb 13 09:43:51 2026] <TASK>
[Fri Feb 13 09:43:51 2026] __schedule+0x24e/0x590
[Fri Feb 13 09:43:51 2026] schedule+0x69/0x110
[Fri Feb 13 09:43:51 2026] cv_wait_common+0xf8/0x130 [spl]
[Fri Feb 13 09:43:51 2026] ? wait_woken+0x70/0x70
[Fri Feb 13 09:43:51 2026] __cv_wait+0x15/0x20 [spl]
[Fri Feb 13 09:43:51 2026] spa_config_enter+0xf9/0x120 [zfs]
[Fri Feb 13 09:43:51 2026] vdev_count_leaves+0x26/0x60 [zfs]
[Fri Feb 13 09:43:51 2026] mmp_thread+0x398/0x600 [zfs]
[Fri Feb 13 09:43:51 2026] ? mmp_write_uberblock+0x530/0x530 [zfs]
[Fri Feb 13 09:43:51 2026] thread_generic_wrapper+0x64/0x80 [spl]
[Fri Feb 13 09:43:51 2026] ? __thread_exit+0x20/0x20 [spl]
[Fri Feb 13 09:43:51 2026] kthread+0x12a/0x150
[Fri Feb 13 09:43:51 2026] ? set_kthread_struct+0x50/0x50
[Fri Feb 13 09:43:51 2026] ret_from_fork+0x22/0x30
[Fri Feb 13 09:43:51 2026] </TASK>
[Fri Feb 13 09:43:51 2026] INFO: task app:324826 blocked for more than 120 seconds.
[Fri Feb 13 09:43:51 2026] Tainted: P W O L 5.15.0-161-generic #171-Ubuntu
[Fri Feb 13 09:43:51 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Feb 13 09:43:51 2026] task:rocket-s3 state:D stack: 0 pid:324826 ppid:324760 flags:0x00000000
[Fri Feb 13 09:43:51 2026] Call Trace:
[Fri Feb 13 09:43:51 2026] <TASK>
[Fri Feb 13 09:43:51 2026] __schedule+0x24e/0x590
[Fri Feb 13 09:43:51 2026] ? neigh_hh_output+0xa1/0x120
[Fri Feb 13 09:43:51 2026] schedule+0x69/0x110
[Fri Feb 13 09:43:51 2026] cv_wait_common+0xf8/0x130 [spl]
[Fri Feb 13 09:43:51 2026] ? wait_woken+0x70/0x70
[Fri Feb 13 09:43:51 2026] __cv_wait+0x15/0x20 [spl]
[Fri Feb 13 09:43:51 2026] spa_config_enter+0xf9/0x120 [zfs]
[Fri Feb 13 09:43:51 2026] zfs_blkptr_verify+0x36d/0x4b0 [zfs]
[Fri Feb 13 09:43:51 2026] arc_read+0x122/0x15c0 [zfs]
[Fri Feb 13 09:43:51 2026] ? dbuf_rele_and_unlock+0x540/0x540 [zfs]
[Fri Feb 13 09:43:51 2026] ? __cond_resched+0x1a/0x60
[Fri Feb 13 09:43:51 2026] ? do_raw_spin_unlock+0x9/0x10 [zfs]
[Fri Feb 13 09:43:51 2026] ? dnode_block_freed+0xdd/0x150 [zfs]
[Fri Feb 13 09:43:51 2026] dbuf_read_impl.constprop.0+0x2f2/0x490 [zfs]
[Fri Feb 13 09:43:51 2026] dbuf_read+0x1ba/0x5b0 [zfs]
[Fri Feb 13 09:43:51 2026] ? dmu_buf_hold_noread+0xc3/0x110 [zfs]
[Fri Feb 13 09:43:51 2026] dmu_buf_hold+0x66/0xa0 [zfs]
[Fri Feb 13 09:43:51 2026] zap_lockdir+0x51/0xb0 [zfs]
[Fri Feb 13 09:43:51 2026] zap_cursor_retrieve+0x1c9/0x2c0 [zfs]
[Fri Feb 13 09:43:51 2026] ? verify_dirent_name+0x20/0x40
[Fri Feb 13 09:43:51 2026] ? filldir64+0x3e/0x190
[Fri Feb 13 09:43:51 2026] zfs_readdir+0x13f/0x480 [zfs]
[Fri Feb 13 09:43:51 2026] ? copy_from_kernel_nofault+0x22/0xf0
[Fri Feb 13 09:43:51 2026] ? bpf_probe_read_kernel+0x1d/0x50
[Fri Feb 13 09:43:51 2026] ? bpf_prog_b4699462f99fbc07_save_and_send_event+0x89/0xa10
[Fri Feb 13 09:43:51 2026] ? bpf_prog_8324fb2dad103fa5_cf_security_file_permission_fentry_1807+0x112/0x474
[Fri Feb 13 09:43:51 2026] ? aa_file_perm+0x127/0x2a0
[Fri Feb 13 09:43:51 2026] zpl_iterate+0x51/0x80 [zfs]
[Fri Feb 13 09:43:51 2026] iterate_dir+0x9f/0x1d0
[Fri Feb 13 09:43:51 2026] __x64_sys_getdents64+0x80/0x120
[Fri Feb 13 09:43:51 2026] ? __ia32_sys_getdents+0x120/0x120
[Fri Feb 13 09:43:51 2026] ? syscall_trace_enter.constprop.0+0x9d/0x1c0
[Fri Feb 13 09:43:51 2026] x64_sys_call+0xf63/0x1fa0
[Fri Feb 13 09:43:51 2026] do_syscall_64+0x56/0xb0
[Fri Feb 13 09:43:51 2026] ? putname+0x59/0x70
[Fri Feb 13 09:43:51 2026] ? do_sys_openat2+0x8b/0x160
[Fri Feb 13 09:43:51 2026] ? __x64_sys_openat+0x55/0x90
[Fri Feb 13 09:43:51 2026] ? arch_exit_to_user_mode_prepare.constprop.0+0x1e/0xc0
[Fri Feb 13 09:43:51 2026] ? syscall_exit_to_user_mode+0x41/0x80
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] ? clear_bhb_loop+0x60/0xb0
[Fri Feb 13 09:43:51 2026] entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[Fri Feb 13 09:43:51 2026] RIP: 0033:0x40758e
[Fri Feb 13 09:43:51 2026] RSP: 002b:000000c0016fda00 EFLAGS: 00000212 ORIG_RAX: 00000000000000d9
[Fri Feb 13 09:43:51 2026] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 000000000040758e
[Fri Feb 13 09:43:51 2026] RDX: 0000000000002000 RSI: 000000c00003e000 RDI: 0000000000000008
[Fri Feb 13 09:43:51 2026] RBP: 000000c0016fda40 R08: 0000000000000000 R09: 0000000000000000
[Fri Feb 13 09:43:51 2026] R10: 0000000000000000 R11: 0000000000000212 R12: 000000c0016fdb70
[Fri Feb 13 09:43:51 2026] R13: 00000000070009a0 R14: 000000c0012828c0 R15: 0000000000000003
[Fri Feb 13 09:43:51 2026] </TASK>
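To compare where each blocked task is stuck, the traces can be reduced to the innermost reliable `[zfs]` frame per task. The helper below is a rough sketch that assumes the log layout shown in this report (a `task:NAME state:X` header followed by `func+0xOFF/0xLEN [module]` frames, with `?` marking unreliable frames); `first_zfs_frame` is a hypothetical name, not an existing tool.

```python
import re

# Header line of a hung-task dump, e.g. "task:mmp state:D stack: 0 pid: 8608 ..."
TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)")
# Stack frame, e.g. "] spa_config_enter+0xf9/0x120 [zfs]" or "] ? wait_woken+0x70/0x70"
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)

def first_zfs_frame(lines):
    """Map each task name to its innermost non-speculative [zfs] frame."""
    blocked = {}
    task = None
    for line in lines:
        m = TASK_RE.search(line)
        if m:
            task = m.group(1)
            continue
        m = FRAME_RE.search(line)
        if m and task is not None and task not in blocked:
            speculative, func, module = m.groups()
            # Skip "?" frames (unreliable) and frames outside the zfs module.
            if not speculative and module == "zfs":
                blocked[task] = func
    return blocked

if __name__ == "__main__":
    import sys
    for name, func in first_zfs_frame(sys.stdin).items():
        print(f"{name}: {func}")
```

Fed the traces in this report, it should map txg_sync, mmp, and rocket-s3 to spa_config_enter and agents to dmu_tx_wait, which matches the picture of one lock holder waiting on the txg while everything else waits on the lock.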