Labels
Type: Defect (incorrect behavior, e.g. crash, hang)
Description
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu |
| Distribution Version | 24.04 LTS |
| Kernel Version | 6.8.0-64-generic |
| Architecture | x86_64 |
| OpenZFS Version | 2.3.3-1 |
Describe the problem you're observing
During a `zpool scrub` on a raidz1 pool with 8 SATA HDDs, the system experiences a kernel crash. The crash consistently occurs early in the scrub. The `txg_sync` thread then becomes blocked indefinitely, the system becomes partially unresponsive, and a hard reboot is required to recover.
Describe how to reproduce the problem
- Boot a system with ZFS 2.3.3 and Linux 6.8.0.
- Configure a pool with `raidz1` using 8 physical drives (WWN-based paths).
- Start a `zpool scrub` on the pool: `zpool scrub storage`
- Monitor with: `watch zpool status -v`
- After a few GB have been scanned, observe the system lock up and crash in the kernel.
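The steps above can be sketched as a single shell sequence (pool name `storage` taken from this report; the `emit` helper only prints each command, so the sketch can be reviewed before running it against a live pool):

```shell
#!/bin/sh
# Dry-run sketch of the reproduction sequence: emit() prints each command
# instead of executing it, so nothing here touches a live pool.
emit() { printf '%s\n' "$*"; }

emit zpool scrub storage      # start the scrub on the raidz1 pool
emit watch zpool status -v    # monitor scrub progress
emit journalctl -k -f         # follow kernel messages to catch the oops
```

On the real system, drop the `emit` prefix and run each command (as root) in a separate terminal.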
Include any warning/errors/backtraces from the system logs
🔧 Live kernel messages captured:

```
jul 21 13:07:03 gresint-server kernel: BUG: unable to handle page fault for address: 00007970a8dcc605
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: ? zio_vdev_io_done+0x6e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: ? zio_vdev_io_done+0x4e/0x240 [zfs]
jul 21 13:07:03 gresint-server kernel: zio_execute+0x94/0x170 [zfs]
jul 21 13:07:03 gresint-server kernel: ? __pfx_zio_execute+0x10/0x10 [zfs]
jul 21 13:07:03 gresint-server kernel: RIP: 0010:zio_vdev_io_done+0x6e/0x240 [zfs]
```
Additional Information
`zpool status -v`

```
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Jul 20 23:49:14 2025
        7.19T / 125T scanned at 1.13G/s, 2.96T / 125T issued at 474M/s
        884K repaired, 2.37% done, 3 days 02:55:32 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        storage                     ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x5000c500e8a8c504  ONLINE       0     0     9  (repairing)
            wwn-0x5000c500e8a8a1f6  ONLINE       0     0     4  (repairing)
            wwn-0x5000c500f6e5f0ab  ONLINE       0     0     7  (repairing)
            wwn-0x5000c500e8496d51  ONLINE       0     0     2  (repairing)
            wwn-0x5000c500f6ee1532  ONLINE       0     0     3  (repairing)
            wwn-0x5000c500e8b3a6f9  ONLINE       0     0     7  (repairing)
            wwn-0x5000c500e88ed746  ONLINE       0     0     8  (repairing)
            wwn-0x5000c500e8a8a0aa  ONLINE       0     0     6  (repairing)
        logs
          ubuntu-vg/slog-lv         ONLINE       0     0     0
        cache
          ubuntu--vg-l2arc--lv     ONLINE       0     0     0

errors: No known data errors
```
`zpool get all storage`

```
NAME     PROPERTY                       VALUE                 SOURCE
storage  size                           146T                  -
storage  capacity                       85%                   -
storage  altroot                        -                     default
storage  health                         ONLINE                -
storage  guid                           13753470766290521828  -
storage  version                        -                     default
storage  bootfs                         -                     default
storage  delegation                     on                    default
storage  autoreplace                    off                   default
storage  cachefile                      -                     default
storage  failmode                       wait                  default
storage  listsnapshots                  off                   default
storage  autoexpand                     on                    local
storage  dedupratio                     1.00x                 -
storage  free                           20.7T                 -
storage  allocated                      125T                  -
storage  readonly                       off                   -
storage  ashift                         12                    local
storage  comment                        -                     default
storage  expandsize                     -                     -
storage  freeing                        0                     -
storage  fragmentation                  36%                   -
storage  leaked                         0                     -
storage  multihost                      off                   default
storage  checkpoint                     -                     -
storage  load_guid                      4324850870775088562   -
storage  autotrim                       off                   default
storage  compatibility                  off                   default
storage  bcloneused                     0                     -
storage  bclonesaved                    0                     -
storage  bcloneratio                    1.00x                 -
storage  dedup_table_size               0                     -
storage  dedup_table_quota              auto                  default
storage  last_scrubbed_txg              0                     -
storage  feature@async_destroy          enabled               local
storage  feature@empty_bpobj            enabled               local
storage  feature@lz4_compress           active                local
storage  feature@multi_vdev_crash_dump  enabled               local
storage  feature@spacemap_histogram     active                local
storage  feature@enabled_txg            active                local
storage  feature@hole_birth             active                local
storage  feature@extensible_dataset     active                local
storage  feature@embedded_data          active                local
storage  feature@bookmarks              enabled               local
storage  feature@filesystem_limits      enabled               local
storage  feature@large_blocks           enabled               local
storage  feature@large_dnode            enabled               local
storage  feature@sha512                 enabled               local
storage  feature@skein                  enabled               local
storage  feature@edonr                  enabled               local
storage  feature@userobj_accounting     active                local
storage  feature@encryption             enabled               local
storage  feature@project_quota          active                local
storage  feature@device_removal         enabled               local
storage  feature@obsolete_counts        enabled               local
storage  feature@zpool_checkpoint       enabled               local
storage  feature@spacemap_v2            active                local
storage  feature@allocation_classes     enabled               local
storage  feature@resilver_defer         enabled               local
storage  feature@bookmark_v2            enabled               local
storage  feature@redaction_bookmarks    enabled               local
storage  feature@redacted_datasets      enabled               local
storage  feature@bookmark_written       enabled               local
storage  feature@log_spacemap           active                local
storage  feature@livelist               enabled               local
storage  feature@device_rebuild         enabled               local
storage  feature@zstd_compress          enabled               local
storage  feature@draid                  enabled               local
storage  feature@zilsaxattr             enabled               local
storage  feature@head_errlog            active                local
storage  feature@blake3                 enabled               local
storage  feature@block_cloning          enabled               local
storage  feature@vdev_zaps_v2           active                local
storage  feature@redaction_list_spill   enabled               local
storage  feature@raidz_expansion        enabled               local
storage  feature@fast_dedup             enabled               local
storage  feature@longname               enabled               local
storage  feature@large_microzap        enabled               local
```
SMART data (excerpt)
```
Getting all disks in the ZFS pool...
The following disks were found:
- wwn-0x5000c500e8a8c504
- wwn-0x5000c500e8a8a1f6
- wwn-0x5000c500f6e5f0ab
- wwn-0x5000c500e8496d51
- wwn-0x5000c500f6ee1532
- wwn-0x5000c500e8b3a6f9
- wwn-0x5000c500e88ed746
- wwn-0x5000c500e8a8a0aa
---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8c504 (/dev/sda):
---------------------------------------------------------
Device Model:            ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct:   0
Current_Pending_Sector:  0
Offline_Uncorrectable:   0
Temperature:             36°C
---------------------------------------------------------
SMART info for wwn-0x5000c500e8a8a1f6 (/dev/sdb):
---------------------------------------------------------
Device Model:            ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct:   0
Current_Pending_Sector:  0
Offline_Uncorrectable:   0
Temperature:             36°C
---------------------------------------------------------
SMART info for wwn-0x5000c500f6e5f0ab (/dev/sdc):
---------------------------------------------------------
Device Model:            ST20000NM007D-3DJ103
Serial Number:
SMART overall-health self-assessment test result: PASSED
Reallocated_Sector_Ct:   2
Current_Pending_Sector:  0
Offline_Uncorrectable:   0
Temperature:             36°C
⚠️ Reallocated sectors detected
-----------------------------------
... (truncated)
```
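The excerpt above comes from a local collection script; a minimal sketch of the per-disk loop it implies is below. The `smartctl -H -A` flags are standard smartmontools options; the helper only prints each invocation, since actually running them requires root and the real devices.

```shell
#!/bin/sh
# Build (and print) the smartctl invocation for each WWN-based by-id link
# in the pool; on the live system these lines would be executed as root.
smart_cmd() {
  printf 'smartctl -H -A /dev/disk/by-id/%s\n' "$1"
}

for wwn in wwn-0x5000c500e8a8c504 wwn-0x5000c500e8a8a1f6 \
           wwn-0x5000c500f6e5f0ab wwn-0x5000c500e8496d51; do
  smart_cmd "$wwn"
done
```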
Troubleshooting steps attempted:
- Tested with `zfs_deadman_failmode` set to `panic`, `wait`, and `continue`
- Adjusted `zfs_vdev_scrub_max_active` and `zfs_vdev_scrub_min_active`
- Monitored `journalctl -k` live during the scrub
- Verified all disks are SMART clean
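For reference, the tunables listed above are runtime module parameters under `/sys/module/zfs/parameters`. A minimal dry-run sketch of how they were set (the parameter names are real OpenZFS knobs; the values shown are illustrative, not recommendations):

```shell
#!/bin/sh
# Dry-run: print the sysfs writes instead of performing them, so the sketch
# can be reviewed before touching a live system (the real writes need root).
set_zfs_param() {
  printf 'echo %s > /sys/module/zfs/parameters/%s\n' "$2" "$1"
}

set_zfs_param zfs_deadman_failmode wait       # also tried: panic, continue
set_zfs_param zfs_vdev_scrub_max_active 2     # illustrative value
set_zfs_param zfs_vdev_scrub_min_active 1     # illustrative value
```

Changes made this way take effect immediately but do not persist across reboots; a `zfs` modprobe options file would be needed for that.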
Final Notes
This appears to be a kernel-space memory-access bug hit in the ZIO completion path (`zio_vdev_io_done`, per the backtrace above) under scrub load.
I'm available to test debug builds or apply custom patches if required.